@pomadchin, thanks for the quick fix.
I will test it today or tomorrow.
On a 32 GB machine, giving 30 GB to the executor, all the memory is consumed for a 2.7 GB LAZ file (500M points) and then it exits with an error, i.e., the executor runs out of resources. I need to debug to understand the real reason the operation is aborted.
The output in our Jupyter notebook:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 145.100.58.117, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
@romulogoncalves Have you tried it on some small, regular LAZ / LAS files? I'm also wondering how PDAL itself would handle such big files. It also looks like you need to consider using PDAL filters to reduce the amount of data loaded into JVM memory (if you're not doing it already) and to select dimensions explicitly (to load only the important data into memory):
HadoopPointCloudRDD(
  path,
  HadoopPointCloudRDD.Options.DEFAULT.copy(dimTypes = Option(List("X", "Y", "Z")))
)
@pomadchin Yes, I started testing with a small file, then with a directory of 4 files where the biggest one was 1.2 GB, and it worked. Then I decided to test with a large file, and that is where I get the crash. I just realized that you pushed a commit after my comment this afternoon, so I decided to pull again and start from scratch.
Btw, I am using the branch:
git checkout feature/outputstreamformat
then I run sbt assembly
and use the produced jar.
To read the data I do:
val pipelineExpr = Read("local") ~ CropFilter(polygon = polygon_str) ~ HagFilter()
val rdd_laz = HadoopPointCloudRDD(laz_path, options = HadoopPointCloudRDD.Options(pipeline = pipelineExpr, tmpDir = tmpDir_str))
The crop reduces the number of points by 50%, but maybe I need to reduce it even more. I only need X, Y, Z, and Classification. Do you advise requesting only dimTypes = Option(List("X", "Y", "Z", "Classification"))?
@romulogoncalves Yes, it will load only 4 dimensions into JVM memory; there is a chance that it would work (as it would use much less JVM memory). However, your experience and aims indeed make the question https://github.com/geotrellis/geotrellis-pointcloud/issues/12 very reasonable.
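For illustration, a minimal sketch of what that could look like, combining the crop/HAG pipeline with the reduced dimension list through the same Options fields used in this thread (pipeline, tmpDir, dimTypes); the exact combination is an assumption to verify against the branch:
// Hedged sketch: pass the PDAL pipeline together with an explicit dimension list,
// so only X, Y, Z and Classification reach JVM memory.
val pipelineExpr = Read("local") ~ CropFilter(polygon = polygon_str) ~ HagFilter()
val options = HadoopPointCloudRDD.Options(
  pipeline = pipelineExpr,
  tmpDir   = tmpDir_str,
  dimTypes = Option(List("X", "Y", "Z", "Classification"))
)
val rdd_laz = HadoopPointCloudRDD(laz_path, options)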
@pomadchin I decided to crop a smaller area and then it all went fine. It is now able to read large LAZ files and crop them before loading the points into the JVM. The only issue is performance: only a single core is doing the job, and we are not able to manage a large amount of points, i.e., I need to crop a small area for it to work.
We think geotrellis-pointcloud should be able to read this file: after cropping 50% of the points, i.e., 250M points, storing them as 3 doubles and 1 short per point takes around 7 GB of memory, which is not much for a 30 GB executor. Hence, no idea why we run out of memory. LASzip might be the reason, it often takes a lot of resources; I will test the read with lasPerf.
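As a rough sanity check of that estimate (assuming 8-byte doubles for X/Y/Z, a 2-byte value for Classification, and ignoring JVM object overhead):
// Back-of-the-envelope memory estimate for 250M points with 4 dimensions.
val points = 250e6
val bytesPerPoint = 3 * 8 + 2                 // X, Y, Z as doubles + Classification as a short = 26 bytes
val gib = points * bytesPerPoint / math.pow(1024, 3)
println(f"$gib%.1f GiB")                      // ~6.1 GiB of raw payload, so a 30 GB executor should be plenty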
Anyway, this bug was about being able to read large LAZ files and it works.
The problem with the current BinaryFileReader is that it copies the data into an Array[Byte]; we could read everything as a stream to work around this problem. Not sure how PDAL handles it and how it would work with the PDAL Java in-memory classes => it requires some testing. @romulogoncalves can you check this solution?
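For what the streaming approach might look like, a minimal, hypothetical sketch using plain Hadoop APIs (the helper name and how PDAL's in-memory classes would consume the stream are assumptions, not code from this repo):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

// Hypothetical sketch: open the file as a seekable stream instead of
// materializing the whole file as an Array[Byte] on the JVM heap.
def withPointCloudStream[T](path: Path, conf: Configuration)(f: FSDataInputStream => T): T = {
  val fs = FileSystem.get(path.toUri, conf)
  val in = fs.open(path)                 // streams bytes on demand; no full copy
  try f(in) finally in.close()
}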
Closes #10