eEcoLiDAR / infrastructure

Repository where all setup information about the infrastructure and examples are stored.
Apache License 2.0

Tmp directory gets full quickly when geotrellis-pointcloud reads entire directory #8

Open romulogoncalves opened 6 years ago

romulogoncalves commented 6 years ago

Geotrellis-pointcloud uses the system's /tmp/ directory to download the LAZ files it needs to execute a pipeline. Until we reduce the number of files to be downloaded, we need to set this path to something else.

Solution:

// Override the default /tmp/ location with a directory on the local data disk
val tmpDir_str: Option[String] = Option("/data/local/spark/tmp/")

// Pass the custom tmp directory together with the pipeline expression when reading the LAZ files
val rdd_laz = HadoopPointCloudRDD(laz_path, options = HadoopPointCloudRDD.Options(pipeline = pipelineExpr, tmpDir = tmpDir_str))
romulogoncalves commented 6 years ago

Another tuning point is to have one task per executor/worker. To achieve that, each task must be set to take as many cores as the worker has, which is done with spark.task.cpus.

spark.task.cpus (default: 1): Number of cores to allocate for each task.
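
A minimal sketch of how the two settings could be combined so that one task claims a whole executor; the 16-core worker and the app name are illustrative assumptions, not values from this issue:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("pointcloud-pipeline")   // illustrative app name
  .set("spark.executor.cores", "16")   // assumed: each worker/executor has 16 cores
  .set("spark.task.cpus", "16")        // each task claims all 16 cores, so one task per executor

val sc = new SparkContext(conf)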

Another solution is to create one partition per LAS/LAZ file. An executor will then only run one partition at a time.
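
A minimal sketch of one way to approximate this, assuming laz_path is the input directory used above and sc is the SparkContext; the file-counting helper is illustrative and not part of the geotrellis-pointcloud API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Count the LAS/LAZ files under the input directory (illustrative helper).
val fs = FileSystem.get(sc.hadoopConfiguration)
val numFiles = fs.listStatus(new Path(laz_path.toString))
  .count(f => f.getPath.getName.toLowerCase.endsWith(".laz") || f.getPath.getName.toLowerCase.endsWith(".las"))

// Repartition so the number of partitions matches the number of input files;
// combined with spark.task.cpus above, an executor then works on one partition at a time.
val rdd_per_file = rdd_laz.repartition(numFiles)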