Open-EO / openeo-geotrellis-extensions

Java/Scala extensions for Geotrellis, for use with OpenEO GeoPySpark backend.
Apache License 2.0

isSparse calculation takes too long for small RDDs #186

Closed. JeroenVerstraelen closed this issue 1 year ago

JeroenVerstraelen commented 1 year ago

In the Spark UI we can see that, even for very small RDDs, the stage related to createPartitioner can sometimes take up to 100 seconds (20 seconds of that in garbage collection). We should add a check on the number of keys to be processed, for example whether there are fewer than 10. If that's the case, we can automatically use the SpaceTimeByMonthPartitioner.
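A minimal sketch of the proposed check, assuming the keys are available as a plain `Iterable` on the driver. The class `PartitionerChoice`, the enum `Strategy`, and the method `choose` are hypothetical names for illustration, not the project's API; the enum values only stand in for the two code paths (the default SpaceTimeByMonthPartitioner versus the more expensive sparse calculation).

```java
// Hypothetical sketch: decide on a partitioning strategy without scanning all keys.
class PartitionerChoice {
    // Stand-ins for the two code paths discussed in this issue.
    enum Strategy { SPACE_TIME_BY_MONTH, SPARSE }

    // Returns SPACE_TIME_BY_MONTH when there are fewer than `threshold` keys.
    // Counting stops as soon as the threshold is reached, so a large key set
    // is never traversed completely.
    static Strategy choose(Iterable<?> keys, int threshold) {
        int seen = 0;
        for (Object ignored : keys) {
            seen++;
            if (seen >= threshold) {
                return Strategy.SPARSE;
            }
        }
        return Strategy.SPACE_TIME_BY_MONTH;
    }
}
```

With a threshold of 10, a collection of 9 keys takes the cheap path and a collection of 10 keys does not, which matches the "fewer than 10 keys" wording above.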

JeroenVerstraelen commented 1 year ago

Issue #183 could be related.

jdries commented 1 year ago

The proposed fix is implemented, which reduces the number of jobs on tsservice a lot. I do believe the number of partitions is now higher, and partitioning is not optimal, with lots of empty tasks. We can probably derive a sparse partitioner based on dates gathered from the catalog, without requiring a Spark job.
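One way the idea of a date-based sparse partitioner could look, as a rough sketch only: it assumes the catalog query already yields a collection of `LocalDate` values on the driver, and it ignores the spatial part of the key entirely. `DatePartitionIndex` and its methods are hypothetical names, not part of the project or of Spark.

```java
import java.time.LocalDate;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: one partition per distinct date that actually occurs,
// computed on the driver from catalog dates instead of from a Spark job.
class DatePartitionIndex {
    private final Map<LocalDate, Integer> indexByDate = new HashMap<>();

    DatePartitionIndex(Collection<LocalDate> catalogDates) {
        // TreeSet removes duplicates and sorts, so indices follow date order.
        for (LocalDate date : new TreeSet<>(catalogDates)) {
            indexByDate.put(date, indexByDate.size());
        }
    }

    // Only dates present in the catalog get a partition, so none is empty.
    int numPartitions() {
        return indexByDate.size();
    }

    int partitionFor(LocalDate date) {
        Integer index = indexByDate.get(date);
        if (index == null) {
            throw new IllegalArgumentException("Date not in catalog: " + date);
        }
        return index;
    }
}
```

Because the index is built only from dates that occur, the partition count equals the number of distinct dates, which is what would avoid the empty tasks mentioned above. A real partitioner would also need to take the spatial key into account.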

jdries commented 1 year ago

done