aggregate_spatial with single large geometry: bad partitioning

Open-EO / openeo-geotrellis-extensions

Java/Scala extensions for Geotrellis, for use with OpenEO GeoPySpark backend.

Apache License 2.0


aggregate_spatial with single large geometry: bad partitioning #326

Closed: jdries closed this issue 1 month ago

jdries commented 1 month ago

A user is running an aggregate_spatial over time, but with only one feature. At some point the RDD size appears to be only 13MB, so Spark decides that one partition is sufficient. That turns out not to be the case: even with 6GB of executor memory there is a lot of GC, and the task takes forever.

jdries commented 1 month ago

It looks like .coalesce(1) always forces a single partition at the end, regardless of other settings such as the tunables in the Spark conf. A potential solution is therefore to allow more output files, preferably in a better format.
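
To illustrate the suspicion about .coalesce(1), here is a minimal sketch against plain Spark RDDs. It is not the backend's actual code: the object name `CoalesceSketch`, the helper `partitionCounts`, the toy `stats` RDD and the partition counts 8 and 4 are all made up for the example. It only relies on standard Spark behaviour: a plain `coalesce(1)` adds no shuffle, so the upstream work can end up in a single task as well, whereas `coalesce(1, shuffle = true)` or `repartition(n)` introduces a shuffle boundary and leaves upstream parallelism alone.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns partition counts for (plain coalesce, shuffled coalesce, repartition).
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Hypothetical stand-in for the per-date aggregation results, spread over 8 partitions.
    val stats = sc.parallelize(1 to 1000, numSlices = 8).map(i => (i, i.toDouble))

    // No shuffle: narrow dependency, so the upstream map can run in one task too.
    val collapsed = stats.coalesce(1)

    // Shuffle boundary: the upstream map keeps its 8 tasks, only the last stage is single.
    val shuffled = stats.coalesce(1, shuffle = true)

    // Several output partitions, which means several output files when saved.
    val spread = stats.repartition(4)

    (collapsed.getNumPartitions, shuffled.getNumPartitions, spread.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) // prints (1,1,4)
    finally sc.stop()
  }
}
```

Whether this is what actually hurts in the aggregate_spatial job would have to be checked in the Spark UI; the sketch only shows the general mechanism.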

jdries commented 1 month ago

It actually works better now; nodata filtering did it.