Open-EO / openeo-geotrellis-extensions

Java/Scala extensions for GeoTrellis, for use with the OpenEO GeoPySpark backend.
Apache License 2.0

readperproduct triggers unnecessary stages #313

Closed by jdries 3 weeks ago

jdries commented 1 month ago

To partition the RDD by source name, the distinct keys are first retrieved from the RDD, but this triggers a full computation, which is then partially repeated later on because not all stages are reusable. In one particular job, computing those distinct keys appears to take a full CPU hour (summed across all tasks).

An extra benefit of avoiding this is fewer Spark jobs and stages, which simplifies the UI a bit.
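
To illustrate the general idea (not the change that was actually made in this repository), here is a minimal plain-Java sketch of assigning records to partitions by hashing the source name. Because the partition index is derived from the key alone, no preliminary pass is needed to collect the distinct keys. The class name `SourceNamePartitioner`, its methods, and the example file names are all hypothetical; in Spark, a similar effect can probably be had with the built-in `org.apache.spark.HashPartitioner`, at the cost of several sources possibly sharing one partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not code from openeo-geotrellis-extensions.
final class SourceNamePartitioner {
    private final int numPartitions;

    SourceNamePartitioner(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // The partition index depends only on the key itself, so no lookup
    // table of distinct source names has to be computed beforehand.
    int partitionFor(String sourceName) {
        return Math.floorMod(sourceName.hashCode(), numPartitions);
    }

    // Groups the records in a single pass over the input.
    Map<Integer, List<String>> assign(List<String> sourceNames) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String name : sourceNames) {
            partitions.computeIfAbsent(partitionFor(name), p -> new ArrayList<>()).add(name);
        }
        return partitions;
    }

    public static void main(String[] args) {
        SourceNamePartitioner partitioner = new SourceNamePartitioner(4);
        // Assumed example source names; records of the same source land together.
        List<String> names = List.of("productA.tif", "productB.tif", "productA.tif");
        System.out.println(partitioner.assign(names));
    }
}
```

The trade-off in this sketch is that a hash-based assignment does not guarantee exactly one source per partition, so whether it fits depends on how strictly the downstream read needs that property.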

jdries commented 1 month ago

Made a few improvements:

Detected other things:

Need to check: