Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

Move gdalinfo call to executor #948

Open EmileSonneveld opened 4 days ago

EmileSonneveld commented 4 days ago

Currently, gdalinfo is called on output assets in the driver. For GTiff output on S3, the assets were written on an executor and have to be downloaded again in the driver. With a fuse mount this happens implicitly; with direct S3 access it happens explicitly here: https://github.com/Open-EO/openeo-geopyspark-driver/blob/88ab283dde98209acbf47c5c4f10a1e92796ba85/openeogeotrellis/integrations/gdal.py#L177-L182

Moving gdalinfo to the executor and passing the info on would avoid this extra download.
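As a rough illustration of the idea (not the project's actual code), the sketch below runs `gdalinfo -json` where the file was written and returns only a small dictionary, so that the driver receives metadata instead of the asset itself. The helper names `run_gdalinfo` and `asset_metadata`, and the choice of fields to keep, are assumptions made for this example.

```python
import json
import subprocess
from typing import Callable, Dict


def run_gdalinfo(path: str) -> str:
    """Run the gdalinfo CLI on `path` and return its JSON output as text."""
    result = subprocess.run(
        ["gdalinfo", "-json", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def asset_metadata(path: str, runner: Callable[[str], str] = run_gdalinfo) -> Dict:
    """Hypothetical helper, meant to run on the executor that wrote `path`.

    `runner` is injectable so the parsing can be exercised without GDAL.
    """
    info = json.loads(runner(path))
    # Keep only a few small fields; the full gdalinfo output can be large.
    return {
        "path": path,
        "size": info.get("size"),
        "driver": info.get("driverShortName"),
        "band_count": len(info.get("bands", [])),
    }
```

If the asset paths were available as a PySpark RDD, something like `paths_rdd.map(asset_metadata).collect()` would bring back just these dictionaries. In this project the files are written from Scala code, so the real change would probably have to live there.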

This might avoid OOMs like this one: https://github.com/Open-EO/openeo-geopyspark-driver/issues/809. It would also have avoided this log deadlock: https://github.com/Open-EO/openeo-geopyspark-driver/issues/906

cc @jdries

jdries commented 4 days ago

This requires that the Scala code makes the gdalinfo call, and also that we have a way to pass the resulting metadata back to the driver. This could perhaps be achieved by assembling the STAC JSON files in the executors already.
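To make the last suggestion a bit more concrete, here is a minimal Python sketch of what "assembling the STAC JSON in the executor" could look like, starting from already parsed `gdalinfo -json` output. The function name `stac_item_from_gdalinfo`, the asset key `"data"`, and the set of fields are assumptions for illustration. The result is not a validated STAC item (for instance, `properties` would still need a `datetime`), and the real implementation would presumably be in Scala.

```python
from typing import Dict


def stac_item_from_gdalinfo(item_id: str, href: str, info: Dict) -> Dict:
    """Hypothetical helper: build a minimal STAC-like item from gdalinfo JSON."""
    # gdalinfo -json reports the footprint as a GeoJSON polygon under
    # "wgs84Extent" when it can reproject the raster extent.
    geometry = info.get("wgs84Extent")

    item = {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": geometry,
        "properties": {},  # a real item needs at least a datetime here
        "links": [],
        "assets": {
            "data": {
                "href": href,
                "type": "image/tiff; application=geotiff",
            }
        },
    }

    if geometry and geometry.get("type") == "Polygon":
        # Derive the bounding box from the outer ring of the polygon.
        ring = geometry["coordinates"][0]
        xs = [point[0] for point in ring]
        ys = [point[1] for point in ring]
        item["bbox"] = [min(xs), min(ys), max(xs), max(ys)]

    return item
```

An executor could write the returned dictionary next to the asset as a JSON file, and the driver would then only need to read these small files to assemble the job results.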