Open EmileSonneveld opened 4 days ago
This requires that the Scala code makes the gdalinfo call, and also that we have a way to pass the resulting metadata back to the driver. This could perhaps be achieved by already assembling the STAC JSON files in the executors.
Currently, gdalinfo is called on output assets in the driver. In the case of GTiff output on S3, the assets were written on an executor and need to be downloaded again in the driver. With a FUSE mount this happens implicitly; with direct S3 access it happens explicitly here: https://github.com/Open-EO/openeo-geopyspark-driver/blob/88ab283dde98209acbf47c5c4f10a1e92796ba85/openeogeotrellis/integrations/gdal.py#L177-L182
Moving gdalinfo to the executor and passing the info on would avoid this extra download.
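To illustrate the data flow, here is a rough Python sketch, not the actual implementation (in practice the call would have to happen in the Scala code that writes the asset). It assumes the `gdalinfo` CLI is available on the executor and that a small, JSON-serializable summary is enough for the driver. The names `run_gdalinfo`, `summarize` and `describe_asset`, and the selection of fields, are made up for the example.

```python
import json
import subprocess
from typing import Callable


def run_gdalinfo(path: str) -> dict:
    """Run the gdalinfo CLI on a locally readable path and parse its JSON output."""
    result = subprocess.run(
        ["gdalinfo", "-json", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def summarize(path: str, info: dict) -> dict:
    """Keep only a few small fields (hypothetical selection) to send to the driver."""
    return {
        "href": path,
        "size": info.get("size"),
        "geo_transform": info.get("geoTransform"),
        "bands": [
            {
                "band": b.get("band"),
                "type": b.get("type"),
                "nodata": b.get("noDataValue"),
            }
            for b in info.get("bands", [])
        ],
    }


def describe_asset(path: str, gdalinfo: Callable[[str], dict] = run_gdalinfo) -> dict:
    """Meant to run on the executor, right after the asset was written there."""
    return summarize(path, gdalinfo(path))


# On the driver, only the small dicts would be collected, not the rasters, e.g.:
#   metadata = sc.parallelize(written_paths).map(describe_asset).collect()
```

The `gdalinfo` parameter is only there so the function can be exercised without GDAL installed; the returned dicts could then be used to fill in the STAC items without the driver touching the files.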
This might avoid OOMs like this one: https://github.com/Open-EO/openeo-geopyspark-driver/issues/809 and it would have avoided this log deadlock: https://github.com/Open-EO/openeo-geopyspark-driver/issues/906
cc @jdries