Open-EO / openeo-aggregator

openEO driver that combines multiple other drivers
https://open-eo.github.io/openeo-aggregator/
Apache License 2.0

crossbackend feature in aggregator: use NetCDF instead of GTiff? #145

Open soxofaan opened 1 month ago

soxofaan commented 1 month ago

I noticed this while looking into the related issue https://github.com/Open-EO/openeo-geopyspark-driver/issues/786:

The crossbackend feature in the aggregator currently uses GTiff for the load_stac bridge:

https://github.com/Open-EO/openeo-aggregator/blob/129d4f27ebf762c737d9b5229b88b6b49d1d9610/src/openeo_aggregator/partitionedjobs/crossbackend.py#L133-L141
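To make the idea of the bridge concrete, here is a minimal sketch of what such a pair of process graph nodes could look like: one sub-job ends in `save_result` with a fixed output format, and the next sub-job starts with `load_stac` pointing at the results of the first. This is not the code at the link above. The helper name `build_bridge_nodes`, the node id and the example URL are hypothetical; only the `save_result` and `load_stac` process names and their `data`, `format` and `url` arguments come from the openEO process specifications.

```python
from typing import Any, Dict, Tuple

Node = Dict[str, Any]


def build_bridge_nodes(
    source_node_id: str,
    stac_url: str,
    output_format: str = "GTiff",
) -> Tuple[Node, Node]:
    """Hypothetical helper: build the two nodes that connect two sub-jobs.

    The first node closes the upstream sub-job by saving its result,
    the second node opens the downstream sub-job by loading that result
    through its STAC metadata.
    """
    save_node = {
        "process_id": "save_result",
        "arguments": {
            "data": {"from_node": source_node_id},
            "format": output_format,
        },
        "result": True,
    }
    load_node = {
        "process_id": "load_stac",
        "arguments": {"url": stac_url},
    }
    return save_node, load_node


# Example usage with made-up identifiers.
save_node, load_node = build_bridge_nodes(
    source_node_id="ndvi1",
    stac_url="https://example.test/jobs/job-123/results",
)
```

In this sketch, switching the bridge to NetCDF would only mean passing a different `output_format`; whether the downstream `load_stac` can read the result is the actual question of this issue.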

If I remember correctly, we picked GTiff at implementation time (March 2023) because it was a safe, widely supported choice and load_stac in openeo-geopyspark-driver had issues with NetCDF support back then.

We might want to revisit the situation: for example, automatically detect a better option, or let the user choose in some way?
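One way the "detect automatically or let the user choose" idea could work is sketched below. It assumes the aggregator knows which output formats the producing backend advertises and which formats the consuming backend is able to load (for example derived from each backend's file format listing, which may not fully reflect what `load_stac` accepts in practice). The function name `pick_bridge_format`, its parameters and the preference order are hypothetical, and matching format names case-insensitively is an assumption of this sketch.

```python
from typing import Iterable, Optional, Sequence


def pick_bridge_format(
    producer_output_formats: Iterable[str],
    consumer_input_formats: Iterable[str],
    preferred: Sequence[str] = ("NetCDF", "GTiff"),
    user_choice: Optional[str] = None,
) -> str:
    """Hypothetical helper: pick a format both sides of the bridge support.

    An explicit user choice is the only candidate when given;
    otherwise the `preferred` list is tried in order.
    """
    # Map lower-cased names to the producer's own spelling.
    produced = {name.lower(): name for name in producer_output_formats}
    consumed = {name.lower() for name in consumer_input_formats}

    candidates = [user_choice] if user_choice else list(preferred)
    for name in candidates:
        key = name.lower()
        if key in produced and key in consumed:
            return produced[key]
    raise ValueError(f"No common bridge format among {candidates!r}")


# Both backends support NetCDF: the preferred format is picked.
print(pick_bridge_format(["GTiff", "netCDF", "JSON"], ["GTiff", "NetCDF"]))
# The consumer only reads GTiff: fall back to the safe choice.
print(pick_bridge_format(["GTiff", "netCDF"], ["GTiff"]))
# The user forces GTiff even though NetCDF would be available.
print(pick_bridge_format(["GTiff", "netCDF"], ["GTiff", "NetCDF"], user_choice="GTiff"))
```

Raising an error when the user's choice is unsupported, rather than silently falling back, is a design choice of this sketch: it keeps the behaviour predictable for users who pick a format explicitly.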

jdries commented 1 month ago

I'm not really sure NetCDF would be better, especially because writing a single large NetCDF file is not so easy, whereas GeoTIFF output can be written as multiple files in parallel. The only other format with some potential for this use case is Zarr, again because it allows parallel writes.

soxofaan commented 1 month ago

A reason to prefer NetCDF is that it is more standardized for multidimensional cases (e.g. encoding the time dimension). With GTiff we encode the time dimension in a more ad-hoc way, which will not scale well if more backend implementations come into play.

But indeed, this is not an urgent matter at this time.

jdries commented 1 month ago

STAC + GeoTIFF can fully define a data cube with a time dimension in a standardized manner. In fact, the STAC metadata becomes more complicated for NetCDF with a time dimension. I've also seen other backends write NetCDF output in rather unexpected ways that we would probably not support on our side.
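As an illustration of the STAC + GeoTIFF point above, the sketch below builds one minimal STAC Item per GeoTIFF file, with the observation time carried in the standard `properties.datetime` field, so that a set of items describes the time dimension without any file-name conventions. The helper name `stac_item_for_geotiff`, the item ids, the hrefs and the bounding box are made up for the example, and the items are deliberately minimal (no projection or band metadata).

```python
from datetime import datetime, timezone
from typing import Any, Dict, Sequence


def stac_item_for_geotiff(
    item_id: str,
    href: str,
    when: datetime,
    bbox: Sequence[float],
) -> Dict[str, Any]:
    """Hypothetical helper: minimal STAC Item for one single-date GeoTIFF.

    `when` is expected to be a timezone-aware datetime.
    """
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": list(bbox),
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        # The time coordinate of this slice of the data cube.
        "properties": {
            "datetime": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
        "assets": {
            "data": {
                "href": href,
                "type": "image/tiff; application=geotiff",
                "roles": ["data"],
            },
        },
        "links": [],
    }


# Three made-up observation dates, one GeoTIFF each.
bbox = (5.0, 51.0, 5.1, 51.1)
dates = [datetime(2023, 3, day, tzinfo=timezone.utc) for day in (1, 6, 11)]
items = [
    stac_item_for_geotiff(
        item_id=f"result-{d:%Y%m%d}",
        href=f"https://example.test/results/result-{d:%Y%m%d}.tif",
        when=d,
        bbox=bbox,
    )
    for d in dates
]

# The time axis of the cube is just the sorted item datetimes.
time_axis = sorted(item["properties"]["datetime"] for item in items)
```

A reader of these items can reconstruct the time axis purely from the metadata, which is the standardized path described above; a single NetCDF asset covering all dates would need additional metadata to express the same information at the STAC level.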