Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

keep band names after aggregate_spatial #723

Closed VictorVerhaert closed 4 days ago

VictorVerhaert commented 3 months ago

aggregate_spatial now replaces band names with derived names (e.g. B02_mean -> percentile_approx(band_0, 0.5, 100000) when using median as the reducer). An alternative to fixing this issue is adding the rename_labels process to vector cubes, so users can manually rename the bands.

soxofaan commented 3 months ago

The aggregate_spatial process spec indeed does not mention changing band names, so we should not use these derived band names by default.

However, I think this does not happen in openeo-python-driver, but in openeo-geopyspark-driver or even in the geotrellis extensions.

VictorVerhaert commented 1 month ago

In the job result metadata, the original band names are still present. My test case exported to CSV, where the column names are derived. job_id: j-240507e5933c471bb7f06896f454e533 on CDSE.

jdries commented 4 weeks ago

Note that you can only keep the band names when only one statistic is computed. In the case of computing multiple stats, we are forced to create new band names, to keep them unique.

Code that sets current band names is here:
https://github.com/Open-EO/openeo-geotrellis-extensions/blob/e9ebf2118b77220d228cbf336656929b8eeac753/openeo-geotrellis/src/main/scala/org/openeo/geotrellis/aggregate_polygon/AggregatePolygonProcess.scala#L323
https://github.com/Open-EO/openeo-geotrellis-extensions/blob/e9ebf2118b77220d228cbf336656929b8eeac753/openeo-geotrellis/src/main/scala/org/openeo/geotrellis/aggregate_polygon/AggregatePolygonProcess.scala#L219

Changing it there would change it in the CSV file generated by Spark, which, in the case of CSV output, is also passed on as-is, I believe. The metadata on the Python side still holds the previous band names, so the vector cube still knows the old names.
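To illustrate the uniqueness constraint mentioned above, here is a minimal sketch in plain Python. It is not the driver or Geotrellis code; the helper name `duplicated_names` is hypothetical. It shows that reusing the original band name for every statistic only yields unique column names when a single reducer is used.

```python
from collections import Counter


def duplicated_names(band_names, reducers):
    """Return the band names that would appear more than once if every
    statistic simply reused the original band name (hypothetical helper)."""
    # Naive labelling: one output column per (band, reducer) pair,
    # each labelled with the band name only.
    naive = [band for band in band_names for _ in reducers]
    counts = Counter(naive)
    return sorted(name for name, n in counts.items() if n > 1)


# One statistic: every name stays unique, so the names can be kept.
print(duplicated_names(["B01", "B02"], ["mean"]))
# Two statistics: each band name would be used twice.
print(duplicated_names(["B01", "B02"], ["mean", "min"]))
```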

jdries commented 4 weeks ago

I added a branch showing a fix for csv. The issue may however not be fully accurate, in the sense that it is output format specific:

csv -> band names from spark
netcdf -> tries to use band names from python metadata
parquet -> constructs own band names here: https://github.com/Open-EO/openeo-python-driver/blob/413c7c0a7070590dce491102f4f0a1cc5b8c0ce2/openeo_driver/save_result.py#L568

jdries commented 4 weeks ago

Committed a fix for parquet, which is normally the most relevant format for worldcereal.

jdries commented 3 weeks ago

@VincentVerelst can you do a check for parquet files on openeo-dev or cdse staging? Should be better for that file format.

VincentVerelst commented 3 weeks ago

@jdries confirmed that the resulting parquet files now retain the original band names on CDSE staging.

jdries commented 1 week ago

I have now also improved this for CSV (still in the build pipelines). When only one reducer is specified, band names are preserved as-is (the most common case). With multiple reducers, the names will look something like mean(B01), min(B01), mean(B02), min(B02), which is still an improvement over the previous approach. rename_labels after aggregate_spatial does not yet have an effect on CSV output; that only works for parquet and netcdf for now.
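The naming rule described above can be sketched roughly as follows. This is a plain-Python illustration, not the actual Scala implementation in the geotrellis extensions; the function name `output_band_names` and the exact label format are assumptions based on the example names in this comment.

```python
def output_band_names(band_names, reducers):
    """Sketch of the described rule (hypothetical helper, not the real code):
    keep band names with a single reducer, otherwise build
    'reducer(band)' labels so that every column name stays unique."""
    if len(reducers) == 1:
        # Most common case: band names are preserved as-is.
        return list(band_names)
    # Multiple reducers: band-major order, matching the example
    # mean(B01), min(B01), mean(B02), min(B02).
    return [f"{reducer}({band})" for band in band_names for reducer in reducers]


print(output_band_names(["B01", "B02"], ["mean"]))
print(output_band_names(["B01", "B02"], ["mean", "min"]))
```

The band-major ordering is taken from the example in the comment; the real output may order or format the labels differently.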