Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

missing dates in load_stac of agera5_daily #802

Open bossie opened 2 weeks ago

bossie commented 2 weeks ago

Reported by Darius C.

This simple process graph for a year of daily observations should yield a netCDF covering the entire time dimension (364 dates, since the upper temporal bound is excluded), but the result only has 20. They seem to correspond to the first page returned by the STAC API, i.e. [2021-10-30, 2021-10-11]; the STAC API returns the items from newest to oldest.

{
  "process_graph": {
    "loadstac1": {
      "process_id": "load_stac",
      "arguments": {
        "bands": [
          "2m_temperature_mean",
          "total_precipitation"
        ],
        "spatial_extent": {
          "west": 664000,
          "south": 5611120,
          "east": 665000,
          "north": 5612120,
          "crs": "EPSG:32631",
          "srs": "EPSG:32631"
        },
        "temporal_extent": [
          "2020-11-01",
          "2021-10-31"
        ],
        "url": "https://stac.openeo.vito.be/collections/agera5_daily"
      }
    },
    "saveresult1": {
      "process_id": "save_result",
      "arguments": {
        "data": {
          "from_node": "loadstac1"
        },
        "format": "NetCDF",
        "options": {
          "format": "NetCDF"
        }
      },
      "result": true
    }
  }
}
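
The expected and observed date counts can be checked with a little date arithmetic. This is only a sketch: the page size of 20 is an assumption inferred from the 20 dates found in the netCDF, and the variable names are made up for illustration.

```python
from datetime import date, timedelta

start = date(2020, 11, 1)
end = date(2021, 10, 31)  # upper temporal bound, excluded

# One date per day in the half-open interval [start, end).
expected_dates = [start + timedelta(days=i) for i in range((end - start).days)]
print(len(expected_dates))  # 364
print(expected_dates[0], expected_dates[-1])  # 2020-11-01 2021-10-30

# Assumed page size of 20, with items sorted from newest to oldest.
first_page = expected_dates[::-1][:20]
print(first_page[0], first_page[-1])  # 2021-10-30 2021-10-11
```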

bossie commented 2 weeks ago

The time coordinates in the netCDF indeed correspond directly to the page size requested from the STAC API; if I set it to e.g. limit=7, 7 dates end up in the netCDF, even though I can see 364 items are being passed on to the FileLayerProvider.

Manipulating the limit does allow me to narrow the time range a lot and still reproduce the bug.
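
For reference, a client typically collects items across pages by following the rel="next" link of each response, so the total number of items should not depend on limit. The sketch below is not the driver's actual code: `iter_items`, the injectable `fetch` callable and the in-memory `pages` are hypothetical, and only GET-style next links with an `href` are handled.

```python
from typing import Callable, Iterator, Optional


def iter_items(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item of a paged search, following rel="next" links."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield from page.get("features", [])
        # Stop when the page carries no "next" link.
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )


# Hypothetical in-memory pages standing in for HTTP responses.
pages = {
    "page1": {
        "features": [{"id": "a"}, {"id": "b"}],
        "links": [{"rel": "next", "href": "page2"}],
    },
    "page2": {"features": [{"id": "c"}], "links": []},
}

ids = [item["id"] for item in iter_items("page1", pages.__getitem__)]
print(ids)  # ['a', 'b', 'c']
```

In a real client, `fetch` would perform the HTTP request and return the parsed JSON body.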

bossie commented 2 weeks ago

The root cause is a quirk of the STAC catalog: it only returns proj: metadata (i.e. proj:epsg, proj:bbox and proj:shape) for the first page of results; items on subsequent pages lack them.

For example:

https://stac.openeo.vito.be/search?limit=7&bbox=5.318868004541495%2C50.628576059801816%2C5.3334400271343725%2C50.637843899562576&datetime=2020-11-01T00%3A00%3A00Z%2F2020-11-10T23%3A59%3A59.999000Z&collections=agera5_daily&fields=%2Bproperties.proj%3Abbox%2C%2Bproperties.proj%3Aepsg%2C%2Bproperties.proj%3Ashape

with a page size of 7 will return 7 items with proj: metadata, whereas the same query with a page size of 3:

https://stac.openeo.vito.be/search?limit=3&bbox=5.318868004541495%2C50.628576059801816%2C5.3334400271343725%2C50.637843899562576&datetime=2020-11-01T00%3A00%3A00Z%2F2020-11-10T23%3A59%3A59.999000Z&collections=agera5_daily&fields=%2Bproperties.proj%3Abbox%2C%2Bproperties.proj%3Aepsg%2C%2Bproperties.proj%3Ashape

will return only 3 items with proj: metadata, out of the same 10 items.
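
A quick way to verify this on a set of collected items is to count how many carry all three proj: properties. This is a minimal sketch: the helper names and the sample items (including their proj: values) are invented for illustration and do not come from the catalog.

```python
PROJ_KEYS = ("proj:epsg", "proj:bbox", "proj:shape")


def missing_proj_keys(item: dict) -> list:
    """Return the proj: properties that are absent from a STAC item."""
    props = item.get("properties", {})
    return [key for key in PROJ_KEYS if key not in props]


def count_with_proj(items: list) -> int:
    """Count the items that carry all three proj: properties."""
    return sum(1 for item in items if not missing_proj_keys(item))


# Made-up sample: 3 items with proj: metadata, 7 without (page size of 3).
with_proj = {
    "datetime": "2020-11-10T00:00:00Z",
    "proj:epsg": 4326,
    "proj:bbox": [-180.0, -90.0, 180.0, 90.0],
    "proj:shape": [1801, 3600],
}
without_proj = {"datetime": "2020-11-07T00:00:00Z"}
items = [{"properties": with_proj}] * 3 + [{"properties": without_proj}] * 7

print(count_with_proj(items))       # 3
print(missing_proj_keys(items[-1]))  # ['proj:epsg', 'proj:bbox', 'proj:shape']
```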

I'm not sure yet at which point OpenEO decides to only consider those items with proj: metadata.

bossie commented 2 weeks ago

The load_stac implementation seems to assume that all assets either have proj: metadata, or none of them do: TBC.
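
One way to make such an assumption explicit is to classify the collected items before using them. The function below is a hypothetical diagnostic, not part of load_stac; it only looks at proj:epsg as a stand-in for the whole proj: group.

```python
def proj_coverage(items: list) -> str:
    """Classify items as 'all', 'none' or 'mixed' with respect to proj:epsg."""
    flags = ["proj:epsg" in item.get("properties", {}) for item in items]
    if not flags or not any(flags):
        return "none"
    if all(flags):
        return "all"
    return "mixed"


# Made-up items: the first page has proj: metadata, the second does not.
page1 = [{"properties": {"proj:epsg": 4326}}] * 3
page2 = [{"properties": {}}] * 7

print(proj_coverage(page1))          # all
print(proj_coverage(page2))          # none
print(proj_coverage(page1 + page2))  # mixed
```

A "mixed" result is the situation described above, where neither branch of an all-or-none assumption fits.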

bossie commented 2 weeks ago

@StijnCaerts could you look into the issue in https://stac.openeo.vito.be where page size has an influence on the STAC items themselves, as described above?

StijnCaerts commented 1 week ago

Seems like an issue with the URL encoding of the fields parameter. We already fixed this upstream (PR: https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/issues/213), but we still have to upgrade our instances, which is being worked on right now.
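
As general background, and not necessarily the exact upstream bug: a literal `+` in a query string is decoded as a space by form-style parsers, so an include prefix like `+properties.proj:epsg` has to be sent as `%2B...` and handled consistently on both sides. A small sketch with the standard library:

```python
from urllib.parse import parse_qs, quote

field = "+properties.proj:epsg"

raw = "fields=" + field                      # "+" left unencoded
encoded = "fields=" + quote(field, safe="")  # "+" becomes %2B, ":" becomes %3A

print(parse_qs(raw))      # {'fields': [' properties.proj:epsg']}
print(encoded)            # fields=%2Bproperties.proj%3Aepsg
print(parse_qs(encoded))  # {'fields': ['+properties.proj:epsg']}
```

In the unencoded case the leading `+` is lost and the value starts with a space instead.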

bossie commented 1 week ago

@StijnCaerts would it make sense, or is it possible, to simply return all item properties if no fields param is specified? I couldn't find a clear answer on which fields param a client (of a compliant STAC API) should provide to get back:

StijnCaerts commented 1 week ago

@bossie There is a related issue here: https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/issues/217

To get all the properties, you can use fields=+properties
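
A search URL with that parameter could be built as follows. This is only a sketch: the endpoint and collection are taken from the queries earlier in this thread, while the limit value is an arbitrary choice for illustration.

```python
from urllib.parse import urlencode

params = {
    "collections": "agera5_daily",
    "limit": 7,  # arbitrary page size
    "fields": "+properties",  # request all item properties
}

# urlencode percent-encodes the leading "+" as %2B.
url = "https://stac.openeo.vito.be/search?" + urlencode(params)
print(url)
# https://stac.openeo.vito.be/search?collections=agera5_daily&limit=7&fields=%2Bproperties
```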

jdries commented 1 week ago

@StijnCaerts in the followup, they seem to mention two relevant things:

Is it an option to update the config to return more fields by default? That could prevent issues later when we want to upgrade to the new version. Adding parameters is possible, but then we end up with things like 'if terrascope then add custom parameter'.