Open-EO / openeo-opensearch-client

Simple opensearch client for openeo.
Apache License 2.0

tileId pattern matches unwanted tiles #32

Closed EmileSonneveld closed 4 months ago

EmileSonneveld commented 8 months ago

Because we can only use asterisk wildcards on metadata, too many tiles can pass the filter. As Marcel says: 26S*H will not only find the needed 26SKH and 26SLH tiles, but also the unneeded 26SHx tiles.

We could implement the "?" wildcard to match a single character. For example: 26S?H. Or maybe allow for more complex matching by using something like filter_labels on product metadata.
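To make the proposed "?" semantics concrete, here is a minimal sketch that uses Python's standard `fnmatch` module as a stand-in. It only illustrates single-character matching; the catalogue's own wildcard handling may well differ, and the tile IDs in the list are hypothetical test values.

```python
from fnmatch import fnmatchcase

# Hypothetical tile IDs, chosen only to exercise the pattern.
tile_ids = ["26SKH", "26SLH", "26SHA", "26SKG"]

# "?" stands for exactly one character, so the pattern pins both the
# "26S" prefix and the trailing "H".
matched = [t for t in tile_ids if fnmatchcase(t, "26S?H")]
print(matched)
```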

Related: https://github.com/Open-EO/openeo-opensearch-client/issues/25

mbuchhorn commented 8 months ago

Besides adapting the wildcards or property filters, it might be good if the load_collection process could also handle a list of given tiles. For example, sometimes the AOI lies between two UTM zones but still only two Sentinel-2 tile IDs are needed. So instead of a tileId with a wildcard, it would be better to just pass a list of the specific tile names.

JeroenVerstraelen commented 4 months ago

Would be good if this also works on CDSE

JeroenVerstraelen commented 4 months ago

For each 20x20 km processing box we have at most 4 tile IDs (most of the time 1 or 2). It would be better if we could give it a list of tile IDs instead of a wildcard.

mbuchhorn commented 4 months ago

You can find an example of the 20x20 km processing boxes with their corresponding Sentinel-2 tile IDs here: https://git.vito.be/projects/NCA/repos/extentmapping/browse/src/extentmapping/resources/LAEA-20km_add-info.gpkg

I will implement this extra filter in the properties part here: https://git.vito.be/projects/NCA/repos/extentmapping/browse/src/extentmapping/openeo/preprocessing.py#288

jdries commented 4 months ago

The key point here is to find an openEO-compatible way to specify the property filter, for example:

lambda p: array_contains(data=["26SKH", "26SLH"], value=p)

Note that wildcard matching is not part of the spec, but we can do things like:

lambda p: text_begins(data=p, pattern="26S")
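For readers less familiar with these processes, the two filters can be read as the plain-Python predicates below. This is only a local sketch of the intended semantics, not an openEO process graph; the function names `in_tile_list` and `in_grid_zone` are made up for illustration.

```python
wanted_tiles = ["26SKH", "26SLH"]


def in_tile_list(tile_id: str) -> bool:
    # Local stand-in for array_contains(data=wanted_tiles, value=tile_id):
    # true only for the explicitly listed tiles.
    return tile_id in wanted_tiles


def in_grid_zone(tile_id: str) -> bool:
    # Local stand-in for text_begins(data=tile_id, pattern="26S"):
    # true for every tile whose ID starts with "26S".
    return tile_id.startswith("26S")
```

The list-based predicate is the stricter of the two: it rejects any "26S" tile that is not explicitly named.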

bossie commented 4 months ago

This should work on Terrascope:

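from openeo.processes import array_contains

# `connection` is assumed to be an authenticated openeo.Connection to the Terrascope backend.
# Keep only products whose tileId is one of the listed Sentinel-2 tiles.
properties = {"tileId": lambda tile_id: array_contains(["31UES", "31UFS"], tile_id)}

data_cube = (connection
             .load_collection("SENTINEL2_L2A", properties=properties)
             .filter_bbox(west=4.4158740490713804, south=51.4204485519121945, east=4.4613941769140322, north=51.4639210615473885)
             .filter_temporal(["2024-04-24", "2024-04-25"])
             .filter_bands(["B04", "B03", "B02"])
             .save_result("GTiff"))

data_cube.execute_batch()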

Adding a bbox filter is recommended because filtering by tile ID happens client side.
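One way to picture why the bbox helps: the spatial query narrows the candidate products first, and the tile ID check then runs over that smaller set. The sketch below is plain Python under stated assumptions; the product dictionaries, their bounding boxes, and the helper names `intersects` and `select_products` are all hypothetical and do not reflect the client's actual data model.

```python
from typing import Any, Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)


def intersects(a: BBox, b: BBox) -> bool:
    # Two axis-aligned boxes overlap when they overlap on both axes.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select_products(products: List[Dict[str, Any]], bbox: BBox,
                    tile_ids: Sequence[str]) -> List[Dict[str, Any]]:
    # Step 1: spatial narrowing (stands in for the catalogue request).
    candidates = [p for p in products if intersects(p["bbox"], bbox)]
    # Step 2: tile ID filter applied on the client side.
    return [p for p in candidates if p["tileId"] in tile_ids]


# Hypothetical products with made-up footprints.
products = [
    {"id": "a", "tileId": "31UES", "bbox": (3.0, 51.0, 4.5, 52.0)},
    {"id": "b", "tileId": "31UFS", "bbox": (4.4, 51.0, 6.0, 52.0)},
    {"id": "c", "tileId": "31UFT", "bbox": (4.4, 51.4, 6.0, 52.4)},
    {"id": "d", "tileId": "32UMC", "bbox": (10.0, 10.0, 11.0, 11.0)},
]

selected = select_products(products, (4.41, 51.42, 4.47, 51.47), ["31UES", "31UFS"])
print([p["id"] for p in selected])
```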

bossie commented 4 months ago

There's still some technical debt to be addressed.

  1. the array_contains process is translated to an {"eq": ["31UES", "31UFS"]} criterion;
  2. the criterion's operator "eq" is then ditched (not passed on to Scala);
  3. the array in the criterion's value (["31UES", "31UFS"]) behaves as an OR (tileId == "31UES" || tileId == "31UFS")

This is all very implicit; a better way would be to retain the operator, i.e. pass an explicit "in".
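As a rough illustration of what retaining the operator could look like, the sketch below evaluates criteria that carry an explicit "eq" or "in". It is written in Python for brevity and is not the driver's code or the Scala side; the `matches` function and the criterion shape are assumptions modelled on the {"eq": [...]} form mentioned above.

```python
from typing import Any, Dict


def matches(criteria: Dict[str, Dict[str, Any]], properties: Dict[str, Any]) -> bool:
    # Each criterion is assumed to be a single-entry dict: {operator: operand}.
    for name, criterion in criteria.items():
        operator, operand = next(iter(criterion.items()))
        value = properties.get(name)
        if operator == "eq":
            ok = value == operand
        elif operator == "in":
            ok = value in operand
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
        if not ok:
            return False
    return True


criteria = {"tileId": {"in": ["31UES", "31UFS"]}}
print(matches(criteria, {"tileId": "31UFS"}))
```

With the operator kept, a list operand under "eq" no longer silently behaves as an OR; membership has to be requested with "in".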

bossie commented 4 months ago

Some notes re: the technical debt:

https://github.com/Open-EO/openeo-geopyspark-driver/blob/630439c6170c9ec5668e8d815b9fe0818287f430/openeogeotrellis/layercatalog.py#L295-L310

JeroenVerstraelen commented 3 months ago

@mbuchhorn Your use case should work now.