Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

load_stac missing data when stitching two tiles #778

Closed VictorVerhaert closed 1 month ago

VictorVerhaert commented 1 month ago

When using load_stac on the following collection: https://stac.openeo.vito.be/collections/tree_cover_density_2018 (job_id: j-2405173083f249c2bcc9c07be6e65416), I get the following missing data:

image

From the load_stac API call (STAC API GET https://stac.openeo.vito.be/search?limit=20&bbox=11.1427023295687%2C47.22033843316067%2C11.821519349155245%2C47.628952581107114&datetime=1970-01-01T00%3A00%3A00Z%2F2069-12-31T23%3A59%3A59.999000Z&collections=tree_cover_density_2018&fields=) I fetched the two matching TIFF files (shown in different colors for clarity): image

where you can see that the data from the red tile does exist but is not loaded correctly (the top edge of the missing rectangle corresponds exactly to the dividing line between the two tiles).

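openEO code used on CDSE:

import openeo

# Assumption: the public CDSE openEO endpoint with OIDC login; adjust for your own setup.
connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

spatial_extent = {'west': 11.1427023295687, 'south': 47.22033843316067, 'east': 11.821519349155245, 'north': 47.628952581107114}
# Load the STAC collection, reduce over time and run as a batch job that writes a GeoTIFF.
landsat = connection.load_stac(
    "https://stac.openeo.vito.be/collections/tree_cover_density_2018",
    spatial_extent=spatial_extent,
).max_time().execute_batch("TCD.tif", title="TCD")

bossie commented 1 month ago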

Pushed a quick fix that circumvents the problem in the case of load_stac, now looks like this on staging:

quickfix_scaled

bossie commented 1 month ago

The problem occurs in regions where two features meet and SpaceTimeKeys typically overlap both of these features.

In this case, the SpaceTimeKey (purple) is fullyContained within the bbox of the top feature (red), so only this GeoTiff asset is taken into account and the bottom one is discarded. Unfortunately, the bbox does not match the actual footprint of the asset, and the asset does not have data to fully cover the SpaceTimeKey: hence the gap.

initial_spacetimekey_fully_overlaps_bbox
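To make the mechanism concrete, here is a minimal Python sketch of such a "fully contained" shortcut. It is not the driver's actual code: `BBox`, `select_features`, the bbox of the bottom feature and the key extent are all hypothetical. Only the bbox of the top feature is taken from the item quoted in this thread.

```python
from typing import List, NamedTuple, Tuple


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float

    def contains(self, other: "BBox") -> bool:
        """True if `other` lies completely inside this bbox."""
        return (self.west <= other.west and self.south <= other.south
                and self.east >= other.east and self.north >= other.north)

    def intersects(self, other: "BBox") -> bool:
        """True if the two bboxes share some area."""
        return (self.west < other.east and other.west < self.east
                and self.south < other.north and other.south < self.north)


def select_features(key_extent: BBox, features: List[Tuple[str, BBox]]) -> List[str]:
    """Hypothetical selection with the shortcut: as soon as one feature's bbox
    fully contains the key extent, all other features are discarded."""
    candidates = [(fid, bbox) for fid, bbox in features if bbox.intersects(key_extent)]
    for fid, bbox in candidates:
        if bbox.contains(key_extent):
            return [fid]
    return [fid for fid, _ in candidates]


# bbox reported for item TCD_2018_010m_E44N27_03035_v020 (values from this thread)
top = BBox(11.064548187608006, 47.38783029804821, 12.36948893966052, 48.3083796083107)
# Hypothetical bbox of the neighbouring feature to the south
bottom = BBox(11.05, 46.49, 12.35, 47.41)
# Hypothetical extent of a SpaceTimeKey that straddles the dividing line
key = BBox(11.5, 47.39, 11.6, 47.45)

selected = select_features(key, [("top", top), ("bottom", bottom)])
print(selected)  # ['top']
```

The bottom feature intersects the key extent too, but it is dropped because the top feature's bbox claims to cover the key completely.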

bossie commented 1 month ago

@jdries is this optimization something that we want to have for load_stac as well (I'm assuming yes)? The quick fix I did essentially bypasses it for load_stac.

Otherwise, the real fix is twofold:

1) load_stac should consider a STAC Item's geometry rather than its bbox; this should not be hard to implement.
2) the geometries in the STAC Items in this collection do not match their asset's actual footprint so they will have to be fixed (reingested?).

With the footprints fixed, both assets will be considered, which removes the gap:

actual_footprints_spacetimekey_overlaps_both_assets

jdries commented 1 month ago

I'm not sure if we need the optimization: most products are generated without any overlap. The huge amount of overlap applied to Sentinel-2 is rather the exception. In addition to that, we do the optimization for Sentinel-2 because it is such a commonly used collection. For load_stac, it is probably better to be on the safe side and load a bit more data.

I do believe that we should consider fixing the footprints, and also using the geometry rather than bbox should be a good idea in general.
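As a rough illustration of "being on the safe side", the selection could simply keep every feature whose bbox intersects the key extent, without any containment shortcut. This is a hypothetical sketch, not the driver's code: the helper names, the bottom bbox and the key extent are made up, and only the top bbox comes from this thread.

```python
from typing import List, Tuple

# (west, south, east, north)
Extent = Tuple[float, float, float, float]


def intersects(a: Extent, b: Extent) -> bool:
    """True if two extents share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def select_all_intersecting(key_extent: Extent, features: List[Tuple[str, Extent]]) -> List[str]:
    """Conservative selection: keep every feature that touches the key extent."""
    return [fid for fid, extent in features if intersects(extent, key_extent)]


features = [
    # bbox reported for item TCD_2018_010m_E44N27_03035_v020 (values from this thread)
    ("top", (11.064548187608006, 47.38783029804821, 12.36948893966052, 48.3083796083107)),
    # hypothetical bbox of the neighbouring feature to the south
    ("bottom", (11.05, 46.49, 12.35, 47.41)),
]
# hypothetical extent of a key that straddles the dividing line
key = (11.5, 47.39, 11.6, 47.45)

loaded = select_all_intersecting(key, features)
print(loaded)  # ['top', 'bottom']
```

This reads more assets per key than strictly needed when bboxes overlap, which is the trade-off described above.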

bossie commented 1 month ago

@VictorVerhaert is it Stijn C. that is responsible for https://stac.openeo.vito.be/collections/tree_cover_density_2018 or who should I bother?

VictorVerhaert commented 1 month ago

@bossie I made that collection myself. Can you clarify what exactly is wrong? Is it just the bbox that doesn't match?

bossie commented 1 month ago

@VictorVerhaert At least bbox and geometry, haven't checked proj:bbox and proj:geometry.

This item for example: https://stac.openeo.vito.be/collections/tree_cover_density_2018/items/TCD_2018_010m_E44N27_03035_v020

reports a geometry of:

{
  "type": "Polygon",
  "coordinates": [
    [
      [
        11.064548187608006,
        47.38783029804821
      ],
      [
        11.064548187608006,
        48.3083796083107
      ],
      [
        12.36948893966052,
        48.3083796083107
      ],
      [
        12.36948893966052,
        47.38783029804821
      ],
      [
        11.064548187608006,
        47.38783029804821
      ]
    ]
  ]
}

whereas I would expect it to be something like:

{
  "type": "Polygon",
  "coordinates": [
    [
      [
        11.046005504476401,
        47.40858428037738
      ],
      [
        11.707867449704809,
        47.40021736186508
      ],
      [
        12.36948893966052,
        47.38783030409527
      ],
      [
        12.390240820693707,
        47.837566260620925
      ],
      [
        12.411462626880093,
        48.28720072607632
      ],
      [
        11.738134164531402,
        48.29984134090657
      ],
      [
        11.064548187608006,
        48.30837961418922
      ],
      [
        11.055172953154765,
        47.85853023272656
      ],
      [
        11.046005504476401,
        47.40858428037738
      ]
    ]
  ]
}

such that it matches the actual footprint of the GeoTiff asset.
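The difference between the two geometries can be checked with a small pure-Python sketch. It uses a standard ray-casting point-in-polygon test and treats lon/lat as planar coordinates, which is an approximation. The function names and probe points are hypothetical; the coordinates are the ones quoted above.

```python
from typing import List, Tuple

Ring = List[Tuple[float, float]]  # closed ring of (lon, lat); first point == last point


def point_in_ring(lon: float, lat: float, ring: Ring) -> bool:
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def point_in_bbox(lon: float, lat: float, bbox: Tuple[float, float, float, float]) -> bool:
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


# Extent of the geometry currently reported by the item (a plain rectangle)
reported_bbox = (11.064548187608006, 47.38783029804821, 12.36948893966052, 48.3083796083107)

# Footprint that would match the GeoTiff asset
expected_footprint: Ring = [
    (11.046005504476401, 47.40858428037738),
    (11.707867449704809, 47.40021736186508),
    (12.36948893966052, 47.38783030409527),
    (12.390240820693707, 47.837566260620925),
    (12.411462626880093, 48.28720072607632),
    (11.738134164531402, 48.29984134090657),
    (11.064548187608006, 48.30837961418922),
    (11.055172953154765, 47.85853023272656),
    (11.046005504476401, 47.40858428037738),
]

# Hypothetical probe just above the southern edge of the reported rectangle
probe = (11.7, 47.39)
in_bbox = point_in_bbox(*probe, reported_bbox)
in_footprint = point_in_ring(*probe, expected_footprint)
print(in_bbox, in_footprint)  # True False

# Hypothetical probe in the middle of the tile
center_ok = point_in_ring(11.7, 47.9, expected_footprint)
print(center_ok)  # True
```

The first probe lies inside the reported rectangle but south of the expected footprint, which is where the reported geometry claims coverage that the asset does not provide.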

bossie commented 1 month ago

Disabled the optimization in case of load_stac (quick fix became real fix).

load_stac will take a STAC Item's geometry property into account as well (needs a recent openeo-opensearch-client).