Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

Synchronous requests: block if too large #616

Closed jdries closed 8 months ago

jdries commented 10 months ago

On Copernicus dataspace, I see synchronous requests of e.g. 20000x20000 pixels, which subsequently timeout after 15 minutes. It's probably better to block this up front, as these requests also use up resources and thus negatively impact performance of other requests.

We normally already have some infra in place to estimate the number of pixels in the validation call.
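As a rough illustration of what such an up-front estimate involves (all names here are hypothetical, not the driver's actual API): multiply the extent dimensions in pixels by the number of bands and dates.

```python
# Hypothetical sketch of estimating the pixel count of a request from
# its spatial extent and the collection resolution. Illustrative only;
# not the actual validation code in this repo.

def estimate_pixel_count(west: float, south: float, east: float, north: float,
                         resolution_deg: float, n_bands: int, n_dates: int) -> float:
    """Rough upper bound on the number of pixels a request would touch."""
    width = (east - west) / resolution_deg    # pixels along x
    height = (north - south) / resolution_deg # pixels along y
    return width * height * n_bands * n_dates

# A 1x1 degree request at ~1e-4 degree (~10 m) resolution, 2 bands, 1 date:
pixels = estimate_pixel_count(4, 50, 5, 51, 1e-4, n_bands=2, n_dates=1)
print(f"{pixels:.2e}")  # on the order of 2e8 per date
```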

EmileSonneveld commented 9 months ago

Setting the 20000x20000 limit will fail some tests (e.g. test_aggregate_feature_collection_heterogeneous). I applied the check from https://github.com/Open-EO/openeo-geopyspark-driver/issues/320 before each sync call, but relaxed it by a factor of 10 to avoid causing too many issues right away.

This example job will fail when launched sync:

{
  "process_graph": {
    "loadcollection1": {
      "process_id": "load_collection",
      "arguments": {
        "id": "SENTINEL2_L2A",
        "spatial_extent": {
          "west": 4,
          "east": 5,
          "south": 50,
          "north": 51
        },
        "temporal_extent": [
          "2010-01-01",
          "2030-01-10"
        ],
        "bands": [
          "B01",
          "B02"
        ]
      },
      "result": true
    }
  },
  "parameters": []
}

Error: Requested extent for collection 'SENTINEL2_L2A' is too large to process. Estimated number of pixels: 1.21e+12, threshold: 1.00e+12.
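The guard described here can be sketched as follows. The function and exception names are illustrative (not the driver's real identifiers); the threshold and message mirror the error above.

```python
# Sketch of an up-front pixel-volume guard for synchronous requests:
# estimate the pixel count, and refuse the request before processing
# if it exceeds a threshold. Hypothetical names, illustrative only.

PIXEL_THRESHOLD = 1.00e12  # threshold from the error message above

class ExtentTooLargeError(Exception):
    pass

def check_sync_request(collection_id: str, estimated_pixels: float) -> None:
    """Raise before any processing happens if the request is too large."""
    if estimated_pixels > PIXEL_THRESHOLD:
        raise ExtentTooLargeError(
            f"Requested extent for collection {collection_id!r} is too large to process. "
            f"Estimated number of pixels: {estimated_pixels:.2e}, "
            f"threshold: {PIXEL_THRESHOLD:.2e}."
        )

# The example job above was estimated at ~1.21e12 pixels:
try:
    check_sync_request("SENTINEL2_L2A", 1.21e12)
except ExtentTooLargeError as e:
    print(e)
```

Failing fast like this frees the cluster resources that a doomed 15-minute request would otherwise consume.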

jdries commented 9 months ago

@EmileSonneveld meaning this is solved?

bossie commented 9 months ago

Latest changes in openeo-python-driver seem to have caused a bunch of tests in openeo-geopyspark-driver to fail (in particular: tests.test_api_result).

JeroenVerstraelen commented 9 months ago

Note that this dependency https://github.com/Open-EO/openeo-geopyspark-driver/issues/618 was completed and tested.

EmileSonneveld commented 9 months ago

Ukraine is 1400 km wide. Requesting it from a 100 m resolution layer would give a width of 14k pixels, still within the limit. Requesting Russia would be too large. No country in Europe should hit the limit. cc @mbuchhorn
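The back-of-envelope arithmetic above is simply extent width divided by ground resolution:

```python
# Width in pixels of an extent at a given ground resolution.
# Illustrative helper, not part of the driver.

def width_in_pixels(extent_km: float, resolution_m: float) -> float:
    return extent_km * 1000 / resolution_m

print(width_in_pixels(1400, 100))  # Ukraine at 100 m: 14000 px, within a 20k limit
```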

soxofaan commented 9 months ago

It looks like there is some confusion in this thread about how to count the number of pixels.

jdries commented 9 months ago

@soxofaan both are relevant. The first check, purely on area, matters especially for sync jobs, where the area shouldn't be too large. A second check can then estimate the overall pixel count...

EmileSonneveld commented 9 months ago

True, the is_layer_too_large check is now done before each sync request, and a simple 20k pixel size check has been added as well (sync requests only).

soxofaan commented 9 months ago

the is_layer_too_large check

I guess this is part of the confusion: that function talks about a "layer", which is a confusing term in an openEO context. It sounds like a single band, but the intent of the function seems to be to estimate the size of the whole data cube (which it currently does only very roughly), and now it also checks the size of the purely spatial extent.

EmileSonneveld commented 9 months ago

Would one of these names be better: is_requested_extent_too_large or is_pixel_volume_too_large?

soxofaan commented 9 months ago

Well, I think is_layer_too_large as a whole is a bit weird to work with and at the wrong abstraction level. I think it's cleaner and easier to have a function (or functions) that just estimates the size of a spatial extent, of a cube, of a process graph, ... and to separate that from the logic that decides whether a size is too large in some sense.
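A sketch of that separation (all names hypothetical, just to illustrate the suggested split between estimation and policy):

```python
# Hypothetical refactoring sketch: pure size-estimation functions,
# with the "too large?" decision kept separate. Not the actual
# driver code; names are illustrative.

from dataclasses import dataclass

@dataclass
class SpatialExtent:
    west: float
    south: float
    east: float
    north: float

def estimate_extent_pixels(extent: SpatialExtent, resolution: float) -> float:
    """Estimate the pixel count of a purely spatial extent."""
    width = (extent.east - extent.west) / resolution
    height = (extent.north - extent.south) / resolution
    return width * height

def estimate_cube_pixels(extent_pixels: float, n_bands: int, n_dates: int) -> float:
    """Estimate the total pixel volume of a data cube."""
    return extent_pixels * n_bands * n_dates

def exceeds_threshold(estimated: float, threshold: float) -> bool:
    """The policy decision, decoupled from the estimation itself."""
    return estimated > threshold
```

With this split, the same estimators can serve the validation endpoint, the sync guard, and logging, while thresholds stay configurable in one place.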

EmileSonneveld commented 9 months ago

I think it was originally implemented this way because the pixel volume is only calculated on the geometry when the cheap check exceeds the threshold. @JeroenVerstraelen

EmileSonneveld commented 9 months ago

A size check already existed in the code: https://github.com/Open-EO/openeo-geopyspark-driver/blob/05e2a3cb4e9f212bf56a8f305c80146c5883c6e8/openeogeotrellis/backend.py#L2612 (it compares km², not pixel volume).
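For contrast with the pixel-volume approach, an area-based check along those lines could look like this (approximate spherical math, hypothetical names, not the linked code):

```python
# Sketch of an area-based size check: approximate the area of a
# lat/lon bounding box in km² and compare against a limit.
# Illustrative only; the real check lives in backend.py.

import math

def extent_area_km2(west: float, south: float, east: float, north: float) -> float:
    """Approximate area of a lat/lon bounding box in km²."""
    km_per_deg = 111.32  # length of one degree of latitude
    mean_lat = math.radians((south + north) / 2)
    width_km = (east - west) * km_per_deg * math.cos(mean_lat)  # shrink with latitude
    height_km = (north - south) * km_per_deg
    return width_km * height_km

# The 1x1 degree example extent from this thread:
print(round(extent_area_km2(4, 50, 5, 51)))  # roughly 8000 km²
```

Area in km² is cheap to compute from the request alone, which is why it makes a good first-pass filter before any per-collection resolution lookup.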

EmileSonneveld commented 9 months ago

In layercatalog.json, "step" (resolution) is sometimes defined in lat/lon and sometimes in the layer's native CRS.

Most seem to be in the native CRS. Is it OK to put them all in the native CRS? For example, SEASONAL_TRAJECTORIES would go from "step": 0.0000992063492063 to "step": 10.

"extent" seems to always be in LatLon

soxofaan commented 9 months ago

note that layercatalog.json is roughly what we expose as collection metadata to the end user, so you should follow the specs that we want to follow. for example, with "step" I assume you are talking about "cube:dimensions" -> https://github.com/stac-extensions/datacube#dimension-object so the step size should be according to "reference_system" (which defaults to EPSG 4326). Or is that actually what you mean with "layer native crs"?