jdries closed this 8 months ago
Setting the 20000x20000 limit will fail some tests (e.g. test_aggregate_feature_collection_heterogeneous). I applied the check from https://github.com/Open-EO/openeo-geopyspark-driver/issues/320 before each sync call, but relaxed it by a factor of 10 so it doesn't cause too many issues right away.
This example job will fail when launched synchronously:
```json
{
  "process_graph": {
    "loadcollection1": {
      "process_id": "load_collection",
      "arguments": {
        "id": "SENTINEL2_L2A",
        "spatial_extent": {"west": 4, "east": 5, "south": 50, "north": 51},
        "temporal_extent": ["2010-01-01", "2030-01-10"],
        "bands": ["B01", "B02"]
      },
      "result": true
    }
  },
  "parameters": []
}
```
Error: Requested extent for collection 'SENTINEL2_L2A' is too large to process. Estimated number of pixels: 1.21e+12, threshold: 1.00e+12.
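For context, the guard that produces an error like this can be sketched roughly as follows. The function and parameter names here are illustrative assumptions, not the actual openeo-geopyspark-driver code:

```python
# Hypothetical sketch of the sync-request size guard; names are illustrative,
# not the actual openeo-geopyspark-driver implementation.
PIXEL_THRESHOLD = 1.00e12  # threshold quoted in the error message


def estimate_pixel_count(west, east, south, north, step_deg, n_bands, n_dates):
    """Rough pixel-volume estimate: bbox size over resolution, times bands and dates."""
    width = (east - west) / step_deg
    height = (north - south) / step_deg
    return width * height * n_bands * n_dates


def check_extent(collection_id, estimate):
    """Reject a sync request up front when the estimated pixel volume is too large."""
    if estimate > PIXEL_THRESHOLD:
        raise ValueError(
            f"Requested extent for collection {collection_id!r} is too large to process. "
            f"Estimated number of pixels: {estimate:.2e}, threshold: {PIXEL_THRESHOLD:.2e}."
        )
```

With the 1°x1° bbox above, two bands, a ~10 m step, and a 20-year temporal extent, the estimate easily exceeds 1e12, which is why the sync request is rejected.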
@EmileSonneveld meaning this is solved?
Latest changes in openeo-python-driver seem to have caused a bunch of tests in openeo-geopyspark-driver to fail (in particular: tests.test_api_result).
Note that this dependency https://github.com/Open-EO/openeo-geopyspark-driver/issues/618 was completed and tested.
Ukraine is 1400 km wide. Requesting it from a 100 m resolution layer would give a width of 14k pixels, still within the limit. Requesting Russia will be too large. No country in Europe should hit the limit. cc @mbuchhorn
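The back-of-the-envelope arithmetic behind that claim (a sketch; 1400 km is the width quoted above):

```python
# Width of Ukraine in pixels at a 100 m resolution (rough arithmetic sketch).
width_km = 1400
resolution_m = 100
width_pixels = width_km * 1000 // resolution_m  # 14000 pixels
# Well within a 20000-pixel-wide limit, so a whole-country request would pass.
```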
It looks like there is some confusion in this thread about how to count the number of pixels:
@soxofaan both are relevant. The first check, purely on spatial area, is relevant: especially for sync jobs this shouldn't be too high. A second check can then estimate the overall pixel count...
True, the is_layer_too_large check is now done before each sync request. And a simple 20k pixel-size test has been added too (only for sync requests).
> the is_layer_too_large check
I guess this is part of the confusion: that function talks about a "layer", which is a confusing term in the openeo context. It sounds like a single band, but the intent of that function seems to be to estimate the size of the whole data cube (which it currently does at most very roughly), and now it also checks the size of the purely spatial extent.
Would one of these names be better: is_requested_extent_too_large or is_pixel_volume_too_large?
Well, I think is_layer_too_large as a whole is a bit weird to work with and at the wrong abstraction level.
I think it's cleaner and easier to work with if you have a function (or functions) that just estimate the size of a spatial extent, of a cube, of a process graph, ... and separate that from the logic that decides whether a size is too large in some sense.
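The separation suggested here could look something like the following sketch. All function names are hypothetical, not the actual driver API:

```python
# Sketch of separating size estimation from the "too large" decision.
# All names are hypothetical, not the actual openeo-geopyspark-driver API.


def estimate_spatial_pixels(extent: dict, step: float) -> float:
    """Pixels covered by a bbox at the given resolution (extent and step in the same CRS units)."""
    width = (extent["east"] - extent["west"]) / step
    height = (extent["north"] - extent["south"]) / step
    return width * height


def estimate_cube_pixels(spatial_pixels: float, n_bands: int, n_timesteps: int) -> float:
    """Total pixel volume of a data cube: spatial pixels times bands times time steps."""
    return spatial_pixels * n_bands * n_timesteps


def is_too_large(size: float, threshold: float) -> bool:
    """The decision logic, kept separate from the estimators."""
    return size > threshold
```

This way the cheap purely-spatial check and the full pixel-volume check can reuse the same decision function with different thresholds.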
I think it was originally implemented this way because the pixel volume is only calculated on the geometry when the cheap check is above the threshold. @JeroenVerstraelen
A size check already existed in the code: https://github.com/Open-EO/openeo-geopyspark-driver/blob/05e2a3cb4e9f212bf56a8f305c80146c5883c6e8/openeogeotrellis/backend.py#L2612 (it compares in km², not in pixel volume).
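For reference, an area check in km² on a lat/lon bbox can be approximated like this. This is a sketch of the idea, not the linked implementation:

```python
import math


def extent_area_km2(west, east, south, north):
    """Approximate area of a lat/lon bbox in km², shrinking longitude by cos(latitude)."""
    mean_lat = math.radians((south + north) / 2)
    width_km = (east - west) * 111.32 * math.cos(mean_lat)  # ~111.32 km per degree of longitude at the equator
    height_km = (north - south) * 110.57                    # ~110.57 km per degree of latitude
    return width_km * height_km
```

The 1°x1° bbox from the example job above comes out to roughly 7800 km².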
In layercatalog.json, "step" (resolution) is sometimes defined in LatLon and sometimes in the layer's native CRS. Most seem to be in layer native; is it OK to put them all in layer native?
For example, SEASONAL_TRAJECTORIES would go from "step": 0.0000992063492063 to "step": 10. "extent" seems to always be in LatLon.
Note that layercatalog.json is roughly what we expose as collection metadata to the end user, so you should follow the specs we want to follow. For example, with "step" I assume you are talking about "cube:dimensions" -> https://github.com/stac-extensions/datacube#dimension-object, so the step size should be according to "reference_system" (which defaults to EPSG 4326). Or is that actually what you mean by "layer native crs"?
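For illustration, a spatial dimension entry following the STAC datacube extension might look like this (the extent values are made-up examples; the step is the SEASONAL_TRAJECTORIES value quoted above, and "reference_system" defaults to EPSG 4326 when omitted):

```json
"cube:dimensions": {
  "x": {
    "type": "spatial",
    "axis": "x",
    "extent": [2.5, 6.4],
    "step": 0.0000992063492063,
    "reference_system": 4326
  }
}
```

If "step" were switched to the layer's native CRS (e.g. 10 for a 10 m UTM grid), "reference_system" would have to be set to that native CRS for the metadata to stay spec-compliant.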
On Copernicus dataspace, I see synchronous requests of e.g. 20000x20000 pixels, which subsequently time out after 15 minutes. It's probably better to block these up front, as such requests also use up resources and thus negatively impact the performance of other requests.
We normally already have some infra in place to estimate the number of pixels in the validation call.