clausmichele opened this issue 3 years ago
Is anything happening on this matter? We really need to filter by polygons, so I looked around a bit, and it seems like filter_spatial should do the job. However, when I try to use it, I noticed that in .../string_creation.py the geometries argument is recognized, but since it is a dict with a coordinates list and not a list itself, it gets ignored even though it IS listed in 'default'.
At the bottom of this message I add a complete example (where I patched the geometries manually). However, even if I add the geometry by hand to the generated code, it does not work either.
In rioxarray's _internal_bounds it crashes because the arrays x_dim and y_dim are empty.
So the question here is whether it should work this way or whether there is some problem we need to fix. (We are using the latest openeo-odc and pg-parser, but an older processes-python, since the latter did not work for us (yet).)
All the best /Daniel
-----------------------------------------------------------------
Complete example
------------------------------------------------------------------
process_graph = {
    "process_graph": {
        "loadcollection1": {
            "process_id": "load_collection",
            "arguments": {
                "bands": ["oa04", "oa08"],
                "id": "s3_olci_l2wfr",
                "spatial_extent": {
                    "west": 16.933234243652887,
                    "south": 56.31269320304074,
                    "east": 18.0715952528406,
                    "north": 57.38588832427217,
                },
                "temporal_extent": ["2022-03-15", "2022-03-16"],
            },
        },
        "filterspatial1": {
            "process_id": "filter_spatial",
            "arguments": {
                "data": {"from_node": "loadcollection1_0"},
                "geometries": {
                    "type": "Polygon",
                    "coordinates": [
                        [
                            [16.933234243652887, 56.31269320304074],
                            [18.0715952528406, 56.87459549103235],
                            [17.875326113325478, 57.36472525905839],
                            [17.286518694780106, 57.38588832427217],
                            [16.933234243652887, 56.31269320304074],
                        ]
                    ],
                },
            },
        },
        "saveresult1": {
            "process_id": "save_result",
            "arguments": {
                "data": {"from_node": "filterspatial1_1"},
                "format": "gtiff",
                "options": {},
            },
            "result": True,
        },
    }
}
# ----------------------------------------------------------
# Generated Script
# ----------------------------------------------------------
import datacube
import openeo_processes as oeop

cube = datacube.Datacube()

_loadcollection1_0 = oeop.load_collection(
    odc_cube=cube,
    **{
        "product": "s3_olci_l2wfr",
        "dask_chunks": {"y": 12160, "x": 12114, "time": "auto"},
        "x": (16.933234243652887, 18.0715952528406),
        "y": (56.31269320304074, 57.38588832427217),
        "time": ["2022-03-15", "2022-03-15"],
        "measurements": ["oa04", "oa08"],
    }
)

geometries = {  # This one added by hand
    "type": "Polygon",
    "coordinates": [
        (16.933234243652887, 56.31269320304074),
        (18.0715952528406, 56.87459549103235),
        (17.875326113325478, 57.36472525905839),
        (17.286518694780106, 57.38588832427217),
        (16.933234243652887, 56.31269320304074),
    ],
}

_filterspatial1_1 = oeop.filter_spatial(
    **{"data": _loadcollection1_0, "geometries": geometries}  # geometry added by hand
)

_saveresult1_2 = oeop.save_result(
    **{"data": _filterspatial1_1, "format": "gtiff", "options": {}}
)
Looking at the processes-python repo, is the above just a consequence of https://github.com/Open-EO/openeo-processes-python/issues/93 ?
Hi @danielFlemstrom. I think that currently the implementation of filter_spatial added by @ValentinaHutter in this commit https://github.com/Open-EO/openeo-processes-python/commit/2c3b9c7af64f60ee9c370a705132d447d2e28e95 does not mask the pixels outside the polygon.
ODC also allows polygons as query items, but it doesn't mask out the pixels outside the polygon(s) either.
In conclusion, the solution would be to implement mask_polygon https://processes.openeo.org/#mask_polygon
Just to add some information, a basic implementation of filter_spatial, which also masks out the data outside the polygons/geometries, would look like this: https://github.com/SARScripts/openeo_odc_driver/blob/a5e36e5ed9b2ad6e81630c5ba2b0705b8ad023b2/openeo_odc_driver.py#L622
if processName == 'filter_spatial':
    source = node.arguments['data']['from_node']
    gdf = gpd.GeoDataFrame.from_features(node.arguments['geometries']['features'])
    ## Currently I suppose the input geometries are in EPSG:4326 and the collection is projected in UTM
    gdf = gdf.set_crs(4326)
    gdf_utm = gdf.to_crs(int(self.partialResults[source].spatial_ref))
    ## Clip the data and keep only the data within the polygons
    self.partialResults[node.id] = self.partialResults[source].rio.clip(gdf_utm.geometry, drop=True)
However, it hasn't been officially implemented like this since it's not highly scalable.
I guess scalable is a desired property here... Maybe the mask_polygon approach could work for us. I guess you mean something like this:
It seems to be implemented already, but it crashes with:
Traceback (most recent call last):
  File "code.py", line 138, in <module>
    _4_2 = oeop.save_result(**{"data": _3_1, "format": "GTIFF"})
  ... lots and lots of stuff :) ...
  File "/user/ubuntu/.venv/lib/python3.8/site-packages/openeo_processes/cubes.py", line 1725, in insides
    val = np.append(val, p.within(polygon))
NameError: free variable 'polygon' referenced before assignment in enclosing scope
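For what it's worth, that NameError is the typical symptom of a nested function referencing a variable that the enclosing function only assigns on some branches. A minimal, purely illustrative sketch of the pattern (hypothetical code, not the actual cubes.py implementation):

import numpy as np
from shapely.geometry import Point, shape

def check_points(points, geometries):
    # 'polygon' only gets assigned in this branch ...
    if isinstance(geometries, dict) and geometries.get("type") == "Polygon":
        polygon = shape(geometries)

    def insides():
        val = np.array([], dtype=bool)
        for p in points:
            # ... so when the branch above is skipped, 'polygon' is an unbound free
            # variable here and Python 3.8 raises exactly the NameError from the
            # traceback above.
            val = np.append(val, p.within(polygon))
        return val

    return insides()

# Passing a FeatureCollection (or any non-Polygon dict) never assigns 'polygon',
# so the nested function crashes with:
# NameError: free variable 'polygon' referenced before assignment in enclosing scope
check_points([Point(17.0, 56.5)], {"type": "FeatureCollection", "features": []})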
Is this process under development by someone, or planned?
Oh nice, I didn't know that mask_polygon was already there. It's either @LukeWeidenwalker or @ValentinaHutter that implemented it, let's wait for a comment from them.
The best, of course, would be if we did not need to specify a bounding box in load_collection (maybe the mask process can derive this anyway? Not sure how Dask handles this).
Specifying a bbox/spatial_extent in load_collection (i.e. in the ODC load process) in theory is not necessary, since the data is lazily loaded. However, generally the process is faster when you define a bbox in load_collection, since calling odc.load() requires less time.
Sample dummy test:
import datacube
from time import time

dc = datacube.Datacube()

# Load without a bounding box
start_time = time()
query = {}
query['product'] = "S2_L2A_T32TPS"
datasets = dc.find_datasets(**query)
query['dask_chunks'] = {"time": -1, "x": 1000, "y": 1000}
data = dc.load(datasets=datasets, **query)
print("Elapsed time no bbox: ", time() - start_time)

# Load with a bounding box
start_time = time()
query = {}
query['product'] = "S2_L2A_T32TPS"
spatial_extent = {'west': 11.345315, 'east': 11.347976, 'south': 46.493768, 'north': 46.495422}
query['y'] = (spatial_extent['north'], spatial_extent['south'])
query['x'] = (spatial_extent['west'], spatial_extent['east'])
datasets = dc.find_datasets(**query)
query['dask_chunks'] = {"time": -1, "x": 1000, "y": 1000}
data = dc.load(datasets=datasets, **query)
print("Elapsed time bbox: ", time() - start_time)
Elapsed time no bbox: 34.441250801086426
Elapsed time bbox: 1.9410793781280518
So I guess it all boils down to getting mask_polygon to work. @LukeWeidenwalker or @ValentinaHutter, is there anything we at RISE can do to assist you?
The NameError @danielFlemstrom reports in mask_polygon is a bug for sure! This process was implemented by @ValentinaHutter (and I'm out-of-office next week), so she's probably best placed to suggest a fix.
However, @ValentinaHutter @clausmichele I haven't worked with the mask_polygon process at all yet, but it seems to me that aggregate_spatial and mask_polygon should internally use the same logic, right?
On the scalability side, this is what we have tried so far:
1) self.partialResults[source].rio.clip(gdf_utm.geometry, drop=True) causes dask to execute all previous processes. Doing this for multiple geometries therefore repeats all that compute each time; we haven't found a way yet to make dask remember these intermediate results without running out of memory on the workers.
2) If you look in aggregate_spatial, you will find an alternative approach that uses rasterio.features.geometry_mask to build a mask in xarray and lazily (ish) applies that to the input raster (see the sketch after this comment). However, the mask is loaded into memory, which is a problem when dealing with large spatial extents coming from load_collection. We haven't figured out how to chunk the mask such that this doesn't happen.
3) Optimising the process graph directly: load_collection with a large spatial extent, followed by some compute (e.g. median), followed by a filter-type operation (filter_bbox, aggregate_spatial, mask_polygon) on a much smaller polygon/subarea means that we're effectively loading/processing all that data, just to throw it away later. We haven't found a way to achieve the same lazily in Dask.
4) We also haven't seen success with directly parametrising the odc.load() call with more than one polygon. In our experiments, this would just cause the bounding box of all these polygons to be loaded, so we ran into the same problem of the mask being huge as in approach 2).
I think I agree that figuring out a scalable (lazily-evaluated) implementation for mask_polygon is exactly what we're lacking to resolve this, so we'd be really keen on getting input on how to approach this!
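For illustration, here is a minimal sketch of the geometry_mask approach mentioned in 2), i.e. rasterising the geometries onto the cube's grid and applying the mask with where(). Function and variable names are made up; this is not the actual aggregate_spatial implementation:

import geopandas as gpd
import rasterio.features
import rioxarray  # noqa: F401 - registers the .rio accessor
import xarray as xr

def build_polygon_mask(data: xr.DataArray, gdf: gpd.GeoDataFrame) -> xr.DataArray:
    """Rasterise the geometries onto the grid of `data`; True inside the polygons."""
    gdf = gdf.to_crs(data.rio.crs)  # assumes the cube has a CRS set via rioxarray
    mask = rasterio.features.geometry_mask(
        gdf.geometry,
        out_shape=(data.sizes["y"], data.sizes["x"]),
        transform=data.rio.transform(),
        invert=True,  # True inside the geometries
    )
    # NOTE: geometry_mask returns a plain numpy array, so the whole mask sits in
    # memory - exactly the limitation described in point 2).
    return xr.DataArray(mask, dims=("y", "x"), coords={"y": data["y"], "x": data["x"]})

def mask_polygon_sketch(data: xr.DataArray, gdf: gpd.GeoDataFrame) -> xr.DataArray:
    # Keep values inside the polygons, NaN elsewhere; dimensions stay unchanged.
    return data.where(build_polygon_mask(data, gdf))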
Thanks @LukeWeidenwalker for the good summary!
- However, @ValentinaHutter @clausmichele I haven't worked with the mask_polygon process at all yet, but it seems to me that aggregate_spatial and mask_polygon should internally use the same logic, right?
Indeed! In aggregate_spatial we mask the data using a geometry and then apply a reducer, whereas mask_polygon would just do the first step for all the geometries and return a datacube with the same dimensions as the input one.
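Roughly speaking, the relationship could be sketched like this (conceptual only, reusing the hypothetical build_polygon_mask() helper from the sketch above):

# Shared masking step (same dimensions as the input cube, NaN outside the geometries):
masked = data.where(build_polygon_mask(data, geometries))

# mask_polygon: stop here and return the masked cube as-is.
mask_polygon_result = masked

# aggregate_spatial: additionally apply a reducer over the spatial dimensions
# (in the real process, per geometry), e.g. a mean.
aggregate_spatial_result = masked.mean(dim=("x", "y"))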
Sorry for the late reply, I've been out of office for the last two days.
So, concerning the implementation of filter_spatial and the problem with openeo-odc: I started working on these some time ago, before we started working on the aggregate_spatial process and before we had a process to load geometries from URLs and allowed the geometries parameter to be a reference, too. When I updated openeo-odc to allow geometries to be URLs or references, I did not realise that, because of this update, dicts do not work anymore. So I will update openeo-odc again to allow geometries to be dicts as well. I also think it makes sense to update the filter_spatial process so that the geometries can come from a URL, too. So, this should not be too hard to fix. Thank you for pointing this out!
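Just to illustrate the idea (a hypothetical sketch, not the actual string_creation.py code), the string creation would then have to dispatch on the type of the geometries argument:

def format_geometries_argument(value):
    # Hypothetical dispatch for rendering the 'geometries' argument into the generated script.
    if isinstance(value, dict) and "from_node" in value:
        return f"_{value['from_node']}"   # reference to another node's output
    if isinstance(value, dict) and "from_parameter" in value:
        return value["from_parameter"]    # reference to a process parameter
    if isinstance(value, str):
        return repr(value)                # URL (or path) to a GeoJSON file
    if isinstance(value, dict):
        return repr(value)                # inline GeoJSON dict - the case that was being dropped
    raise ValueError(f"Unsupported 'geometries' argument: {value!r}")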
The other problem we have is more complex anyway and is just what @LukeWeidenwalker described above. I think that also with the current implementation of filter_spatial as well as mask_polygon we are having scalability issues. I also agree with Lukas that aggregate_spatial and mask_polygon should use the same logic and that filter_spatial should be implemented in a similar way.
I had already thought about aligning those processes, but I think this will only make sense once our work on aggregate_spatial is finished. So we will come back to you once this is done.
That sounds great; we will sit back and wait eagerly then. Please reach out to us if you need any assistance or harsh testing!
@ValentinaHutter How are things going with this issue? Are there any updates on the LoadCollection mapping? The polygon spatial extent is still not included in the generated code. Are you abandoning this repo as well (like processes-python)?
Hello! In https://github.com/Open-EO/openeo-odc/commit/0d59ac3d6c99011d9f41da4a850d1cb7456fee68 I made the change to allow geometries to be dicts.
But yes, we are currently refactoring our backend (https://github.com/Open-EO/openeo-processes-python/issues/198), and we aim to have a solution that does not include strings of python code, so we will not be maintaining the openeo-odc repo anymore.
Using a polygon in load_collection is not supported yet. I opened an issue as a reminder.
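For reference, in the openEO load_collection process definition the spatial_extent can also be a GeoJSON geometry instead of a bounding box, so on the request side such a call would look roughly like the sketch below (reusing the polygon from the example at the top of this issue); the missing part is mapping this case to the ODC load in openeo-odc:

# Hypothetical load_collection node with a GeoJSON spatial_extent:
loadcollection_with_polygon = {
    "process_id": "load_collection",
    "arguments": {
        "bands": ["oa04", "oa08"],
        "id": "s3_olci_l2wfr",
        # a GeoJSON geometry in place of the west/south/east/north bounding box
        "spatial_extent": {
            "type": "Polygon",
            "coordinates": [
                [
                    [16.933234243652887, 56.31269320304074],
                    [18.0715952528406, 56.87459549103235],
                    [17.875326113325478, 57.36472525905839],
                    [17.286518694780106, 57.38588832427217],
                    [16.933234243652887, 56.31269320304074],
                ]
            ],
        },
        "temporal_extent": ["2022-03-15", "2022-03-16"],
    },
}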