clausmichele opened this issue 3 years ago
Is anything happening on this matter? We really need to filter by polygons, so I looked around a bit, and it seems like filter_spatial should do the job. However, when I try to use it, I noticed that in .../string_creation.py the geometries argument is recognized, but since it is a dict with a coordinates list and not a list itself, it gets ignored even though it IS listed in 'default'.
At the bottom of this message I add a complete example (where I patched the geometries manually). However, even if I add the geometry by hand to the generated code, it does not work either.
In rioxarray's _internal_bounds it crashes because the arrays x_dim and y_dim are empty.
So the question here is whether it should work this way or whether there is some problem we need to fix. (We are using the latest openeo-odc and pg-parser, but an older processes-python, since the latter did not work for us (yet).)
All the best /Daniel
-----------------------------------------------------------------
Complete example
------------------------------------------------------------------
process_graph = {
    "process_graph": {
        "loadcollection1": {
            "process_id": "load_collection",
            "arguments": {
                "bands": ["oa04", "oa08"],
                "id": "s3_olci_l2wfr",
                "spatial_extent": {
                    "west": 16.933234243652887,
                    "south": 56.31269320304074,
                    "east": 18.0715952528406,
                    "north": 57.38588832427217,
                },
                "temporal_extent": ["2022-03-15", "2022-03-16"],
            },
        },
        "filterspatial1": {
            "process_id": "filter_spatial",
            "arguments": {
                "data": {"from_node": "loadcollection1_0"},
                "geometries": {
                    "type": "Polygon",
                    "coordinates": [
                        [
                            [16.933234243652887, 56.31269320304074],
                            [18.0715952528406, 56.87459549103235],
                            [17.875326113325478, 57.36472525905839],
                            [17.286518694780106, 57.38588832427217],
                            [16.933234243652887, 56.31269320304074],
                        ]
                    ],
                },
            },
        },
        "saveresult1": {
            "process_id": "save_result",
            "arguments": {
                "data": {"from_node": "filterspatial1_1"},
                "format": "gtiff",
                "options": {},
            },
            "result": True,
        },
    }
}
# ----------------------------------------------------------
# Generated Script
# ----------------------------------------------------------
import datacube
import openeo_processes as oeop

cube = datacube.Datacube()

_loadcollection1_0 = oeop.load_collection(
    odc_cube=cube,
    **{
        "product": "s3_olci_l2wfr",
        "dask_chunks": {"y": 12160, "x": 12114, "time": "auto"},
        "x": (16.933234243652887, 18.0715952528406),
        "y": (56.31269320304074, 57.38588832427217),
        "time": ["2022-03-15", "2022-03-15"],
        "measurements": ["oa04", "oa08"],
    }
)

geometries = {  # This one added by hand
    "type": "Polygon",
    "coordinates": [
        (16.933234243652887, 56.31269320304074),
        (18.0715952528406, 56.87459549103235),
        (17.875326113325478, 57.36472525905839),
        (17.286518694780106, 57.38588832427217),
        (16.933234243652887, 56.31269320304074),
    ],
}

_filterspatial1_1 = oeop.filter_spatial(
    **{"data": _loadcollection1_0, "geometries": geometries}  # geometry added by hand
)

_saveresult1_2 = oeop.save_result(
    **{"data": _filterspatial1_1, "format": "gtiff", "options": {}}
)
Looking at the processes-python repo, is the above just a consequence of https://github.com/Open-EO/openeo-processes-python/issues/93 ?
Hi @danielFlemstrom. I think that currently the implementation of filter_spatial added by @ValentinaHutter in this commit https://github.com/Open-EO/openeo-processes-python/commit/2c3b9c7af64f60ee9c370a705132d447d2e28e95 does not mask the pixels outside the polygon.
ODC also allows polygons as query items, but it doesn't mask out the pixels outside the polygon(s) either.
In conclusion, the solution would be to implement mask_polygon https://processes.openeo.org/#mask_polygon
Just to add some information, a basic implementation of filter_spatial, which also masks out the data outside the polygons/geometries, would look like this: https://github.com/SARScripts/openeo_odc_driver/blob/a5e36e5ed9b2ad6e81630c5ba2b0705b8ad023b2/openeo_odc_driver.py#L622
if processName == 'filter_spatial':
    source = node.arguments['data']['from_node']
    gdf = gpd.GeoDataFrame.from_features(node.arguments['geometries']['features'])
    ## Currently I suppose the input geometries are in EPSG:4326 and the collection is projected in UTM
    gdf = gdf.set_crs(4326)
    gdf_utm = gdf.to_crs(int(self.partialResults[source].spatial_ref))
    ## Clip the data and keep only the data within the polygons
    self.partialResults[node.id] = self.partialResults[source].rio.clip(gdf_utm.geometry, drop=True)
However, it hasn't been officially implemented like this since it's not highly scalable.
I guess scalable is a desired property here... Maybe the mask_polygon approach could work for us. I guess you mean something like this:
It seems to be implemented already, but it crashes with:
Traceback (most recent call last):
  File "code.py", line 138, in <module>
    _4_2 = oeop.save_result(**{"data": _3_1, "format": "GTIFF"})
  ... lots and lots of stuff :) ...
  File "/user/ubuntu/.venv/lib/python3.8/site-packages/openeo_processes/cubes.py", line 1725, in insides
    val = np.append(val, p.within(polygon))
NameError: free variable 'polygon' referenced before assignment in enclosing scope
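For what it's worth, that NameError is the typical symptom of a nested function referencing a variable that the enclosing function only assigns on some branches. A minimal, purely illustrative sketch of the pattern (hypothetical code, not the actual cubes.py implementation):

import numpy as np
from shapely.geometry import Point, shape

def check_points(points, geometries):
    # 'polygon' only gets assigned in this branch ...
    if isinstance(geometries, dict) and geometries.get("type") == "Polygon":
        polygon = shape(geometries)

    def insides():
        val = np.array([], dtype=bool)
        for p in points:
            # ... so when the branch above is skipped, 'polygon' is an unbound free
            # variable here and Python 3.8 raises exactly the NameError from the
            # traceback above.
            val = np.append(val, p.within(polygon))
        return val

    return insides()

# Passing a FeatureCollection (or any non-Polygon dict) never assigns 'polygon',
# so the nested function crashes with:
# NameError: free variable 'polygon' referenced before assignment in enclosing scope
check_points([Point(17.0, 56.5)], {"type": "FeatureCollection", "features": []})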
Is this process under development by someone, or planned?
Oh nice, I didn't know that mask_polygon was already there. It's either @LukeWeidenwalker or @ValentinaHutter that implemented it, let's wait for a comment from them.
The best, of course, would be if we did not need to specify a bounding box in load_collection (maybe the mask process can derive this anyway? Not sure how Dask handles this).
Specifying a bbox/spatial_extent in load_collection (i.e. in the ODC load process) in theory is not necessary, since the data is lazily loaded. However, generally the process is faster when you define a bbox in load_collection, since calling odc.load() requires less time.
Sample dummy test:
import datacube
from time import time

dc = datacube.Datacube()

# Load without a bounding box
start_time = time()
query = {}
query['product'] = "S2_L2A_T32TPS"
datasets = dc.find_datasets(**query)
query['dask_chunks'] = {"time": -1, "x": 1000, "y": 1000}
data = dc.load(datasets=datasets, **query)
print("Elapsed time no bbox: ", time() - start_time)

# Load with a bounding box
start_time = time()
query = {}
query['product'] = "S2_L2A_T32TPS"
spatial_extent = {'west': 11.345315, 'east': 11.347976, 'south': 46.493768, 'north': 46.495422}
query['y'] = (spatial_extent['north'], spatial_extent['south'])
query['x'] = (spatial_extent['west'], spatial_extent['east'])
datasets = dc.find_datasets(**query)
query['dask_chunks'] = {"time": -1, "x": 1000, "y": 1000}
data = dc.load(datasets=datasets, **query)
print("Elapsed time bbox: ", time() - start_time)
Elapsed time no bbox: 34.441250801086426
Elapsed time bbox: 1.9410793781280518
So I guess it all boils down to getting mask_polygon to work. @LukeWeidenwalker or @ValentinaHutter, is there anything we at RISE can do to assist you?
The NameError @danielFlemstrom reports in mask_polygon is a bug for sure! This process was implemented by @ValentinaHutter (and I'm out-of-office next week), so she's probably best placed to suggest a fix.
However, @ValentinaHutter @clausmichele I haven't worked with the mask_polygon process at all yet, but it seems to me that aggregate_spatial and mask_polygon should internally use the same logic, right?
On the scalability side, this is what we have tried so far:
1) self.partialResults[source].rio.clip(gdf_utm.geometry, drop=True) causes dask to execute all previous processes. Doing this for multiple geometries therefore repeats all that compute each time; we haven't found a way yet to make dask remember these intermediate results without running out of memory on the workers.
2) If you look in aggregate_spatial, you will find an alternative approach that uses rasterio.features.geometry_mask to build a mask in xarray and lazily (ish) applies that to the input raster (see the sketch after this comment). However, the mask is loaded into memory, which is a problem when dealing with large spatial extents coming from load_collection. We haven't figured out how to chunk the mask such that this doesn't happen.
3) Optimising the process graph directly: load_collection with a large spatial extent, followed by some compute (e.g. median), followed by a filter-type operation (filter_bbox, aggregate_spatial, mask_polygon) on a much smaller polygon/subarea means that we're effectively loading/processing all that data, just to throw it away later. We haven't found a way to achieve the same lazily in Dask.
4) We also haven't seen success with directly parametrising the odc.load() call with more than one polygon. In our experiments, this would just cause the bounding box of all these polygons to be loaded, so we ran into the same problem of the mask being huge as in approach 2).
I think I agree that figuring out a scalable (lazily-evaluated) implementation for mask_polygon is exactly what we're lacking to resolve this, so we'd be really keen on getting input on how to approach this!
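For illustration, here is a minimal sketch of the geometry_mask approach mentioned in 2), i.e. rasterising the geometries onto the cube's grid and applying the mask with where(). Function and variable names are made up; this is not the actual aggregate_spatial implementation:

import geopandas as gpd
import rasterio.features
import rioxarray  # noqa: F401 - registers the .rio accessor
import xarray as xr

def build_polygon_mask(data: xr.DataArray, gdf: gpd.GeoDataFrame) -> xr.DataArray:
    """Rasterise the geometries onto the grid of `data`; True inside the polygons."""
    gdf = gdf.to_crs(data.rio.crs)  # assumes the cube has a CRS set via rioxarray
    mask = rasterio.features.geometry_mask(
        gdf.geometry,
        out_shape=(data.sizes["y"], data.sizes["x"]),
        transform=data.rio.transform(),
        invert=True,  # True inside the geometries
    )
    # NOTE: geometry_mask returns a plain numpy array, so the whole mask sits in
    # memory - exactly the limitation described in point 2).
    return xr.DataArray(mask, dims=("y", "x"), coords={"y": data["y"], "x": data["x"]})

def mask_polygon_sketch(data: xr.DataArray, gdf: gpd.GeoDataFrame) -> xr.DataArray:
    # Keep values inside the polygons, NaN elsewhere; dimensions stay unchanged.
    return data.where(build_polygon_mask(data, gdf))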
Thanks @LukeWeidenwalker for the good summary!
- However, @ValentinaHutter @clausmichele I haven't worked with the mask_polygon process at all yet, but it seems to me that aggregate_spatial and mask_polygon should internally use the same logic, right?
Indeed! In aggregate_spatial we mask the data using a geometry and then apply a reducer, whereas mask_polygon would just do the first step for all the geometries and return a datacube with the same dimensions as the input one.
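Roughly speaking, the relationship could be sketched like this (conceptual only, reusing the hypothetical build_polygon_mask() helper from the sketch above):

# Shared masking step (same dimensions as the input cube, NaN outside the geometries):
masked = data.where(build_polygon_mask(data, geometries))

# mask_polygon: stop here and return the masked cube as-is.
mask_polygon_result = masked

# aggregate_spatial: additionally apply a reducer over the spatial dimensions
# (in the real process, per geometry), e.g. a mean.
aggregate_spatial_result = masked.mean(dim=("x", "y"))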
Sorry for the late reply, I've been out of office for the last two days.
So, concerning the implementation of filter_spatial and the problem with openeo-odc: I started working on these some time ago, before we started working on the aggregate_spatial process and before we had a process to load geometries from URLs and allowed the geometries parameter to be a reference, too. When I updated openeo-odc to allow geometries to be URLs or references, I did not realise that, because of this update, dicts do not work anymore. So I will update openeo-odc again to allow geometries to be dicts as well. I also think it makes sense to update the filter_spatial process so that the geometries can come from a URL, too. So, this should not be too hard to fix. Thank you for pointing this out!
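Just to illustrate the idea (a hypothetical sketch, not the actual string_creation.py code), the string creation would then have to dispatch on the type of the geometries argument:

def format_geometries_argument(value):
    # Hypothetical dispatch for rendering the 'geometries' argument into the generated script.
    if isinstance(value, dict) and "from_node" in value:
        return f"_{value['from_node']}"   # reference to another node's output
    if isinstance(value, dict) and "from_parameter" in value:
        return value["from_parameter"]    # reference to a process parameter
    if isinstance(value, str):
        return repr(value)                # URL (or path) to a GeoJSON file
    if isinstance(value, dict):
        return repr(value)                # inline GeoJSON dict - the case that was being dropped
    raise ValueError(f"Unsupported 'geometries' argument: {value!r}")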
The other problem we have is more complex anyway and is just what @LukeWeidenwalker described above. I think that also with the current implementation of filter_spatial as well as mask_polygon we are having scalability issues. I also agree with Lukas that aggregate_spatial and mask_polygon should use the same logic and that filter_spatial should be implemented in a similar way.
I had already thought about aligning those processes, but I think this will only make sense once our work on aggregate_spatial is finished. So we will come back to you once this is done.
That sounds great; we will sit back and wait eagerly then. Please reach out to us if you need any assistance or harsh testing!
@ValentinaHutter How are things going with this issue? Are there any updates on the LoadCollection mapping? The polygon spatial extent is still not included in the generated code. Are you abandoning this repo as well (like processes-python)?
Hello! In https://github.com/Open-EO/openeo-odc/commit/0d59ac3d6c99011d9f41da4a850d1cb7456fee68 I made the change to allow geometries to be dicts.
But yes, we are currently refactoring our backend (https://github.com/Open-EO/openeo-processes-python/issues/198), and we aim to have a solution that does not include strings of python code, so we will not be maintaining the openeo-odc repo anymore.
Using a polygon in load_collection is not supported yet. I opened an issue as a reminder.
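For reference, in the openEO load_collection process definition the spatial_extent can also be a GeoJSON geometry instead of a bounding box, so on the request side such a call would look roughly like the sketch below (reusing the polygon from the example at the top of this issue); the missing part is mapping this case to the ODC load in openeo-odc:

# Hypothetical load_collection node with a GeoJSON spatial_extent:
loadcollection_with_polygon = {
    "process_id": "load_collection",
    "arguments": {
        "bands": ["oa04", "oa08"],
        "id": "s3_olci_l2wfr",
        # a GeoJSON geometry in place of the west/south/east/north bounding box
        "spatial_extent": {
            "type": "Polygon",
            "coordinates": [
                [
                    [16.933234243652887, 56.31269320304074],
                    [18.0715952528406, 56.87459549103235],
                    [17.875326113325478, 57.36472525905839],
                    [17.286518694780106, 57.38588832427217],
                    [16.933234243652887, 56.31269320304074],
                ]
            ],
        },
        "temporal_extent": ["2022-03-15", "2022-03-16"],
    },
}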