This is a blocker as filtering by properties is required in Use Case 1, so at least EODC/TUW should want to get that working. @lforesta @tmistelbauer @claxn
> This can probably be avoided by back-ends by implementing some kind of lazy loading/executing that only loads data whenever really required for processing.
We will definitely have a look at this, although I am not sure how to implement this when working with zipped files (in our archive, all Sentinel-1, 2 and 3 L1C files are .zip-files).
As @tmistelbauer explained, I am also not sure how to handle this properly regarding the EODC back-end. From a user point of view I would expect that a collection is loaded globally if nothing is specified further. The actual access to the data should happen in a lazy way, as mentioned by @m-mohr, which will be quite tough to implement at the back-ends. Filtering along the dimensions (time, space (x,y or lat,lon) and bands (e.g., ['VV', 'VH'] for S1 or ['B1', 'B2', 'B3', ...] for S2)) should then be done using the provided raster-cube functions. I would rather see the `load_collection` as a process to filter the collection by metadata entries (orbit numbers, instrument mode, ...) than restricting any of the dimensions of the data cube. Nevertheless, I would keep the possibility to also specify time, space and bands before loading the data, as @m-mohr has shown in the example above. For the TUW use case 1 I would need the `load_collection` process to filter, for instance, by instrument mode. Therefore, it is essential to have at least the functionality to filter the collection by specific metadata entries.
@tmistelbauer Could you give an estimate? This blocks the release and the deliverables that all partners and the EU are waiting for. I'm not asking about the implementation yet, but about discussing potential solutions to be described in the processes.
How did you handle filtering until now? I guess it would not be very different to also allow filtering while loading the collection. It's just at a different stage and would actually help you to avoid lazy loading. But in general, this issue is targeted at ALL back-ends and I'm awaiting your comments!
For EODC: As we are focusing on STAC metadata, one solution could be to have STAC metadata alongside all ZIP files, build an index for it and query against it. That even allows you to implement your File Download API (Gunnar spoke once about it and seemed quite happy about the approach) with the STAC API. Having that would be great.
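To sketch the idea (purely illustrative; the exact fields depend on which STAC extensions are adopted), a minimal STAC Item stored next to each ZIP file could look roughly like this, and the index/query would then run over the `properties` objects:

```js
// Hypothetical sidecar metadata for one Sentinel-1 scene (all values illustrative):
const item = {
  type: "Feature",
  id: "S1A_IW_GRDH_example_scene",
  geometry: {
    type: "Polygon",
    coordinates: [[[16.1, 47.2], [16.6, 47.2], [16.6, 48.6], [16.1, 48.6], [16.1, 47.2]]]
  },
  properties: {
    datetime: "2017-05-01T05:12:34Z",
    "sat:orbit_state": "ascending",     // assuming the STAC sat extension
    "sar:instrument_mode": "IW"         // assuming the STAC sar extension
  },
  assets: {
    archive: { href: "./S1A_IW_GRDH_example_scene.zip" }  // the existing ZIP file
  }
};
```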
What we currently do in the VITO-GeoPySpark backend is to push down the temporal/spatial filters rather aggressively. This has some implications, as it assumes that `filter_spatial(apply_udf(load_collection(...))) == apply_udf(filter_spatial(load_collection(...)))`. This is however the only way that we can make it work, to avoid loading the whole world. Note that we need to do a similar thing with viewing services: a user can create a WMTS service for the whole data cube without any filtering. In that case, we also push down some of the parameters provided by the client of the service to the place where the collection is loaded.
As for filtering a collection on other properties: we simply do not support that yet. Currently our data cubes consist of pre-ingested data that no longer has any of the original file properties. Do note that we try to avoid mixing data that should not be mixed. (For instance, we have a separate data cube for the Sentinel-1 ascending and descending passes.)
For me, the more specific filtering functions are currently easier to implement, as they are much more limited in scope compared to the property expressions.
@m-mohr currently (in v0.3), filtering is mainly based on metadata entries (we query against csw.eodc.eu), similar to what @claxn stated:
> I would rather see the `load_collection` as a process to filter the collection by metadata entries...
The result of this filter is at the moment not a raster-cube, but a filelist that is then further processed. I don't really see a nice way around this - at least not when working with our L1C archive. If this issue can be solved with STAC, we will definitely have a look into it.
> What we currently do in the VITO-GeoPySpark backend is to push down the temporal/spatial filters rather aggressively. This has some implications, as it assumes that `filter_spatial(apply_udf(load_collection(...))) == apply_udf(filter_spatial(load_collection(...)))`.
A potential problem with this approach: if the user loads a collection and then splits the graph into two parallel branches, each filtering something different (e.g. data for 2016 and for 2018), and then merges them back, that would probably not work, right? That's a somewhat simplified example (see the sketch below) and there are ways around it, but I just want to highlight that this could lead to more problems.
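A rough JS-style sketch of such a graph (process names roughly follow the current proposals, otherwise illustrative):

```js
// The collection is loaded once, two branches filter different years and are
// merged again. Pushing both temporal filters down into the single
// load_collection node can't satisfy both branches with one set of parameters.
var cube = load_collection("Sentinel-2");

var y2016 = filter_temporal(cube, "2016-01-01", "2017-01-01");
var y2018 = filter_temporal(cube, "2018-01-01", "2019-01-01");

var merged = merge_cubes(y2016, y2018);  // e.g. to compare the two years afterwards
```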
> This is however the only way that we can make it work, to avoid loading the whole world.
That's exactly why I opened this issue.
> As for filtering a collection on other properties: we simply do not support that yet.
That's fine, just throw an error...
> For me, the more specific filtering functions are currently easier to implement, as they are much more limited in scope compared to the property expressions.
We basically have two issues here:
> @m-mohr currently (in v0.3), filtering is mainly based on metadata entries (we query against csw.eodc.eu)
The problem is, as described above, that filtering is defined on data cubes and their dimension values, not on metadata. Filtering on metadata is currently somewhat available, but not very well defined, with the problems discussed in this issue.
> The result of this filter is at the moment not a raster-cube, but a filelist that is then further processed.
Internally, right? That's fine, but the user shouldn't notice that.
> If this issue can be solved with STAC, we will definitely have a look into it.
Well, at least it can help to standardize the API.
Now that I've heard that Alex and Jeroen are not available for the telco next week, I don't know when to discuss this without shifting the whole API release by multiple weeks...
cc @tmistelbauer @jdries
The first thing I did now was to add a note to `load_collection`:
> If `expression` is not specified, this implies that the whole data set is expected to be loaded. Due to the immense size of several data sets, this may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extents and, as a consequence, to a manageable size.
> The result of this filter is at the moment not a raster-cube, but a filelist that is then further processed.
> Internally, right? That's fine, but the user shouldn't notice that.
Yes, only internally, the user doesn't notice this at all.
> If this issue can be solved with STAC, we will definitely have a look into it.
> Well, at least it can help to standardize the API.
@m-mohr any help with implementing STAC at EODC would be very welcome.
Please review our proposal at https://open-eo.github.io/openeo-api/v/0.4.0/processreference/#load_collection
You can now specify extents and bands while loading the collection, so filters won't be needed as often as before, leading to lighter process graphs. You can also query against metadata properties more easily than before (related issue: https://github.com/Open-EO/openeo-api/issues/173); see the example in the process description.
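A rough JS-style sketch of what this enables (parameter names follow the linked proposal; the property callback and property names are only illustrative):

```js
// Restrict space, time, bands and metadata properties directly while loading:
var cube = load_collection({
  id: "Sentinel-1",
  spatial_extent: {west: 16.1, south: 47.2, east: 16.6, north: 48.6},
  temporal_extent: ["2017-01-01", "2017-02-01"],
  bands: ["VV", "VH"],
  properties: {
    // e.g. only IW instrument mode; the property name depends on the collection metadata
    "sar:instrument_mode": function(value) { return eq(value, "IW"); }
  }
});
```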
@tmistelbauer We can discuss that once we have the new API version out. 👍
@m-mohr I really like this approach. It also makes it easier for us to implement limits for spatial and temporal queries, e.g. by raising errors/warnings if the bounding box is too big.
Thanks!
Closing for now, but feel free to add your comments.
@jdries For the problems with the web services, please see https://github.com/Open-EO/openeo-api/issues/172
Background
We added the `expression` parameter to `load_collection` at the very last minute to reduce the amount of data loaded from a collection. It is not very well thought out yet. In previous versions we simply applied filters to the full data cube, which is still possible. Below are two examples, both filtering temporally and spatially: one approach uses the `expression` parameter in `load_collection`, the other uses `filter_*` processes. The examples are all in JS style.

load_collection expression
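A rough, illustrative sketch of the idea (not the original example; process and parameter names are made up here):

```js
// The expression callback gets something identifying the entity (an id here) and
// has to build all conditions on metadata properties by hand, because the
// filter_* convenience processes are only defined on data cubes:
var cube = load_collection({
  name: "Sentinel-2",
  expression: function(id) {
    return and([
      gte(property(id, "datetime"), "2017-01-01T00:00:00Z"),
      lt(property(id, "datetime"), "2017-02-01T00:00:00Z"),
      // hypothetical helper for the spatial condition:
      intersects(property(id, "extent"), {west: 16.1, south: 47.2, east: 16.6, north: 48.6})
    ]);
  }
});
```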
This approach is pretty cumbersome as the filter processes are only defined on data cubes and therefore these convenience processes can't be used directly. Nevertheless, in theory this should usually be faster than the filter approach as it doesn't need to load ALL data (see below).
filter approach
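Again a rough JS-style sketch (process names roughly follow the existing `filter_*` processes):

```js
// Conceptually load the full collection first ...
var cube = load_collection("Sentinel-2");
// ... and then throw most of it away again:
cube = filter_temporal(cube, "2017-01-01", "2017-02-01");
cube = filter_bbox(cube, {west: 16.1, south: 47.2, east: 16.6, north: 48.6});
```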
What bugs me with this approach is that the first process basically tells the back-end to load a data cube on native resolution for the whole world and all times. So everything that is available, just to discard most of the data in the next two processes. This can probably be avoided by back-ends by implementing some kind of lazy loading/executing that only loads data whenever really required for processing.
Next steps?
In principle these two approaches should give the same (or similar?) results, but the performance could hugely differ. Also, the implementation effort for the two approaches doesn't seem to match very well.
So how do back-ends and users think about this? How could this be improved?
Lastly (but very importantly), we need to figure out how the expression could actually work. Passing the ID to the expression callback won't work very well, I guess. So we may additionally pass a proprietary index so that the back-end can actually identify which entity to get a property value for (this needs changes in `property`). Or we pass a whole structure of metadata to the expression callback, but in that case we need a new process to get a property by name from an object (see the sketch below).

cc @edzer @aljacob @jdries @mkadunc @lforesta
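To make the two callback variants a bit more concrete (purely hypothetical signatures and property names):

```js
// Option 1: the callback gets an opaque index/handle and property() resolves
// values through it (this would require the mentioned change to property):
var byIndex = function(index) {
  return eq(property(index, "platform"), "Sentinel-1A");
};

// Option 2: the callback gets the whole metadata structure and a new process
// (called get_property here, purely for illustration) reads a value by name:
var byMetadata = function(metadata) {
  return eq(get_property(metadata, "platform"), "Sentinel-1A");
};
```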
Related issue: https://github.com/radiantearth/stac-spec/issues/413