This is a blocker as filtering by properties is required in Use Case 1, so at least EODC/TUW should want to get that working. @lforesta @tmistelbauer @claxn
> This can probably be avoided by back-ends by implementing some kind of lazy loading/executing that only loads data whenever really required for processing.
We will definitely have a look at this, although I am not sure how to implement this when working with zipped files (in our archive, all Sentinel-1, 2 and 3 L1C files are .zip-files).
As @tmistelbauer explained, I am also not sure how to handle this properly regarding the EODC back-end. From a user point of view I would expect that a collection is loaded globally if nothing is specified further. The actual access to the data should happen in a lazy way, as mentioned by @m-mohr, which will be quite tough to implement at the back-ends. Filtering along the dimensions (time, space (x,y or lat,lon) and bands (e.g., ['VV', 'VH'] for S1 or ['B1', 'B2', 'B3', ...] for S2)) should then be done using the provided raster-cube functions. I would rather see the `load_collection` as a process to filter the collection by metadata entries (orbit numbers, instrument mode, ...) than restricting any of the dimensions of the data cube. Nevertheless, I would keep the possibility to also specify time, space and bands before loading the data, as @m-mohr has shown in the example above. For the TUW use case 1 I would need the `load_collection` process to filter, for instance, by instrument mode. Therefore, it is essential to have at least the functionality to filter the collection by specific metadata entries.
@tmistelbauer Could you give an estimate? This blocks the release and the deliverables that all partners and the EU are waiting for. I'm not asking about the implementation yet, but about discussing potential solutions to be described in the processes.
How did you handle filtering until now? I guess it would not be very different to also allow filtering while loading the collection. It's just at a different stage and would actually help you to avoid lazy loading. But in general, this issue is targeted at ALL back-ends and I'm awaiting your comments!
For EODC: As we are focusing on STAC metadata, one solution could be to have STAC metadata alongside all ZIP files, build an index for it and query against it. That even allows you to implement your File Download API (Gunnar spoke once about it and seemed quite happy about the approach) with the STAC API. Having that would be great.
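To sketch the idea (purely illustrative; the exact fields depend on which STAC extensions are adopted), a minimal STAC Item stored next to each ZIP file could look roughly like this, and the index/query would then run over the `properties` objects:

```js
// Hypothetical sidecar metadata for one Sentinel-1 scene (all values illustrative):
const item = {
  type: "Feature",
  id: "S1A_IW_GRDH_example_scene",
  geometry: {
    type: "Polygon",
    coordinates: [[[16.1, 47.2], [16.6, 47.2], [16.6, 48.6], [16.1, 48.6], [16.1, 47.2]]]
  },
  properties: {
    datetime: "2017-05-01T05:12:34Z",
    "sat:orbit_state": "ascending",     // assuming the STAC sat extension
    "sar:instrument_mode": "IW"         // assuming the STAC sar extension
  },
  assets: {
    archive: { href: "./S1A_IW_GRDH_example_scene.zip" }  // the existing ZIP file
  }
};
```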
What we currently do in the VITO-GeoPySpark backend is to push down the temporal/spatial filters rather aggressively. This has some implications, as it assumes that `filter_spatial(apply_udf(load_collection(...))) == apply_udf(filter_spatial(load_collection(...)))`. This is however the only way that we can make it work, to avoid loading the whole world. Note that we need to do a similar thing with viewing services: a user can create a WMTS service for the whole data cube without any filtering. In that case, we also push down some of the parameters provided by the client of the service to the place where the collection is loaded.
As for filtering a collection on other properties: we simply do not support that yet. Currently our data cubes consist of pre-ingested data that no longer has any of the original file properties. Do note that we try to avoid mixing data that should not be mixed. (For instance, we have a separate data cube for the Sentinel-1 ascending and descending passes.)
For me, the more specific filtering functions are currently easier to implement, as they are much more limited in scope compared to the property expressions.
@m-mohr currently (in v0.3), filtering is mainly based on metadata entries (we query against csw.eodc.eu), similar to what @claxn stated:
> I would rather see the `load_collection` as a process to filter the collection by metadata entries...
The result of this filter is at the moment not a raster-cube, but a filelist that is then further processed. I don't really see a nice way around this - at least not when working with our L1C archive. If this issue can be solved with STAC, we will definitely have a look into it.
> What we currently do in the VITO-GeoPySpark backend is to push down the temporal/spatial filters rather aggressively. This has some implications, as it assumes that `filter_spatial(apply_udf(load_collection(...))) == apply_udf(filter_spatial(load_collection(...)))`.
A potential problem with this approach: if the user loads a collection and then splits the graph into two parallel branches, each filtering something different (e.g. data for 2016 and for 2018), and then merges them back, that would probably not work, right? That's a somewhat simplified example (see the sketch below) and there are ways around it, but I just want to highlight that this could lead to more problems.
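A rough JS-style sketch of such a graph (process names roughly follow the current proposals, otherwise illustrative):

```js
// The collection is loaded once, two branches filter different years and are
// merged again. Pushing both temporal filters down into the single
// load_collection node can't satisfy both branches with one set of parameters.
var cube = load_collection("Sentinel-2");

var y2016 = filter_temporal(cube, "2016-01-01", "2017-01-01");
var y2018 = filter_temporal(cube, "2018-01-01", "2019-01-01");

var merged = merge_cubes(y2016, y2018);  // e.g. to compare the two years afterwards
```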
> This is however the only way that we can make it work, to avoid loading the whole world.
That's exactly why I opened this issue.
> As for filtering a collection on other properties: we simply do not support that yet.
That's fine, just throw an error...
> For me, the more specific filtering functions are currently easier to implement, as they are much more limited in scope compared to the property expressions.
We basically have two issues here:
> @m-mohr currently (in v0.3), filtering is mainly based on metadata entries (we query against csw.eodc.eu)
The problem is, as described above, that filtering is defined on data cubes and their dimension values, not on metadata. Filtering on metadata is currently somewhat available, but not very well defined, with the problems discussed in this issue.
> The result of this filter is at the moment not a raster-cube, but a filelist that is then further processed.
Internally, right? That's fine, but the user shouldn't notice that.
> If this issue can be solved with STAC, we will definitely have a look into it.
Well, at least it can help to standardize the API.
Now that I've heard that Alex and Jeroen are not available for the telco next week, I don't know when to discuss this without shifting the whole API release by multiple weeks...
cc @tmistelbauer @jdries
The first thing I did now was to add a note to `load_collection`:
> If `expression` is not specified, this implies that the whole data set is expected to be loaded. Due to the immense size of several data sets, this may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the pixel values should be processed only after the data has been limited to the required extents and, as a consequence, to a manageable size.
> The result of this filter is at the moment not a raster-cube, but a filelist that is then further processed.
> Internally, right? That's fine, but the user shouldn't notice that.
Yes, only internally, the user doesn't notice this at all.
> If this issue can be solved with STAC, we will definitely have a look into it.
> Well, at least it can help to standardize the API.
@m-mohr any help with implementing STAC at EODC would be very welcome.
Please review our proposal at https://open-eo.github.io/openeo-api/v/0.4.0/processreference/#load_collection
You can now specify extents and bands while loading the collection, so filters won't be needed as often as before, leading to lighter process graphs. You can also query against metadata properties more easily than before (related issue: https://github.com/Open-EO/openeo-api/issues/173); see the example in the process description.
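A rough JS-style sketch of what this enables (parameter names follow the linked proposal; the property callback and property names are only illustrative):

```js
// Restrict space, time, bands and metadata properties directly while loading:
var cube = load_collection({
  id: "Sentinel-1",
  spatial_extent: {west: 16.1, south: 47.2, east: 16.6, north: 48.6},
  temporal_extent: ["2017-01-01", "2017-02-01"],
  bands: ["VV", "VH"],
  properties: {
    // e.g. only IW instrument mode; the property name depends on the collection metadata
    "sar:instrument_mode": function(value) { return eq(value, "IW"); }
  }
});
```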
@tmistelbauer We can discuss that once we have the new API version out. 👍
@m-mohr I really like this approach. It also makes it easier for us to implement limits for spatial and temporal queries, e.g. by raising errors/warnings if the bounding box is too big.
Thanks!
Closing for now, but feel free to add your comments.
@jdries For the problems with the web services, please see https://github.com/Open-EO/openeo-api/issues/172
Background
We added the `expression` parameter to `load_collection` at the very last minute to reduce the amount of data loaded from a collection. It is not very well thought out yet. In previous versions we simply applied filters to the full data cube, which is still possible. Below are two examples, both filtering temporally and spatially: one approach uses the `expression` parameter in `load_collection`, the other uses `filter_*` processes. The examples are all in JS style.

load_collection expression
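A rough, illustrative sketch of the idea (not the original example; process and parameter names are made up here):

```js
// The expression callback gets something identifying the entity (an id here) and
// has to build all conditions on metadata properties by hand, because the
// filter_* convenience processes are only defined on data cubes:
var cube = load_collection({
  name: "Sentinel-2",
  expression: function(id) {
    return and([
      gte(property(id, "datetime"), "2017-01-01T00:00:00Z"),
      lt(property(id, "datetime"), "2017-02-01T00:00:00Z"),
      // hypothetical helper for the spatial condition:
      intersects(property(id, "extent"), {west: 16.1, south: 47.2, east: 16.6, north: 48.6})
    ]);
  }
});
```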
This approach is pretty cumbersome as the filter processes are only defined on data cubes and therefore these convenience processes can't be used directly. Nevertheless, in theory this should usually be faster than the filter approach as it doesn't need to load ALL data (see below).
filter approach
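Again a rough JS-style sketch (process names roughly follow the existing `filter_*` processes):

```js
// Conceptually load the full collection first ...
var cube = load_collection("Sentinel-2");
// ... and then throw most of it away again:
cube = filter_temporal(cube, "2017-01-01", "2017-02-01");
cube = filter_bbox(cube, {west: 16.1, south: 47.2, east: 16.6, north: 48.6});
```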
What bugs me with this approach is that the first process basically tells the back-end to load a data cube on native resolution for the whole world and all times. So everything that is available, just to discard most of the data in the next two processes. This can probably be avoided by back-ends by implementing some kind of lazy loading/executing that only loads data whenever really required for processing.
Next steps?
In principle these two approaches should give the same (or similar?) results, but the performance could hugely differ. Also, the implementation effort for the two approaches doesn't seem to match very well.
So how do back-ends and users think about this? How could this be improved?
Lastly (but very importantly), we need to figure out how the expression could actually work. Passing the ID to the expression callback won't work very well, I guess. So we may additionally pass a proprietary index so that the back-end can actually identify which entity to get a property value for (this needs changes in `property`). Or we pass a whole structure of metadata to the expression callback, but in that case we need a new process to get a property by name from an object (see the sketch below).

cc @edzer @aljacob @jdries @mkadunc @lforesta
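To make the two callback variants a bit more concrete (purely hypothetical signatures and property names):

```js
// Option 1: the callback gets an opaque index/handle and property() resolves
// values through it (this would require the mentioned change to property):
var byIndex = function(index) {
  return eq(property(index, "platform"), "Sentinel-1A");
};

// Option 2: the callback gets the whole metadata structure and a new process
// (called get_property here, purely for illustration) reads a value by name:
var byMetadata = function(metadata) {
  return eq(get_property(metadata, "platform"), "Sentinel-1A");
};
```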
Related issue: https://github.com/radiantearth/stac-spec/issues/413