CRS as dimension in data cubes

kempenep commented 4 years ago

@m-mohr I think we already had several discussions on coordinate reference system (crs). In my opinion, there should be a crs parameter in the load_collection. Not only for selecting the region of interest as we do have now, but for defining the target reference system of the resulting cube. Once input images with different crs are selected and need to be combined, we need to be able to re-project them in a single cube. I think it is also best to define the crs as soon as possible in the work flow, which is in the load_collection process, I guess.

m-mohr commented 4 years ago

This issue has a long history, see issues https://github.com/Open-EO/openeo-api/issues/4, https://github.com/Open-EO/openeo-api/issues/28, https://github.com/Open-EO/openeo-api/issues/29 and https://github.com/Open-EO/openeo-api/issues/89. (edit: fixed links to issues)

Currently, load_collection already covers filter_bbox, filter_temporal, filter_bands and filter. In addition, it would then also cover resample_spatial with additional 4 parameters. Reason is that it's not just the CRS that needs to match, but also the resolution (and maybe more things?). It is likely that we overload load_collection and it's getting to complicated.

So potential solutions are:

Add the 4 additional parameters from resample_spatial to load_collection
Expose collections with just a single CRS and/or expose collections matching all potential CRS for the collection.
Somehow get this information from the process graph (i.e. resample or save_result calls, but this might be ambiguous).
...?

I must admit I don't like any of the options yet. Ideas? Thoughts?

kempenep commented 4 years ago

I believe the spatial (and temporal) resolution need to be defined before actually loading a collection. Even a single collection such as Sentinel-2 can be stored in multiple projection (UTM) zones. Selecting a bounding box on the border of different UTM zones is ambiguous if not defining a target projection. I would be in favor of solution 1 (but I think we should also add the parameters for temporal resolution)

m-mohr commented 4 years ago

We need to decide on the call on Thursday whether a process with 10+ parameters is desirable and what alternatives we have. I just covered spatial above, but temporal is indeed also an issue. Also, does this also apply to a process such as load_results or load_user_data (see Open-EO/openeo-processes#83)?

lforesta commented 4 years ago

I am not in favor of a 10+ parameters process in general.

Resampling/reprojecting "early" may not always be the best option, since one may end up doing that operation for several hundreds or maybe thousands of timestamps (rasters). Instead the reprojection/downsampling can be potentially done at the end, it should be the user deciding when/if to do that. And I think the user now has all the capabilities to work with the projecton and spatial/temporal sampling that they want. If the user needs to have data in projection EPSG:xxxx, they simply use resample_spatial and/or resample_temporal just after load_collection and before applying any other process. Are there cases which are not covered by this approach?

kempenep commented 4 years ago

If load_collection results in a (non-virtual) data cube, some interpolation/projection has been involved already. A subsequent resample_spatial or temporal implies another resampling step. To make things worse, an additional parameter would be the data type. Some back-ends might use Float32/64 as default, others adopt the native data type as stored on file.

aljacob commented 4 years ago

In my opinion, collections should just have a single projection. I still think of the more scientific users, when thinking of designing this and they should be themselves able to decide if the want to do re-projection early or late. I would keep that separate that from load collection.

Not sure if we can define a construct, where somebody can prior to load collection define something like a target grid and hand that over to load collection.

kempenep commented 4 years ago

A construct prior to load_collection that defines the target grid would be an option indeed. The same could apply for the target resolution in both time and space.

m-mohr commented 4 years ago

What's the difference of such a construct compared to additional parameters in load_collection? I think it would make things even more complex than the additional parameters...

I'm thinking it may even be required to separate scientific users and "i don't care about projection non-remote senser" users and cater them with different collections. There could be the "EURAC" approach with several collections based on projection, resolution, etc. and then a separate "ready-to-use" collection as you get them usually in GEE for example. In this way it's up to the back-end to provide the best solution for their target audience and not really an API/processes issue any longer.

jdries commented 4 years ago

I think I agree with @lforesta that we can already do a lot with the current resample process. What we do additionally in our backend, is trying to 'push-down' certain parameters from higher-level processes, or deriving them from the requested type of output. For instance, when creating a viewing service in web mercator, we know we can start with loading all data in web mercator. Similarly, the bounding box could be limited to a given UTM zone, allowing you to decide to work in that zone. Or more explicitly, a resample operation done at some point in the graph could also be applied in the beginning, especially when it comes right after the 'load_collection' step. So in my opinion, it is up to the backend how to optimally execute the graph, as long as the result is correct.

lforesta commented 4 years ago

In my opinion, collections should just have a single projection. @aljacob

We can't enforce this in any way, a back-end will store data with the projection(s) they prefer for their purpose. In our case, for S2 L1C data, we don't reproject TBs of data to a common projection, but we do for some higher level data. This should be clear in the collection's description though, so the user will have this piece of information and will decide if and when to reproject/resample in the pg.

m-mohr commented 4 years ago

Well, you could enforce it by splitting into as many collections as there are projections. That's what EURAC is doing currently. That's probably not very user-friendly for people who are coming from GEE and are used fully reprojected datasets with a single projection, but scientific users are probably happy about having full control.

The underlying issue is: If you have for example the full Sentinel 2 archive with it's dozens of UTM projections and you load that into a data cube that is expected to have a single projection. How to load that data cube? Currently, for openEO it's either the "EURAC" or the "GEE" way. The question is whether there's a better alternative, e.g. the 10+ parameter load_collection process or ...?

jdries commented 4 years ago

Dealing with the UTM zones is implementation specific imo, for now a lot of backends indeed rely on reprojecting to an intermediate projection, but nothing is stopping them from doing something more advanced. If we force the user to specify a projection, than this will also block backends from being more innovative/creative when it comes to projection handling, as the API will inforce reprojecting.

edzer commented 4 years ago

From the documentation: Loads a collection from the current back-end by its id and returns it as processable data cube. I understand this like this:

image collections may consist of images registered at different CRS (like Sentinel 2 over different UTM zones)
load_collection returns a data cube, which is still only a view on the data (it doesn't need to write it, so to speak, as a new GeoTIFF or so at load time)
Since (spatial) dimensions of a cube require a unique mapping from pixel space to coordinates, this cube needs to have a single CRS
at time of load_collection, a resolution is not specified since we may want to browse results using a WMTS, and compute pixels at increasing/decreasing resolution and changing extent on-the-fly
at time of downloading result, a resolution is required
when CRS of cube and image collection are identical (like Jeroen's example of working in a single UTM zone) the native resolution of the image collection should be used when no resolution is specified (no resampling).

m-mohr commented 4 years ago

@edzer All true, but you did not cover the case we are discussion here: What happens if the image collection may consist of images registered at different CRS and you load it into a single CRS data cube? That is somewhat undefined at the moment.

We already had this on our agenda several times, but at some point it got lost, see https://github.com/Open-EO/openeo-api/issues/4, https://github.com/Open-EO/openeo-api/issues/28, https://github.com/Open-EO/openeo-api/issues/29 and https://github.com/Open-EO/openeo-api/issues/89

kempenep commented 4 years ago

Since a cube needs to have a single CRS, I am in favor of letting the user decide the CRS when (or before by setting some variable) creating a cube. I agree there can be some "smart" defaults that are well documented, e.g., 3857 for WMTS, or keeping the original projection of a tile as stored in the resp. back-end when a requested region of interest does not extend more than one projection zone... We have to be aware though this can lead to non-transparent differences between the results of the various back-end.

aljacob commented 4 years ago

When we had the meeting with guys from OGC in London, for the API Hackathon, this issue was also discussed. You could also define a collection of collections, covering the case where different sub-collections have different projections and tackle this using the virtual cube design discussed by @edzer and @kempenep, having a default mechanism of the backend describing this with a common global CRS like EPSG:3857 or EPSG:4326. How this virtual cube is created is of course up to the backend, and should allow for innovative solutions as proposed by @jdries .

m-mohr commented 4 years ago

@aljacob How is that different from what we have now? It seems to be the same, except that we hide sub-collections and don't have a way to specify the "global" CRS (which is what we are discussing here). "How this virtual cube is created is of course up to the backend" => That is the current implicit behaviour of load_collection.

edzer commented 4 years ago

Which CRS: what is wrong with taking the CRS of the spatial_extent as the CRS for the cube returned?

"How this virtual cube is created is of course up to the backend" : only to the extent that when you download the result, you'd like to have an unambiguous (reproducible) result.

kempenep commented 4 years ago

@edzer I support this idea to take the CRS of the spatial_extent as the CRS for the cube returned. This allows the user to define the CRS and avoids adding a new parameter.

aljacob commented 4 years ago

The spatial extent is always given in EPSG:4326 aka wgs84 according to stac. The collection-id endpoint however has as one field also the native projection of the data cube (eo:epsg). So @edzer using the spatial_extent as it is implemented now would limit us to work in that one projection, which might not always be desirable. @m-mohr it is not different, was just to illustrate how it could be setup in a specific back-end. I also agree, that we can do all of this already with the existing processes. Using load_collection, resample_cube_spatial and merge_cubes, should give a user every possibility to fine-control this.

m-mohr commented 4 years ago

The spatial extent is always given in EPSG:4326 aka wgs84 according to stac.

This is a misunderstanding. As far as I understood @edzer, he meant the spatial extent specified in load_collection.

I guess going the route via the spatial_extent CRS could be an option, although it has two drawbacks:

It may need some dedicated work and good explanations in clients. For example, the Web Editor specifies the CRS always as 3857 as it's the CRS of the mapping library you choose the bbox with.
Also, a GeoJSON geometry is always WGS84 and as such you wouldn't have a way to specify the CRS.

Therefore it would be better to specify the CRS separate from the spatial_extent, I think.

In general, there are three other options in resample_spatial (resolution, method, align). Do we need any of them in load_collection? I guess we don't need to think about resolution and then method (and align?) are not needed, right?

The collection-id endpoint however has as one field also the native projection of the data cube (eo:epsg).

Side note: This changes in STAC 0.9 an you can specify multiple projections per collection.

So @edzer using the spatial_extent as it is implemented now would limit us to work in that one projection, which might not always be desirable.

The data cube always has one projection. If you need multiple projections in a workflow, create different data cubes.

@kempenep

If load_collection results in a (non-virtual) data cube, some interpolation/projection has been involved already. A subsequent resample_spatial or temporal implies another resampling step.

I would think we don't need any dedicated temporal resampling in load_collection?!

To make things worse, an additional parameter would be the data type. Some back-ends might use Float32/64 as default, others adopt the native data type as stored on file.

I don't think we should expose this to the user. If there's really a conflict then it's up to the back-end to decide. At the moment we don't even expose the internal data type to a user anyway.

m-mohr commented 4 years ago

Okay, I created PR Open-EO/openeo-processes#102 so that we have an actual proposal to discuss and improve.

lforesta commented 4 years ago

Telco discussion -> no conclusion reached.

Main question to answer is, do we allow an openEO datacube to have multiple CRS associated with it? Currently, this is implicitly the case.

Currently, the user can force a specific projection by using resample_spatial (e.g. just after load_collection). But this is not good enough for some back-ends which may load data to memory directly with the load_collection process (hence the need for CRS to be a parameter of load_collection).

m-mohr commented 4 years ago

Currently it is not expected to have multiple CRS per data cube (although I'm not sure this is documented somewhere), which leads to undefined behavior if the we don't implement Open-EO/openeo-processes#102 or an alternative approach.

jdries commented 4 years ago

The point made during the telco is that in a lot of cases the user doesn't really care about the CRS, even if it is multi-crs. If the user wants a specific single crs, than he should enforce it using resample_spatial. If the user doesn't enforce it, the backend will choose the best option when needed. The most important result of this is that a generated raster (e.g. geotiff) can have a different crs depending on the backend. This was discussed, but it wasn't clear why this would be a problem. (Given that the option to enforce an output crs is there.)

m-mohr commented 4 years ago

The crs field is not required for single-CRS collections. If he doesn't care, he doesn't specify it, although for multi-CRS collections it should throw an error as (at the moment) the data cube model in openEO bases on a single CRS to make things comparable and reproducible. Just letting the back-end choose the best option is by no means comparable or reproducible.

jdries commented 4 years ago

Users that:

want to compare results between backends
AND don't know how to compare results with a different CRS will have to specify a resample operation, that's not too hard.
There's a lot of use cases that don't fall in this category, for them things will be even easier and more effective in terms of cost and performance because the backend is free to optimize. Unfortunately, Sentinel-2 is a multi-crs collection for a lot of backends, I don't want to throw an error if a user happens to forget the CRS, or doesn't care about it.

jdries commented 4 years ago

To come to a conclusion on this one, I would say that there is at least more backend implementation work/experiments needed, especially in the area of multi-crs collections. On VITO side, this is normally planned for the next months. I don't think this issue is blocking anything so I propose to put it on-hold until there are new arguments pro or con.

m-mohr commented 4 years ago

We discussed this a little more here and in principle we may also need a target resolution parameter in addition to the target CRS.

m-mohr commented 4 years ago

Telco: PR Open-EO/openeo-processes#102 has been closed in favor of a new approach. We are trying to make CRS a dimension as described here: openEO_dimensions.pdf

ToDos:

Edzer is going to update the glossary accordingly.
I'm going to check the API for potential changes.
Processes will be updated only whenever we find issues

edzer commented 4 years ago

I added a section here; for @jdries to review - is this comprehensible?

jdries commented 4 years ago

Yes, I think the problem and solution is clearly explained. More details on actual impact on the implementation might be added when we get to that point.

m-mohr commented 4 years ago

Regarding the last part:

The "reduction" (removal) of the crs dimension that is meaningful involves the resampling/warping of all sub-cubes for the crs dimension to a single, common target coordinate reference system.

I'm thinking 1) how this translates into a process graph and 2) whether this implies any changes to the processes then? It probably needs clarifications in the spatial resample processes. There it doesn't talk about (removal of) dimensions at all.

m-mohr commented 3 years ago

Moving to https://github.com/Open-EO/openeo-processes/issues/251

Open-EO / openeo.org

CRS as dimension in data cubes #9