Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 17 forks source link

aggregate_spatial: crs of data and geometry mismatch undefined #499

Open ValentinaHutter opened 8 months ago

ValentinaHutter commented 8 months ago

aggregate_spatial: crs of data and geometry mismatch undefined

Undefined crs handling: In aggregate_spatial, the input parameters data and geometries specify a raster datacube and a vector datacube, the output is a vector datacube. A user might define the geometries in the wgs84 CRS, while the data coming from openeo must not necessarily be in wgs84. There are three options to handle this:

  1. The geometries get reprojected to the data CRS and the resulting vector datacube has the CRS of the input data.
  2. The data gets reprojected to the geometries CRS and the resulting vector datacube has the CRS of the input geometries.
  3. An error is thrown, as the user should reproject data or geometries before.

Wouldn't it make sense to define or at least note it on the specification level?

Proposed solution: We currently use Nr. 1, as it means using the CRS from the first input parameter.

Additional context: The process is currently being tested on various input datasets, which differ in their CRS.

Backends normally handle the data CRS themselves, but for using this process, it can easily happen that there are two CRS options, which are both equally valid.

m-mohr commented 8 months ago

Option 1 was the intended behavior. We should clarify this indeed. PRs are welcome.

soxofaan commented 7 months ago
  1. The geometries get reprojected to the data CRS and the resulting vector datacube has the CRS of the input data.
  2. The data gets reprojected to the geometries CRS and the resulting vector datacube has the CRS of the input geometries.

I think it should be a mix of 1 and 2.

While one could argue that in most use cases it's practically not very relevant which one you reproject to the other, I guess the most obvious choice for backend implementations is to reproject the geometries to the raster data CRS (e.g. geometries are typically just EPSG4326, and relatively cheap to reproject to the native raster data CRS, which you want to stay in for processing efficiency).

On the other hand, I think the user expectation is to get the same output geometries (in same CRS) as in the input vector cube. This might be even vital for ML application where you want to "join" the aggregation output data with target variables that are associated with the original input geometries.

So:

1+2: The geometries are reprojected to the data CRS and resulting vector datacube has the CRS of the input geometries

ValentinaHutter commented 7 months ago

Thanks for the input - I agree, it makes sense to have both options! I think it is fine, to leave this decision to the individual backend implementations - but this should really be documented somewhere for new developers. Does it make sense to include this in the specification? Maybe, we could add a section to the specification, such as "note for developers" or simply "note"?

soxofaan commented 7 months ago

Maybe my explanation was a bit messy, I didn't mean to have/keep both options. I meant:

That second bullet point should certainly be documented in the process spec (as it is both relevant to backend dev and end user). The first bullet point could be a recommendation, but I'm not sure it is a vital part to be documented

clausmichele commented 4 months ago

If we agree on what @soxofaan proposed, we have to make sure to reuse the input geometries without reprojecting them twice, since it could lead to differences due to floating point rounding (it just append to me!).

m-mohr commented 4 months ago

Yeah, thinking more about it the important part is probably that you get the geometries as provided, i.e. in the CRS of the source geometry without changes to the coordinates etc.

We should probably describe that and then also name how this is done in the background. Maybe it's actually simpler to reproject only the source data to the geometry CRS?

soxofaan commented 4 months ago

The key point is indeed that the output geometries are exactly the same as the input geometries, without any (back and forth) projections or CRS changes in between. This probably needs some clarification in the docs.

How the aggregations are practically calculated is more an implementation detail for the backend implementer I think and I'm not sure the typical user will care about the difference between:

  1. reproject the geometries to the raster data CRS and aggregate.
    • This is probably the most straightforward and cheapest approach in most, if not all, practical use cases.
    • This is the approach we currently take in the geopyspark backend FYI.
  2. reproject the raster data to the geometry CRS and aggregate.
    • maybe there are reasons this is scientifically/mathematically more correct?
    • but what if there is no reasonable (implied) spatial resolution associated with the geometry CRS? If your geometry is in a default EPSG 4326, you don't want to reproject your 10 meter raster data pixels to 1 degree pixels, right?

Regardless, I'm not sure we should over-specify this aspect. However putting a recommendation (for option 1 I think) for the sake of reproducible results could be an option

clausmichele commented 4 months ago

The process definition should also be fixed since now it's mentioned to use dimension of type geometries, but from the parameter description below and from the official API docs it should be geometry instead: https://processes.openeo.org/#aggregate_spatial https://api.openeo.org/#tag/EO-Data-Discovery/operation/describe-collection

clausmichele commented 4 months ago
1. reproject the geometries to the raster data CRS and aggregate.

   * This is probably the most straightforward and cheapest approach in most, if not all, practical use cases.
   * This is the approach we currently take in the geopyspark backend FYI.

@soxofaan which CRS should we use in the output vector cube in your opinion? The raster cube projection or the vector cube projection?

soxofaan commented 4 months ago

In my opinion the output vector cube should have the exactly geometry dimension as a whole (so same geometries, same order, same CRS) as the input vector cube.

This is what I intended to say with:

The key point is indeed that the output geometries are exactly the same as the input geometries, without any (back and forth) projections or CRS changes in between. This probably needs some clarification in the docs.