Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0

aggregate_spatial: crs of data and geometry mismatch undefined #499

Open ValentinaHutter opened 3 months ago

ValentinaHutter commented 3 months ago


Undefined CRS handling: In aggregate_spatial, the input parameters data and geometries take a raster datacube and a vector datacube, respectively; the output is a vector datacube. A user might define the geometries in WGS84, while the data coming from openEO is not necessarily in WGS84. There are three options to handle this:

  1. The geometries get reprojected to the data CRS and the resulting vector datacube has the CRS of the input data.
  2. The data gets reprojected to the geometries CRS and the resulting vector datacube has the CRS of the input geometries.
  3. An error is thrown, as the user should reproject the data or the geometries beforehand.

Wouldn't it make sense to define or at least note it on the specification level?

Proposed solution: We currently use option 1, as it means using the CRS from the first input parameter.

Additional context: The process is currently being tested on various input datasets, which differ in their CRS.

Backends normally handle the data CRS themselves, but for using this process, it can easily happen that there are two CRS options, which are both equally valid.

m-mohr commented 3 months ago

Option 1 was the intended behavior. We should clarify this indeed. PRs are welcome.

soxofaan commented 3 months ago
  1. The geometries get reprojected to the data CRS and the resulting vector datacube has the CRS of the input data.
  2. The data gets reprojected to the geometries CRS and the resulting vector datacube has the CRS of the input geometries.

I think it should be a mix of 1 and 2.

While one could argue that in most use cases it's practically not very relevant which one you reproject to the other, I guess the most obvious choice for backend implementations is to reproject the geometries to the raster data CRS (e.g. geometries are typically just EPSG:4326 and relatively cheap to reproject to the native raster data CRS, which you want to stay in for processing efficiency).

On the other hand, I think the user expectation is to get the same output geometries (in the same CRS) as in the input vector cube. This might even be vital for ML applications where you want to "join" the aggregation output data with target variables that are associated with the original input geometries.

So:

1+2: The geometries are reprojected to the data CRS and the resulting vector datacube has the CRS of the input geometries.
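
A minimal sketch of that "1+2" behaviour, using only the standard library. Everything here is an assumption for illustration: `aggregate_spatial_sketch`, the dict layouts, and the restriction to EPSG:4326 bounding boxes over an EPSG:3857 (spherical Web Mercator) raster are hypothetical and not part of the openEO spec or any backend. The point is only that the reprojected geometries are a working copy, while the output carries the input geometries and their CRS.

```python
import math

R = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857), in metres


def lonlat_to_mercator(lon, lat):
    """Forward spherical Web Mercator: degrees -> metres."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def aggregate_spatial_sketch(raster, geometries, reducer):
    """Hypothetical helper, not the openEO API.

    raster:     {'crs', 'origin': (x_min, y_max), 'pixel_size', 'values': rows north to south}
    geometries: {'crs', 'boxes': [(west, south, east, north), ...]}
    """
    if raster["crs"] != "EPSG:3857" or geometries["crs"] != "EPSG:4326":
        raise ValueError("sketch only handles EPSG:4326 boxes over EPSG:3857 data")
    x0, y0 = raster["origin"]
    size = raster["pixel_size"]
    results = []
    for west, south, east, north in geometries["boxes"]:
        # Working copy of the geometry in the data CRS (option 1 for the computation).
        xmin, ymin = lonlat_to_mercator(west, south)
        xmax, ymax = lonlat_to_mercator(east, north)
        selected = []
        for r, row in enumerate(raster["values"]):
            cy = y0 - (r + 0.5) * size  # pixel-centre y
            for c, value in enumerate(row):
                cx = x0 + (c + 0.5) * size  # pixel-centre x
                if xmin <= cx <= xmax and ymin <= cy <= ymax:
                    selected.append(value)
        results.append(reducer(selected) if selected else None)
    # Output keeps the input geometries and their CRS (option 2 for the result).
    return {"crs": geometries["crs"], "boxes": list(geometries["boxes"]), "values": results}
```

The raster is never resampled here; only the (cheap) geometry coordinates are transformed, and the transformed copy is discarded after the pixel selection.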

ValentinaHutter commented 3 months ago

Thanks for the input - I agree, it makes sense to have both options! I think it is fine to leave this decision to the individual backend implementations, but this should really be documented somewhere for new developers. Does it make sense to include this in the specification? Maybe we could add a section to the specification, such as "note for developers" or simply "note"?

soxofaan commented 3 months ago

Maybe my explanation was a bit messy, I didn't mean to have/keep both options. I meant:

  - The geometries are reprojected to the data CRS.
  - The resulting vector datacube has the CRS of the input geometries.

That second bullet point should certainly be documented in the process spec (as it is relevant to both backend developers and end users). The first bullet point could be a recommendation, but I'm not sure it is a vital part to be documented.

clausmichele commented 2 days ago

If we agree on what @soxofaan proposed, we have to make sure to reuse the input geometries instead of reprojecting them twice (to the data CRS and back), since the round trip could lead to differences due to floating point rounding (it just happened to me!).
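
A small standard-library sketch of that pitfall, assuming a spherical Web Mercator (EPSG:3857) data CRS and a few arbitrary example coordinates (both are assumptions for illustration, not taken from any backend). It compares coordinates that went through a forward and inverse transform with coordinates that were simply kept from the input.

```python
import math

R = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857), in metres


def lonlat_to_mercator(lon, lat):
    """Forward spherical Web Mercator: degrees -> metres."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def mercator_to_lonlat(x, y):
    """Inverse spherical Web Mercator: metres -> degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat


# Arbitrary example input geometries (lon, lat) in EPSG:4326.
original = [(16.3738, 48.2082), (11.3548, 46.4983), (4.3517, 50.8503)]

# Reprojecting twice: close to the input, but not guaranteed to be bit-identical.
round_tripped = [mercator_to_lonlat(*lonlat_to_mercator(lon, lat)) for lon, lat in original]
max_drift = max(
    max(abs(a - c), abs(b - d)) for (a, b), (c, d) in zip(original, round_tripped)
)

# Reusing the input: exact by construction, so joins on coordinates keep working.
reused = list(original)

print("max round-trip drift in degrees:", max_drift)
print("round trip bit-identical:", round_tripped == original)
print("reused bit-identical:", reused == original)
```

Any drift is tiny in geographic terms, but an exact-equality join between output and input geometries can still fail on it, which is why keeping the original geometry objects is the safer choice.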

m-mohr commented 2 days ago

Yeah, thinking more about it, the important part is probably that you get the geometries back as provided, i.e. in the CRS of the source geometries and without any changes to the coordinates.

We should probably describe that and then also state how this is done in the background. Maybe it's actually simpler to reproject only the source data to the geometry CRS?