Vector data cubes (overview)

Open-EO / openeo.org

openeo.org landing page

https://openeo.org

Apache License 2.0

Vector data cubes (overview) #58

Open m-mohr opened 2 years ago

m-mohr commented 2 years ago

What we need to do to add vector data cubes in openEO:

- [x] Update the glossary and data cube guide (PR: https://github.com/Open-EO/openeo.org/pull/59)
- openEO API
  - [x] Update openEO API (PR: https://github.com/Open-EO/openeo-api/pull/441)
  - [x] https://github.com/Open-EO/openeo-api/issues/479
  - [ ] Release openEO API
- STAC datacube extension
  - [x] Update STAC datacube extension (PR: https://github.com/stac-extensions/datacube/pull/10)
  - [x] Release STAC datacube extension
- openEO processes
  - [x] Update openEO Processes (PR: https://github.com/Open-EO/openeo-processes/pull/382):
  - [x] https://github.com/Open-EO/openeo-processes/issues?q=is%3Aopen+label%3Avector+milestone%3A2.0.0
  - [ ] Release openEO processes (release candidates + stable)
- [ ] Update clients and back-ends
  - [ ] Python
  - [ ] R
  - [ ] JS + libraries
  - [ ] Vue Components
  - [ ] Web Editor
  - [ ] Processes DocGen

Questions:

Do we want to restrict geometries to one geometry type per vector dimension?
- Tendency: No, allow mixtures
Do we restrict to only Point, LineString, Polygon, and the Multi-variants (and thus exclude e.g. PolyhedralSurface)? We already discourage GEOMETRYCOLLECTION in several processes.
- Tendency: Yes, restrict to the types mentioned above
How do we handle null/empty geometries?
- Tendency: Don't allow them / skip them during import
Representation of "dimension labels" (in STAC: "values")?
- In metadata: ID, WKT, or GeoJSON (see STAC data cube extension PR)
- In processes: It is just a representation so we can do multiple things, e.g. allow users to choose between WKT and ID. Or we need to decide on one of them. We can't really use "1D vector cubes" as labels (unless we change it).
How to handle units in processes? See https://github.com/Open-EO/openeo-processes/issues/330
Define (and describe) generally how to convert vector data into a vector data cube: https://github.com/Open-EO/openeo-processes/issues/346#issuecomment-1073758539. There's a proposal from Brockmann; GeoJSON could be aligned with STAC (datetime in properties).
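
To make the tendencies for the geometry type restriction and for null/empty geometries concrete, here is a minimal sketch of an import filter over a GeoJSON FeatureCollection. The function names and the rule that an empty `coordinates` array counts as an empty geometry are assumptions for illustration, not anything the openEO specification prescribes.

```python
# Assumption: only these six types are accepted, following the tendency above.
ALLOWED_TYPES = {
    "Point", "LineString", "Polygon",
    "MultiPoint", "MultiLineString", "MultiPolygon",
}


def is_importable(geometry):
    """Return True if a GeoJSON geometry is non-null, of an allowed type, and non-empty."""
    if geometry is None:
        return False
    if geometry.get("type") not in ALLOWED_TYPES:
        return False
    # Assumption: an empty coordinates array means "empty geometry".
    return bool(geometry.get("coordinates"))


def import_features(feature_collection):
    """Split the features into (kept, skipped) lists."""
    kept, skipped = [], []
    for feature in feature_collection.get("features", []):
        if is_importable(feature.get("geometry")):
            kept.append(feature)
        else:
            skipped.append(feature)
    return kept, skipped


example = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [11.35, 46.5]}, "properties": {}},
        {"type": "Feature", "geometry": None, "properties": {}},
        {"type": "Feature", "geometry": {"type": "GeometryCollection", "geometries": []}, "properties": {}},
        {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": []}, "properties": {}},
    ],
}

kept, skipped = import_features(example)
print(len(kept), len(skipped))  # 1 3
```

Whether skipped features should be dropped silently or reported back to the user is a separate question that this sketch leaves open.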

mkadunc commented 2 years ago

Do we want to restrict geometries to one geometry type each per vector dimension?

I'm not a fan of restricting in the standard; maybe, if restricting would be required for easier implementation, we could add this info to the backend capabilities.

It would be useful to have metadata about types in a dimension for specific data cubes, though - i.e. if I load a vector cube, it would be good to know which geometry types to expect for the labels on the spatial dimension.

Point, LineString, Polygon, and the Multi-variants...

I agree that we leave out PolyhedralSurface etc. (for now). GeometryCollection is borderline - some vector operations might return GC in which case we'll have to "normalize" the results to the higher-dimensional type (e.g. an intersection of two linestrings will most likely be a point, but could also be a linestring; if we support only one type, we'll have to represent all points as degenerate single-point linestrings).

Having looked at OGC EDR, it seems that support for XYZ and XYM / XYZM geometries would be useful.

Representation of "dimension labels" (in STAC: "values")?

I'd say GeoJSON (the 'non-standard' one with CRS).

I don't think ID is necessary - if you strip geometry values from a vector cube, it becomes a non-vector data-cube IMO.

Do we need the actual geometries in callbacks?

I'd say yes - let's treat geometry labels same as any other labels (e.g. named bands).

How do we handle processes that now require "raster-cubes"?

Rename raster-cube to data-cube in the schema and replace everywhere. Then introduce raster-cube as a subclass, and use it instead of data-cube in processes that do special things with raster spatial dimensions (x,y).

What name do we recommend for the vector dimension?

geometry seems better than vector. feature would also be an option, or reference-geometry

m-mohr commented 2 years ago

Thanks, @mkadunc. Interesting that several of your points are exactly contrary to what @edzer proposed to me before. I guess you can have some good discussions here while I'm on vacation. ;-)

It would be useful to have metadata about types in a dimension for specific data cubes, though

That's a pretty good idea indeed. I should add that to https://github.com/stac-extensions/datacube/pull/10

GeometryCollection is borderline

Right now we say in processes that

To maximize interoperability, a nested GeometryCollection should be avoided. Furthermore, a GeometryCollection composed of a single type of geometries should be avoided in favour of the corresponding multi-part type (e.g. MultiPolygon).

Not sure what backends actually do with this in implementation though.
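
As an illustration of that recommendation, a back-end could rewrite a flat GeometryCollection of a single geometry type into the corresponding multi-part type. This is only a sketch on plain GeoJSON dictionaries; `normalize_collection` is a hypothetical helper, and nested or mixed collections are deliberately left unchanged.

```python
# Mapping from single-part GeoJSON types to their multi-part counterparts.
MULTI_TYPE = {
    "Point": "MultiPoint",
    "LineString": "MultiLineString",
    "Polygon": "MultiPolygon",
}


def normalize_collection(geometry):
    """Turn a flat, single-type GeometryCollection into the matching multi-part geometry.

    Any other geometry (including mixed, nested, or empty collections)
    is returned unchanged.
    """
    if geometry.get("type") != "GeometryCollection":
        return geometry
    members = geometry.get("geometries", [])
    member_types = {member.get("type") for member in members}
    if len(member_types) != 1:
        return geometry  # empty or mixed collection
    (member_type,) = member_types
    if member_type not in MULTI_TYPE:
        return geometry  # nested collections or multi-part members are not handled here
    return {
        "type": MULTI_TYPE[member_type],
        "coordinates": [member["coordinates"] for member in members],
    }


points = {
    "type": "GeometryCollection",
    "geometries": [
        {"type": "Point", "coordinates": [1, 2]},
        {"type": "Point", "coordinates": [3, 4]},
    ],
}
print(normalize_collection(points))
# {'type': 'MultiPoint', 'coordinates': [[1, 2], [3, 4]]}
```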

I'd say GeoJSON (the 'non-standard' one with CRS).

Then it's not GeoJSON though. So do you mean the actually invalid one (I'd like to avoid that), or were you referring to the new JSON-FG from OGC? https://github.com/opengeospatial/ogc-feat-geo-json (I could see us using that, but it's WIP).

Rename raster-cube to data-cube in the schema and replace everywhere. Then introduce raster-cube as a subclass, and use it instead of data-cube in processes that do special things with raster spatial dimensions (x,y).

That's breaking and requires processes v2.0. I assume implementors will not be happy about it. (Also, in the schemas we don't really have subclasses, except for subclassing native types.)

edzer commented 2 years ago

geometry seems better than vector. feature would also be an option, or reference-geometry

I also like geometry, or alternatively feature_geometry. In SFA a feature is a thing that has a geometry and other attributes.

I think I'm also in favour of a GeoJSON that does not restrict to EPSG:4326. Although that is an (IETF) standard, it's clearly out of date and not good enough for today's requirements. But the individual feature geometries must then each come with a CRS, right? Or will the CRS be a property of the metadata for the dimension as a whole?
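
To illustrate the second option (CRS as a property of the dimension as a whole), here is a rough sketch of what the dimension metadata could look like. The field names `reference_system`, `geometry_types`, and `values` are assumptions modeled loosely on the STAC datacube extension discussion, and `describe_geometry_dimension` is a hypothetical helper.

```python
def describe_geometry_dimension(labels, reference_system):
    """Build dimension-level metadata for a list of GeoJSON geometry labels.

    The CRS is stored once for the whole dimension, so the individual
    geometries carry no CRS member of their own.
    """
    return {
        "type": "geometry",
        # Assumption: one CRS for all labels, given here as an EPSG code.
        "reference_system": reference_system,
        # Which geometry types a user can expect along this dimension.
        "geometry_types": sorted({geometry["type"] for geometry in labels}),
        "values": labels,
    }


labels = [
    {"type": "Point", "coordinates": [680000.0, 5150000.0]},
    {"type": "Polygon", "coordinates": [[
        [680000.0, 5150000.0], [681000.0, 5150000.0],
        [681000.0, 5151000.0], [680000.0, 5150000.0],
    ]]},
]

dimension = describe_geometry_dimension(labels, reference_system=32632)
print(dimension["geometry_types"])  # ['Point', 'Polygon']
```

With this layout, the geometries themselves stay syntactically plain GeoJSON geometry objects, and only the surrounding metadata says that the coordinates are not in WGS 84.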

m-mohr commented 2 years ago

Discussed with @edzer:

1. Allow different types per dimension.
2. Yes, restrict to the types mentioned above
3. Representation of "dimension labels": In metadata: see STAC - In processes: Vector cube, 1 vector dimension, 1 label
4. ?
5. See https://github.com/Open-EO/openeo-processes/issues/330
6. geometry

m-mohr commented 2 years ago

Question 7: What do we do with additional "metadata", e.g. ids and properties assigned to a feature? Related: https://github.com/Open-EO/openeo-processes/issues/347#issuecomment-1070742781

Not sure about the IDs, but I guess for vector data you specify which properties to load into the data cube (as an additional dimension if 2+ properties) and the rest is kept somewhere in the background. So we may want to add id and properties as additional optional fields to the vector dimension. There's no way to access this information through processes right now, but we should probably state that id and properties are generally kept untouched by processes unless a process states otherwise.

By the way, this issue of additional metadata that is present at the start, may get passed through, and should be included in the result is also very much unspecified for raster.

mkadunc commented 2 years ago

for vector data you specify which properties to load into the data cube (as additional dimension if 2+ properties) and the rest is kept somewhere in the background. So we may want to add id and properties as additional optional fields to the vector dimension.

I'm not sure I understand this 'additional dimension' part — say we have a vector cube which stores a real-valued variable mean_reflectance with 3 dimensions (geometry, time, band), and we want to load 2 extra properties for vector data (e.g. id, land_class):

if extra properties are loaded as additional dimension, then:
- the variable changes and becomes just value (openEO concept of a data cube does not allow for more than one variable)
- the type of the variable changes from real/float to any (or string... something that can capture the original variable and all types of the extra properties)
- the extra dimension basically takes the role of variable, and has labels {mean_reflectance, id, land_class}
- the cube is quite unbalanced - the sub-cubes for variable indices of id and land_class are basically 1-D (value is constant along time and band dimensions), and the sub-cube for variable = mean_reflectance is 3-D
if extra properties (id and land_class) are additional fields on the vector dimension:
- the variable stays mean_reflectance and keeps its type, regardless of any extra properties loaded
- extra properties are stored on the geometry dimension, e.g. inside its labels (if labels are GeoJSON, we could use 'feature' object type to store this; or we allow labels to be tuples, generic JSON dictionaries or arrays)
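
The two options can be contrasted in a small sketch using plain Python dictionaries. All labels, property values, and the `reflectance` helper are made up for illustration; real back-ends would use array structures rather than dictionaries.

```python
geometries = ["POINT (1 1)", "POINT (2 2)"]
times = ["2022-01-01", "2022-02-01"]
bands = ["B04", "B08"]
feature_props = [
    {"id": "a", "land_class": "forest"},
    {"id": "b", "land_class": "water"},
]


def reflectance(g, t, b):
    """Made-up mean_reflectance value for the given indices."""
    return round(0.1 * (g + 1) + 0.01 * t + 0.001 * b, 3)


index = [(g, t, b) for g in range(2) for t in range(2) for b in range(2)]

# Option A: extra properties become labels of an additional dimension.
# The single variable is now a generic "value" of mixed type.
cube_a = {}
for g, t, b in index:
    cube_a[(g, t, b, "mean_reflectance")] = reflectance(g, t, b)
    for name, value in feature_props[g].items():
        cube_a[(g, t, b, name)] = value  # constant along time and band

# Option B: extra properties are fields on the geometry dimension labels.
# The variable stays mean_reflectance and keeps its float type.
cube_b = {
    "variable": "mean_reflectance",
    "dimensions": {
        "geometry": [
            {"geometry": wkt, "properties": props}
            for wkt, props in zip(geometries, feature_props)
        ],
        "time": times,
        "band": bands,
    },
    "values": {(g, t, b): reflectance(g, t, b) for g, t, b in index},
}

print(len(cube_a), sorted({type(v).__name__ for v in cube_a.values()}))
# 24 ['float', 'str']
print(len(cube_b["values"]), sorted({type(v).__name__ for v in cube_b["values"].values()}))
# 8 ['float']
```

Option A triples the number of cells and mixes value types, with id and land_class repeated for every time and band; option B keeps the cube homogeneous and stores each property once per geometry.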

soxofaan commented 2 years ago

I agree with @mkadunc and had the same conceptual struggle in https://github.com/Open-EO/openeo-processes/issues/356

m-mohr commented 2 years ago

I think we need to discuss this again in detail with all experts. As we are close to the end of SRR3, we will likely not be able to tackle it beforehand so I'd propose to have a dedicated meeting afterward (or discuss it in Bolzano).

m-mohr commented 2 years ago

Some notes from the April PSC meeting:

Uni Salzburg - zgis is also working on data cubes
CovJson - interesting file format, now also maturing in the OGC
Set up a meeting with EDC in the next weeks, all to write documents about the understanding of data cubes until LPS, then have some dedicated time in June to work on stuff (didn't happen => summer)
Maybe don’t add vector-cube and raster-cube subtypes and just specify data-cube as a type and then specify in processes required dimension types. Inheritance is a problem in process definitions, but lack of inheritance might be an issue in implementations like the Python client.
We may want to consider going towards openEO processes 2.0
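
A rough sketch of the idea of using a single data-cube type and letting each process state the dimension types it requires, instead of relying on vector-cube and raster-cube subtypes. The dimension type names and the `missing_dimension_types` helper are assumptions for illustration and not part of any released openEO schema.

```python
def missing_dimension_types(cube_dimensions, required):
    """Return the required dimension types that a cube does not provide.

    cube_dimensions maps dimension names to their types,
    e.g. {"x": "spatial", "t": "temporal"}.
    required lists the dimension types a process needs; a type listed
    twice must be provided by two different dimensions.
    """
    available = list(cube_dimensions.values())
    missing = []
    for dim_type in required:
        if dim_type in available:
            available.remove(dim_type)  # each dimension satisfies one requirement
        else:
            missing.append(dim_type)
    return missing


# Hypothetical requirements declared per process rather than per cube subtype.
RASTER_PROCESS = ["spatial", "spatial"]  # needs x and y
VECTOR_PROCESS = ["geometry"]

raster_cube = {"x": "spatial", "y": "spatial", "t": "temporal", "bands": "bands"}
vector_cube = {"geometry": "geometry", "t": "temporal"}

print(missing_dimension_types(raster_cube, RASTER_PROCESS))  # []
print(missing_dimension_types(vector_cube, RASTER_PROCESS))  # ['spatial', 'spatial']
print(missing_dimension_types(vector_cube, VECTOR_PROCESS))  # []
```

A client such as the Python client would then decide which methods to offer from a check like this at runtime, rather than from a class hierarchy, which is where the lack of inheritance might become awkward.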