Open clausmichele opened 2 years ago
I'm also not 100% sure yet how we handle properties in such cases. In theory, such a property could be a dimension or it still resides in the features, which would require additional processes.
Regardless of what the vector cube looks like, we likely need to add a new parameter for this use case to the fit_* processes.
This is, for example, my version of a vector-cube: a table with the geometry field plus some other columns. This would be the output of aggregate_spatial, for instance:
idx | class | geometry | result | result_meta |
---|---|---|---|---|
0 | 0 | POINT (11.12157 46.06955) | {'B04_10m': 1522.0, 'B08_10m': 2127.5} | {'total_count': 2.0, 'valid_count': 2.0} |
1 | 0 | POINT (11.12171 46.06909) | {'B04_10m': 1121.5, 'B08_10m': 1547.0} | {'total_count': 2.0, 'valid_count': 2.0} |
2 | 0 | POINT (11.12324 46.06875) | {'B04_10m': 1045.5, 'B08_10m': 1406.0} | {'total_count': 2.0, 'valid_count': 2.0} |
3 | 0 | POINT (11.12326 46.06900) | {'B04_10m': 1093.5, 'B08_10m': 1497.5} | {'total_count': 2.0, 'valid_count': 2.0} |
4 | 0 | POINT (11.12370 46.06757) | {'B04_10m': 1455.0, 'B08_10m': 2059.5} | {'total_count': 2.0, 'valid_count': 2.0} |
... | ... | ... | ... | ... |
100 | 1 | POINT (11.14474 46.13561) | {'B04_10m': 812.5, 'B08_10m': 3255.5} | {'total_count': 2.0, 'valid_count': 2.0} |
101 | 1 | POINT (11.15225 46.13465) | {'B04_10m': 614.0, 'B08_10m': 2137.5} | {'total_count': 2.0, 'valid_count': 2.0} |
102 | 1 | POINT (11.15215 46.13448) | {'B04_10m': 559.5, 'B08_10m': 2053.0} | {'total_count': 2.0, 'valid_count': 2.0} |
103 | 1 | POINT (11.15340 46.13687) | {'B04_10m': 549.5, 'B08_10m': 2001.0} | {'total_count': 2.0, 'valid_count': 2.0} |
104 | 1 | POINT (11.15475 46.13695) | {'B04_10m': 596.0, 'B08_10m': 1806.5} | {'total_count': 2.0, 'valid_count': 2.0} |
The process should allow the user to choose which property (or column, if we think of a general DB) of the vector-cube holds the required data (@jdries mentioned this already in the last dev meeting).
In our definition of a data cube, cell values are scalars (see here, sixth paragraph); we do not allow for multiple variables (or attributes) at each combination of dimension values (cell = scalar). For vector data cubes, this should also hold. So when defining the vector data cube from the geojson file above, you first need to specify which field holds the data cube values (cell values), which should be class, before you can call it a data cube. You then end up with a one-dimensional vector data cube with geometries (points) as dimension values and the class values as cell values. Hence, you no longer need to select an attribute.
If the geojson file above had four attributes, B1, B2, B3, B4, they could hold the values along a second dimension, band, with dimension values B1,...,B4, and be imported as a vector data cube of size npoints x 4. Each combination (POINT, band) would give a single, scalar value.
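As a rough illustration of the "npoints x bands" idea described above, here is a minimal sketch in plain Python. It assumes features are simple dictionaries with a WKT geometry string and one numeric attribute per band; the function name `to_vector_cube` is hypothetical, and the band names and values are borrowed from the aggregate_spatial table earlier in the thread rather than from B1,...,B4.

```python
# Illustrative features: WKT geometries plus one numeric attribute per band.
# Values are copied from the aggregate_spatial table earlier in the thread.
features = [
    {"geometry": "POINT (11.12157 46.06955)", "B04_10m": 1522.0, "B08_10m": 2127.5},
    {"geometry": "POINT (11.12171 46.06909)", "B04_10m": 1121.5, "B08_10m": 1547.0},
    {"geometry": "POINT (11.12324 46.06875)", "B04_10m": 1045.5, "B08_10m": 1406.0},
]


def to_vector_cube(features, band_names):
    """Hypothetical helper: return (geometry labels, band labels, values).

    `values` is a nested list indexed as values[geometry][band], so every
    (geometry, band) combination holds exactly one scalar.
    """
    geometries = [f["geometry"] for f in features]
    values = [[float(f[band]) for band in band_names] for f in features]
    return geometries, list(band_names), values


geometries, bands, values = to_vector_cube(features, ["B04_10m", "B08_10m"])
shape = (len(geometries), len(bands))  # (number of points, number of bands)
```

Attributes of a different type (for example a string class label) would not fit into `values` here, which is the limitation discussed in the following comments.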
In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.
> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.
I assume "geojson file" was meant to refer to the table that Michele posted here? https://github.com/Open-EO/openeo-processes/issues/341#issuecomment-1067886942
But then what should the output of aggregate_spatial look like if the input has dimensions x,y,bands or x,y,time? I couldn't find any other way of satisfying the requirements of the process, since we need to create a single new target dimension (not multiple) which can hold values from different bands or timesteps. https://processes.openeo.org/#aggregate_spatial
As far as I understand it, the definition of aggregate_spatial simply doesn't work with the definition that @edzer proposes.
Well, we will keep this internal representation of the vector-cube for now since we need it for UC8; once we have a clearer definition, we can modify it.
@clausmichele I guess it would be better to flatten result and result_meta? Does that make sense?
If I understood @edzer correctly, then the data cube may look like this:
Dimensions:
- geometry with labels: POINT (11.12157 46.06955), POINT (11.12171 46.06909), ...
- properties (or result, to follow the default target dimension in aggregate_spatial) with labels: class, B04_10m, B08_10m, total_count, valid_count

geometry v \ properties > | class | B04_10m | B08_10m | total_count | valid_count |
---|---|---|---|---|---|
POINT (11.12157 46.06955) | 0 | 1522.0 | 2127.5 | 2 | 2 |
POINT (11.12171 46.06909) | 0 | 1121.5 | 1547.0 | 2 | 2 |
… | … | … | … | … | … |
We'd need to fix the description of the returned vector cube in aggregate_spatial then.
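The flattening of result and result_meta suggested above could be sketched roughly as follows. This assumes each table row is a plain dictionary whose `result` and `result_meta` entries are nested dictionaries, as in the table posted earlier; `flatten_row` is a hypothetical helper, not part of any openEO implementation.

```python
def flatten_row(row, nested_keys=("result", "result_meta")):
    """Lift the entries of nested dict columns up into top-level columns."""
    flat = {}
    for key, value in row.items():
        if key in nested_keys and isinstance(value, dict):
            # Each nested entry becomes its own column (one label on the
            # proposed "properties" dimension).
            flat.update(value)
        else:
            flat[key] = value
    return flat


# First row of the table posted earlier in the thread.
row = {
    "class": 0,
    "geometry": "POINT (11.12157 46.06955)",
    "result": {"B04_10m": 1522.0, "B08_10m": 2127.5},
    "result_meta": {"total_count": 2.0, "valid_count": 2.0},
}
flat = flatten_row(row)
```

Note that this sketch silently overwrites a column if two nested dictionaries share a key; a real implementation would probably need to prefix or reject such collisions.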
Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.
Although it actually seems that total_count and valid_count should be per band in this case? aggregate_spatial says per geometry, but this seems pretty useless. Then the class is somewhat getting in the way...
I've computed it per geometry -> 1 point selected with two bands -> two valid pixels.
> Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.
Yes, to make fit_regr_random_forest work at EODC it made more sense for us to have a separate column for every band we have. For now I use a predictors_vars parameter, which specifies the bands that are used as predictors, and a target_var parameter to specify the band that is used as the target. My predictors_vars would be a list (for example ['B04', 'B08']) and the target_var would be a string (for example 'ndvi', if that's the name of the band). Of course this is just for testing and I will update the parameters later.
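To make the role of these two parameters concrete, here is a minimal sketch of how a table with one column per band might be split into predictors and target. It only mirrors the parameter names `predictors_vars` and `target_var` mentioned above; the function `split_predictors_target` is hypothetical and is not the EODC implementation. The 'ndvi' column is computed in the example from the band values so that no extra data has to be assumed.

```python
def split_predictors_target(rows, predictors_vars, target_var):
    """Return (X, y): predictor rows and target values from flat table rows."""
    X = [[row[name] for name in predictors_vars] for row in rows]
    y = [row[target_var] for row in rows]
    return X, y


# Band values taken from two rows of the table earlier in the thread,
# using the shorter column names from the comment above as placeholders.
rows = [
    {"B04": 1522.0, "B08": 2127.5},
    {"B04": 812.5, "B08": 3255.5},
]
for row in rows:
    # Derive an illustrative target column from the two bands.
    row["ndvi"] = (row["B08"] - row["B04"]) / (row["B08"] + row["B04"])

X, y = split_predictors_target(rows, predictors_vars=["B04", "B08"], target_var="ndvi")
```

`X` and `y` have the shape that typical regression libraries expect (a list of feature rows and a list of target values), which is presumably why one column per band is convenient here.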
> If I understood @edzer correctly, then the data cube may look like this: ...

I think there is a problem here as the "class" column is a string in the example that @clausmichele posted (https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson), which is probably not unusual in practice. And theoretically there is also a bit of a conflict between the float aggregation columns and the integer count columns. That's what @edzer was noting too:

> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.
Yes, that's what I'm also struggling with right now in general. We need more discussions on this.
Well, if the string is the issue, it can easily be converted into a number.

> Well, if the string is the issue, it can easily be converted into a number.
In this example it's probably easy. But I don't think you can do that "easily" or automatically in general. I think it indicates that there is something conceptually wrong in how we define/handle vector cubes.
You could also theoretically argue that mixing the float aggregation columns with integer count columns in the same "cube" is bad practice.
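For the simple case discussed above, converting string classes to numbers could look like the sketch below. The labels "forest" and "urban" are hypothetical placeholders (the actual values in the posted file are not shown in this thread), and `encode_labels` is an illustrative helper.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for label in labels:
        if label not in codes:
            codes[label] = len(codes)
        encoded.append(codes[label])
    return encoded, codes


# Hypothetical class labels, for illustration only.
encoded, codes = encode_labels(["forest", "urban", "forest"])
```

The sketch also hints at why this is hard to do automatically in general: the integer codes are arbitrary, they suggest an ordering that the classes do not have, and the `codes` mapping has to be stored somewhere to interpret the cube later.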
After a discussion with @ValentinaHutter @LukeWeidenwalker and @mattia6690 we concluded that:
@m-mohr how would you add the column selection to the process? My idea of a vector-cube is a table in which the geometry column is mandatory and everything else is optional. We would like to select one column among those optional ones.
For example, from this vector-cube we could select the class field from the GeoJSON properties: https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson
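A minimal sketch of such a column selection on a GeoJSON FeatureCollection, using only the standard library, might look like this. The inline document is a tiny stand-in built from two points of the table above, not the contents of the linked file (whose class values may be stored differently), and `select_property` is a hypothetical helper rather than a proposed process signature.

```python
import json

# Tiny stand-in FeatureCollection; coordinates and class values are taken
# from the table earlier in the thread, not from the linked file.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [11.12157, 46.06955]},
     "properties": {"class": 0}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [11.14474, 46.13561]},
     "properties": {"class": 1}}
  ]
}
"""


def select_property(feature_collection, name):
    """Return (geometries, values) for one property of every feature."""
    geometries, values = [], []
    for feature in feature_collection["features"]:
        geometries.append(feature["geometry"])
        # Features lacking the property yield None instead of raising.
        values.append(feature["properties"].get(name))
    return geometries, values


geometries, values = select_property(json.loads(geojson_text), "class")
```

The result is a one-dimensional structure with geometries as labels and one scalar per geometry, which matches the "cell = scalar" view described earlier in the thread.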