`fit_*_random_forest` : allow vector-cube column selection

clausmichele commented 2 years ago

After a discussion with @ValentinaHutter @LukeWeidenwalker and @mattia6690 we concluded that:

- The fit_*_random_forest processes have vector-cubes as input data types.
- The process should allow the user to select which property (or column, if we refer to a general DB) of the vector-cube to use to get the required data (@jdries mentioned this already in the last dev meeting).
- This is required if the target vector-cube is not the output of aggregate_spatial but comes from a file (as is the case for UC8). Sorry if this didn't come up earlier, but I wasn't fully aware of how the target vector-cube should look.

@m-mohr how would you add the column selection to the process? My idea of a vector-cube is a table in which the geometry column is mandatory and everything else is optional. We would like to select one of those optional columns.

For example, from this vector-cube we could select the class field from the GeoJSON properties: https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson
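To make the request concrete, here is a minimal Python sketch of what "selecting a column" from a GeoJSON-based vector-cube could mean. The helper name `select_property` and the two inline features (including their class values) are hypothetical and only mimic the structure of a FeatureCollection; they are not taken from the linked file or from any openEO process definition.

```python
import json

# Hypothetical two-feature excerpt mimicking the structure of a GeoJSON
# FeatureCollection; the property values are illustrative only.
GEOJSON_TEXT = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"class": "urban", "id": 1},
     "geometry": {"type": "Point", "coordinates": [11.12157, 46.06955]}},
    {"type": "Feature",
     "properties": {"class": "forest", "id": 2},
     "geometry": {"type": "Point", "coordinates": [11.14474, 46.13561]}}
  ]
}
"""


def select_property(feature_collection, name):
    """Return (geometry, value) pairs for one property of every feature."""
    pairs = []
    for feature in feature_collection["features"]:
        properties = feature.get("properties") or {}
        if name not in properties:
            raise KeyError(f"feature has no property {name!r}")
        pairs.append((feature["geometry"], properties[name]))
    return pairs


collection = json.loads(GEOJSON_TEXT)
# Keep only the values of the selected column as the training target.
targets = [value for _, value in select_property(collection, "class")]
print(targets)
```

A new parameter on the fit_* processes would play the role of the `name` argument here.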

m-mohr commented 2 years ago

I'm also not 100% sure yet how we handle properties in such cases. In theory, such a property could be a dimension or it still resides in the features, which would require additional processes.

Regardless of what the vector cube looks like, we likely need to add a new parameter for this use case to the fit_* processes.

clausmichele commented 2 years ago

This is, for example, my version of a vector-cube: a table with the geometry field plus some other columns. This would be the output of aggregate_spatial, for instance:

idx	class	geometry	result	result_meta
0	0	POINT (11.12157 46.06955)	{'B04_10m': 1522.0, 'B08_10m': 2127.5}	{'total_count': 2.0, 'valid_count': 2.0}
1	0	POINT (11.12171 46.06909)	{'B04_10m': 1121.5, 'B08_10m': 1547.0}	{'total_count': 2.0, 'valid_count': 2.0}
2	0	POINT (11.12324 46.06875)	{'B04_10m': 1045.5, 'B08_10m': 1406.0}	{'total_count': 2.0, 'valid_count': 2.0}
3	0	POINT (11.12326 46.06900)	{'B04_10m': 1093.5, 'B08_10m': 1497.5}	{'total_count': 2.0, 'valid_count': 2.0}
4	0	POINT (11.12370 46.06757)	{'B04_10m': 1455.0, 'B08_10m': 2059.5}	{'total_count': 2.0, 'valid_count': 2.0}
...	...	...	...	...
100	1	POINT (11.14474 46.13561)	{'B04_10m': 812.5, 'B08_10m': 3255.5}	{'total_count': 2.0, 'valid_count': 2.0}
101	1	POINT (11.15225 46.13465)	{'B04_10m': 614.0, 'B08_10m': 2137.5}	{'total_count': 2.0, 'valid_count': 2.0}
102	1	POINT (11.15215 46.13448)	{'B04_10m': 559.5, 'B08_10m': 2053.0}	{'total_count': 2.0, 'valid_count': 2.0}
103	1	POINT (11.15340 46.13687)	{'B04_10m': 549.5, 'B08_10m': 2001.0}	{'total_count': 2.0, 'valid_count': 2.0}
104	1	POINT (11.15475 46.13695)	{'B04_10m': 596.0, 'B08_10m': 1806.5}	{'total_count': 2.0, 'valid_count': 2.0}

edzer commented 2 years ago

> The process should allow the user to select which property (or column, if we refer to a general DB) of the vector-cube to use to get the required data (@jdries mentioned this already in the last dev meeting)

In our definition of a data cube, cell values are scalars (see here, sixth paragraph): we do not allow multiple variables (or attributes) at each combination of dimension values (cell = scalar). For vector data cubes, this should also hold. So when defining the vector data cube from the geojson file above, you first need to specify which field holds the data cube values (cell values), which should be class, before you can call it a data cube. You then end up with a one-dimensional vector data cube with geometries (points) as dimension values and the class values as cell values, and hence you don't need to select an attribute.

If the geojson file above had 4 attributes, B1, B2, B3, B4, they could contain the values of a second dimension, band, with dimension values B1,...,B4, and be imported as a vector data cube with dimensions npoints x 4. Each combination (POINT, band) would give a single, scalar value.

In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.
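The npoints x 4 case described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `to_cube`, the dictionary layout of the result, and the attribute values are all hypothetical, and nested lists stand in for a real array type.

```python
# Hypothetical features: each point carries four numeric attributes B1..B4.
features = [
    {"geometry": "POINT (11.12157 46.06955)",
     "properties": {"B1": 0.10, "B2": 0.20, "B3": 0.30, "B4": 0.40}},
    {"geometry": "POINT (11.12171 46.06909)",
     "properties": {"B1": 0.15, "B2": 0.25, "B3": 0.35, "B4": 0.45}},
]
BANDS = ["B1", "B2", "B3", "B4"]


def to_cube(features, bands):
    """Build a two-dimensional (geometry x band) cube with scalar cells."""
    geometries = [feature["geometry"] for feature in features]
    # float() rejects non-numeric text such as a class name, so attributes
    # of a different type cannot end up as labels of the band dimension.
    values = [[float(feature["properties"][band]) for band in bands]
              for feature in features]
    return {
        "dims": ("geometry", "band"),
        "geometry": geometries,
        "band": list(bands),
        "values": values,
    }


cube = to_cube(features, BANDS)
shape = (len(cube["values"]), len(cube["values"][0]))
print(shape)  # (npoints, nbands)
```

Each `cube["values"][i][j]` is a single scalar for the combination (point i, band j), which is the property the definition above asks for.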

m-mohr commented 2 years ago

> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.

I assume "geojson file" was meant to refer to the table that Michele posted here? https://github.com/Open-EO/openeo-processes/issues/341#issuecomment-1067886942

clausmichele commented 2 years ago

But then what should the output of aggregate_spatial look like if the input has dimensions x, y, bands or x, y, time? I couldn't find any other way of satisfying the requirements of the process, since we need to create a single new target dimension (not multiple) which can hold values from different bands or timesteps. https://processes.openeo.org/#aggregate_spatial

m-mohr commented 2 years ago

As far as I understand it, the definition of aggregate_spatial simply doesn't work with the definition that @edzer proposes.

clausmichele commented 2 years ago

Well, we will keep this internal representation of the vector-cube for now since we need it for UC8. Once we have a clearer definition, we can modify it.

m-mohr commented 2 years ago

@clausmichele I guess it would be better to flatten result and result_meta? Does that make sense?

If I understood @edzer correctly, then the data cube may look like this:

Dimensions:

geometry with labels: POINT (11.12157 46.06955), POINT (11.12171 46.06909), ...
properties (or result to follow the default target dimension in aggregate_spatial) with labels: class, B04_10m, B08_10m, total_count, valid_count

geometry v \ properties >	class	B04_10m	B08_10m	total_count	valid_count
POINT (11.12157 46.06955)	0	1522.0	2127.5	2	2
POINT (11.12171 46.06909)	0	1121.5	1547.0	2	2
…	…	…	…	…	…

We'd need to fix the description of the returned vector cube in aggregate_spatial then.
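A minimal sketch of the flattening, using the first two rows of the table posted earlier in this thread. The helper name `flatten_row` and the choice to raise on duplicate column names are assumptions made for illustration, not part of any process definition.

```python
# First two rows of the aggregate_spatial-style table posted in this thread.
rows = [
    {"class": 0, "geometry": "POINT (11.12157 46.06955)",
     "result": {"B04_10m": 1522.0, "B08_10m": 2127.5},
     "result_meta": {"total_count": 2.0, "valid_count": 2.0}},
    {"class": 0, "geometry": "POINT (11.12171 46.06909)",
     "result": {"B04_10m": 1121.5, "B08_10m": 1547.0},
     "result_meta": {"total_count": 2.0, "valid_count": 2.0}},
]


def flatten_row(row, nested=("result", "result_meta")):
    """Lift the keys of the nested dict columns into top-level columns."""
    flat = {}
    for key, value in row.items():
        items = value.items() if key in nested else [(key, value)]
        for name, item in items:
            if name in flat:
                # Assumption: a name clash is an error rather than an overwrite.
                raise ValueError(f"duplicate column {name!r}")
            flat[name] = item
    return flat


flat_rows = [flatten_row(row) for row in rows]
# Labels of the "properties" dimension; geometry is the other dimension.
labels = [name for name in flat_rows[0] if name != "geometry"]
print(labels)
```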

clausmichele commented 2 years ago

Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.

m-mohr commented 2 years ago

Although it actually seems that total_count and valid_count should be per band in this case? aggregate_spatial says per geometry, but this seems pretty useless. Then the class is somewhat getting in the way...

clausmichele commented 2 years ago

I've computed it per geometry -> 1 point selected with two bands -> two valid pixels.
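The difference between the two counting conventions can be sketched like this. The function names are hypothetical, and treating NaN as the only kind of invalid pixel is an assumption; the sample values are the first row of the table above.

```python
import math

# One point geometry sampled from two bands: one pixel per band.
samples = {"B04_10m": [1522.0], "B08_10m": [2127.5]}


def counts_per_band(samples):
    """Count total and valid (non-NaN) pixels separately for each band."""
    return {
        band: {
            "total_count": len(values),
            "valid_count": sum(not math.isnan(v) for v in values),
        }
        for band, values in samples.items()
    }


def counts_per_geometry(samples):
    """Sum the per-band counts into a single pair of counts per geometry."""
    per_band = counts_per_band(samples).values()
    return {
        "total_count": sum(c["total_count"] for c in per_band),
        "valid_count": sum(c["valid_count"] for c in per_band),
    }


print(counts_per_geometry(samples))
print(counts_per_band(samples))
```

Per geometry, the single point yields a count of 2 (one pixel in each of the two bands); per band, each count is 1.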

ValentinaHutter commented 2 years ago

> Fine for me! I've just discussed with @ValentinaHutter and she was also implementing it like this.

Yes, to make fit_regr_random_forest work at EODC it made more sense for us to have a separate column for every band we have. For now I use a predictors_vars parameter, which specifies the bands that are used in the predictors, and a target_var parameter to specify the band that is used as the target. My predictors_vars would be a list (for example ['B04', 'B08']) and the target_var would be a string (for example 'ndvi', if that's the name of the band). Of course this is just for testing and I will update the parameters later.
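As an illustration of how such parameters could be applied to a table with one column per band, here is a small sketch. It is not the EODC implementation: the function `split_predictors_target` and the sample values are hypothetical, and only the parameter names predictors_vars and target_var come from the comment above.

```python
# Hypothetical flattened table with one column per band.
rows = [
    {"geometry": "POINT (11.12157 46.06955)", "B04": 0.05, "B08": 0.30, "ndvi": 0.71},
    {"geometry": "POINT (11.12171 46.06909)", "B04": 0.08, "B08": 0.24, "ndvi": 0.50},
]


def split_predictors_target(rows, predictors_vars, target_var):
    """Split table rows into a feature matrix X and a target vector y."""
    if target_var in predictors_vars:
        raise ValueError("target_var must not also be listed as a predictor")
    X = [[row[name] for name in predictors_vars] for row in rows]
    y = [row[target_var] for row in rows]
    return X, y


X, y = split_predictors_target(rows, predictors_vars=["B04", "B08"], target_var="ndvi")
print(X)
print(y)
```

The resulting `X` and `y` have the shape that a typical random forest fitting routine expects, for instance scikit-learn's `RandomForestRegressor().fit(X, y)`.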

soxofaan commented 2 years ago

> If I understood @edzer correctly, then the data cube may look like this: ...

I think there is a problem here, as the "class" column is a string in the example that @clausmichele posted (https://raw.githubusercontent.com/clausmichele/openeo_aggregate_spatial_vector_cubes/master/urban_forest_points.geojson), which is probably not unusual in practice. And theoretically there is also a bit of a conflict between the float aggregation columns and the integer count columns. That's what @edzer was noting too:

> In the above geojson file the attributes are of different type, so can never be molded into a (useful) data cube dimension.

m-mohr commented 2 years ago

Yes, that's what I'm also struggling with right now in general. We need more discussions on this.

clausmichele commented 2 years ago

Well, if the string is the issue, it can easily be converted into a number.

soxofaan commented 2 years ago

> Well, if the string is the issue, it can easily be converted into a number.

In this example it's probably easy. But I don't think you can do that "easily" or automatically in general. I think it indicates that there is something conceptually wrong in how we define/handle vector cubes.

You could also theoretically argue that mixing the float aggregation columns with integer count columns in the same "cube" is bad practice.
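For reference, the "easy" conversion in the simple case might look like the sketch below. The function name `encode_labels` and the label values are hypothetical. Note that the codes depend on which labels happen to be present and imply an order that the classes do not have, and the mapping has to be stored alongside the cube to decode predictions, which is part of why this is hard to do automatically in general.

```python
def encode_labels(labels):
    """Map each distinct string label to an integer code.

    The codes are arbitrary identifiers: their numeric order (here the
    alphabetical order of the labels) carries no meaning.
    """
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


# Hypothetical class labels, one per geometry.
codes, mapping = encode_labels(["urban", "forest", "urban"])
print(codes)
print(mapping)
```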
