Closed: m-mohr closed this issue 2 years ago
We'll need training as well, as the saved model formats may be specific to the implementation used? I'm also not sure if the process should be limited to random forest, we're also thinking about using catboost, which supports nodata.
> We'll need training as well, as the saved model formats may be specific to the implementation used?
Ok good. I wasn't sure whether this would be provided through file upload but that's actually not yet a thing in Platform.
> I'm also not sure if the process should be limited to random forest, we're also thinking about using catboost, which supports nodata.
I guess that depends a lot on how the individual processes for training, classification and regression would look like afterwards. If you have a lot of parameters, they should probably be separate otherwise you end up in a mess with schemas. If they are just "choose a method and a file" or so, we might be able to merge them into a generic one. Let's see, I still need to do more research as I don't have a lot of experience with all this, unfortunately...
Recap of today's meeting on the randomForest process:
For more information, I put the presentation here. This is a kickstarter for the UC8 implementation.
Some feedback based on internal discussion at VITO:
- New process for sampling might be useful in the future (new Issue already @m-mohr?)
Yes, quickly opened one here: https://github.com/Open-EO/openeo-processes/issues/313
Thanks, helpful! Here is a sketch of the process(es), as I see them, high-level (for pixel-wise ML methods, such as RF). Following the ML terminology, I use labels for the response (e.g. crop type; either a class variable or a continuous variable) and features for the predictors (e.g. the bands, or bands x time, based on which a RF predicts a class given a model).
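To make the terminology concrete, a toy sketch in plain JavaScript (band names, values and classes are made up for illustration): each training sample is one pixel, its features are the band values, and its label is the response.

```js
// Toy illustration of the terminology (not openEO API):
// one training sample per pixel; features = band values
// (optionally bands x time, flattened); label = the response,
// e.g. a crop-type class.
const samples = [
  { features: [0.12, 0.34, 0.56], label: "maize" }, // hypothetical B02, B03, B04
  { features: [0.22, 0.31, 0.48], label: "wheat" },
  { features: [0.11, 0.36, 0.58], label: "maize" }
];
// A trained model is then a function from features to a predicted label.
```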
As @mattia6690 notes, there are two separate steps: A train model, B predict on new features
See below for how we get to these input data, e.g. from polygons
B1: predict the class: `reduce_dimension` with
- `data`: raster data cube with features as a dimension
- `dimension`: feature dimension name
- `context`: the "model"
- `reducer`: needs to be defined: takes the model, returns the class

B2: predict the class probabilities: `apply_dimension` with `target_dimension = "class"` and
- `process`: needs to be defined: takes the model, returns the class probabilities
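Conceptually, the two callbacks marked "needs to be defined" could look like the following toy sketch (plain JavaScript, not openEO; the assumption that `model` is an array of per-tree decision functions is purely illustrative, a real saved model is an opaque artifact): the class prediction (B1) is the majority vote over the trees, the class probabilities (B2) are the vote fractions.

```js
// Toy sketch of the two prediction callbacks (illustrative only).
// model: array of "trees", each mapping a feature vector to a class.
function predictClass(model, features) {
  const votes = {};
  for (const tree of model) {
    const c = tree(features);
    votes[c] = (votes[c] || 0) + 1;
  }
  // B1: majority vote -> single class
  return Object.keys(votes).reduce((a, b) => (votes[a] >= votes[b] ? a : b));
}

function predictProbabilities(model, features) {
  const probs = {};
  for (const tree of model) {
    const c = tree(features);
    // B2: fraction of trees voting for each class
    probs[c] = (probs[c] || 0) + 1 / model.length;
  }
  return probs;
}
```

For example, with three stub trees voting maize, maize, wheat, `predictClass` returns "maize" and the class probabilities are 2/3 and 1/3.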
Typical steps needed before we can train the model (A3) are:
case A1: training data consist of polygons and their class values, where polygons are uniform in their class value. This needs a method to either:
- (A1.2) obtain the `POINT` geometries of pixel centers inside the polygon, with associated polygon ID

output of A1: point locations + labels -> go to case A2

case A2: extract features at the training point locations: we think this should happen with `aggregate_spatial` when called with `POINT` geometries (although no aggregation takes place):
- `data`: raster data cube with features
- `geometries`: `POINT` locations + labels
- `reducer`: `array_element` with index 0

case A3: train model:
Note that step A1.2 + A2, i.e. for a set of polygons and a raster (cube), return the raster pixel centers and all the associated pixel values, is a very common operation; in R it is usually called `extract`.
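A client-side `extract` (A1.2 + A2 combined) could be sketched as follows. This is a minimal sketch under simplifying assumptions that are not in the discussion above: the "polygon" is an axis-aligned rectangle (a real implementation needs a point-in-polygon test), and the raster is a single band stored row-major with a top-left origin and square pixels.

```js
// Minimal client-side sketch of "extract" (A1.2 + A2): for a polygon
// and a raster, return the pixel centers inside the polygon together
// with the associated pixel values.
// Simplifications: rectangle instead of polygon; single band,
// row-major values, top-left origin, square pixels of size res.
function extract(raster, rect) {
  const { origin, res, ncols, nrows, values } = raster;
  const points = [];
  for (let row = 0; row < nrows; row++) {
    for (let col = 0; col < ncols; col++) {
      const x = origin.x + (col + 0.5) * res; // pixel center x
      const y = origin.y - (row + 0.5) * res; // pixel center y (origin is top-left)
      if (x >= rect.xmin && x <= rect.xmax &&
          y >= rect.ymin && y <= rect.ymax) {
        points.push({ x, y, value: values[row * ncols + col] });
      }
    }
  }
  return points;
}
```

For a 2x2 raster with origin (0, 2) and resolution 1, the pixel centers are (0.5, 1.5), (1.5, 1.5), (0.5, 0.5), (1.5, 0.5); a rectangle covering only the top-left quarter selects exactly one of them.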
Nice overview! For A1.1 we will first write a script that does this client side, where we have all flexibility to do that in whatever way we like, but I'm not opposed to also defining it as a process, same for A1.2, which seems even simpler.
For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?
> For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?
Yes, that's actually what we discussed yesterday but Edzer did not mention it explicitly. So to visualize it with a bit of JS-like pseudo-code for B1:
```js
p = new ProcessBuilder()
cube = p.load_collection('S2')
model = p.load_ml_model('my_model_job')
reducer = function(data, context) {
  return this.predict_rf(data = data, model = context)
}
x = p.reduce_dimension(data = cube, reducer = reducer, dimension = 'bands', context = model)
...
```
Not fully fleshed out yet, but to give an idea...
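For reference, the pseudo-code above might roughly serialize to a process graph like the one built below as plain JS objects. The node ids (`load1`, `model1`, ...) are arbitrary, and `predict_rf` as well as the exact argument names of `load_ml_model` are assumptions, since those processes do not exist yet.

```js
// Rough sketch of the serialized process graph (hypothetical
// node ids; predict_rf and load_ml_model are not defined yet).
const reducer = {
  process_graph: {
    predict1: {
      process_id: "predict_rf",
      arguments: {
        data: { from_parameter: "data" },
        model: { from_parameter: "context" }
      },
      result: true
    }
  }
};

const processGraph = {
  load1: {
    process_id: "load_collection",
    arguments: { id: "S2" }
  },
  model1: {
    process_id: "load_ml_model",
    arguments: { id: "my_model_job" } // argument name is an assumption
  },
  reduce1: {
    process_id: "reduce_dimension",
    arguments: {
      data: { from_node: "load1" },
      dimension: "bands",
      context: { from_node: "model1" },
      reducer: reducer
    },
    result: true
  }
};
```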
We need two (or one?) new processes for Random Forest that support classification and regression.
Would training happen outside of openEO for now?
Implementations:
PS: That's a lot of parameters, wow!
-> Related: save_model / load_model with GLMLC metadata: #300