Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0

Random Forest: Training/Regression, Classifier/Predicting... #295

Closed m-mohr closed 2 years ago

m-mohr commented 2 years ago

We need two (or one?) new processes for Random Forest that support classification and regression.

Would training happen outside of openEO for now?

Implementations:

PS: That's a lot of parameters, wow!

-> Related: save_model / load_model with GLMLC metadata: #300

jdries commented 2 years ago

We'll need training as well, as the saved model formats may be specific to the implementation used? I'm also not sure if the process should be limited to random forest, we're also thinking about using catboost, which supports nodata.

m-mohr commented 2 years ago

We'll need training as well, as the saved model formats may be specific to the implementation used?

Ok good. I wasn't sure whether this would be provided through file upload but that's actually not yet a thing in Platform.

I'm also not sure if the process should be limited to random forest, we're also thinking about using catboost, which supports nodata.

I guess that depends a lot on what the individual processes for training, classification and regression would look like afterwards. If you have a lot of parameters, they should probably be separate, otherwise you end up in a mess with schemas. If they are just "choose a method and a file" or so, we might be able to merge them into a generic one. Let's see, I still need to do more research as I don't have a lot of experience with all this, unfortunately...

mattia6690 commented 2 years ago

Recap of today's meeting on the randomForest process:

For more information, I put the presentation here. This serves as a kick-start for the UC8 implementation.

jdries commented 2 years ago

Some feedback based on internal discussion at VITO:

m-mohr commented 2 years ago
  • New process for sampling might be useful in the future (new Issue already @m-mohr?)

Yes, quickly opened one here: https://github.com/Open-EO/openeo-processes/issues/313

edzer commented 2 years ago

Thanks, helpful! Here is a high-level sketch of the process(es) as I see them (for pixel-wise ML methods, such as RF). Following ML terminology, I use labels for the response (e.g. crop type; either a class variable or a continuous variable) and features for the predictors (e.g. the bands, or bands x time, from which an RF predicts a class, given a model).

As @mattia6690 notes, there are two separate steps: (A) train a model, (B) predict on new features

A train model

See below for how we get to these input data, e.g. from polygons

B Predict (classify, regress)

data for A: train model

Typical steps needed before we can train the model (A3) are:

Note that steps A1.2 + A2 (for a set of polygons and a raster (cube), return the raster pixel centers and all the associated pixel values) form a very common operation; in R it is usually called extract.
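To make the extract step concrete, here is a minimal toy sketch in Python/NumPy. All names (extract, cube layout, indices) are illustrative assumptions for this comment, not part of the openEO API; a real implementation would take polygons and compute which pixel centers fall inside them first.

```python
# Hypothetical sketch of the "extract" operation (steps A1.2 + A2):
# given a raster cube and sampled pixel locations, collect the band
# values (features) per location. Not openEO API; toy NumPy only.
import numpy as np

def extract(cube, rows, cols):
    """Return an (n_samples, n_bands) feature matrix for the given pixel indices.

    cube: NumPy array with shape (bands, y, x)
    rows, cols: 1-D arrays of pixel indices (e.g. pixel centers inside polygons)
    """
    # advanced indexing picks one (y, x) pixel per band; transpose so that
    # each row is one sample (pixel) and each column one band (feature)
    return cube[:, rows, cols].T

# toy cube: 3 bands, 4x4 pixels, values 0..47
cube = np.arange(3 * 4 * 4).reshape(3, 4, 4)
features = extract(cube, np.array([0, 2]), np.array([1, 3]))
print(features.shape)  # (2, 3): 2 sampled pixels, 3 band values each
```

The resulting (samples x features) matrix, together with the per-sample labels, is exactly the input shape that step A3 (model training) expects.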

jdries commented 2 years ago

Nice overview! For A1.1 we will first write a script that does this client side, where we have all flexibility to do that in whatever way we like, but I'm not opposed to also defining it as a process, same for A1.2, which seems even simpler.

For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?

m-mohr commented 2 years ago

For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?

Yes, that's actually what we discussed yesterday but Edzer did not mention it explicitly. So to visualize it with a bit of JS-like pseudo-code for B1:

p = new ProcessBuilder()
cube = p.load_collection('S2')
model = p.load_ml_model('my_model_job')
reducer = function(data, context) {
  // 'context' carries the loaded model into the callback
  return this.predict_rf(data = data, model = context)
}
x = p.reduce_dimension(data = cube, reducer = reducer, dimension = 'bands', context = model)
...

Not fully fleshed out yet, but to give an idea...
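For comparison, here is what the two steps could look like outside openEO, sketched with scikit-learn's RandomForestClassifier. The reshaping mimics reducing the 'bands' dimension: each pixel's band vector becomes one sample for predict(). Shapes and variable names are illustrative assumptions, not a proposed openEO implementation.

```python
# Hedged sketch: (A) train on extracted samples, (B) predict per pixel.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# A: train on an (n_samples, n_features) matrix with labels,
# e.g. as produced by an extract-style step
X_train = np.array([[0.1, 0.8], [0.2, 0.7], [0.9, 0.1], [0.8, 0.2]])
y_train = np.array([0, 0, 1, 1])  # e.g. crop-type classes
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# B: predict over a cube of shape (bands, y, x) by flattening pixels,
# which is conceptually what reduce_dimension over 'bands' does per pixel
cube = np.random.RandomState(0).rand(2, 4, 4)  # 2 bands, 4x4 pixels
pixels = cube.reshape(2, -1).T                 # (16 pixels, 2 band features)
labels = model.predict(pixels).reshape(4, 4)   # back to raster layout
print(labels.shape)  # (4, 4)
```

Note how prediction drops the bands dimension and yields a single label layer, which matches modelling predict_rf as a reducer callback rather than as a special-cased process.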