Closed: m-mohr closed this issue 2 years ago
We'll need training as well, as the saved model formats may be specific to the implementation used? I'm also not sure if the process should be limited to random forest, we're also thinking about using catboost, which supports nodata.
> We'll need training as well, as the saved model formats may be specific to the implementation used?
Ok good. I wasn't sure whether this would be provided through file upload but that's actually not yet a thing in Platform.
> I'm also not sure if the process should be limited to random forest, we're also thinking about using catboost, which supports nodata.
I guess that depends a lot on how the individual processes for training, classification and regression would look like afterwards. If you have a lot of parameters, they should probably be separate otherwise you end up in a mess with schemas. If they are just "choose a method and a file" or so, we might be able to merge them into a generic one. Let's see, I still need to do more research as I don't have a lot of experience with all this, unfortunately...
Recap of today's meeting on the randomForest process:
For more information, I put the presentation here. This is a kickstarter for the UC8 implementation.
Some feedback based on internal discussion at VITO:
- New process for sampling might be useful in the future (new Issue already @m-mohr?)
Yes, quickly opened one here: https://github.com/Open-EO/openeo-processes/issues/313
Thanks, helpful! Here is a sketch of the process(es), as I see them, high-level (for pixel-wise ML methods, such as RF). Following the ML terminology, I use labels for the response (e.g. crop type; either a class variable or a continuous variable) and features for the predictors (e.g. the bands, or bands x time, based on which a RF predicts a class given a model).
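To make the terminology concrete, a toy sketch in plain JavaScript (band names, values and classes are made up for illustration): each training sample is one pixel, its features are the band values, and its label is the response.

```js
// Toy illustration of the terminology (not openEO API):
// one training sample per pixel; features = band values
// (optionally bands x time, flattened); label = the response,
// e.g. a crop-type class.
const samples = [
  { features: [0.12, 0.34, 0.56], label: "maize" }, // hypothetical B02, B03, B04
  { features: [0.22, 0.31, 0.48], label: "wheat" },
  { features: [0.11, 0.36, 0.58], label: "maize" }
];
// A trained model is then a function from features to a predicted label.
```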
As @mattia6690 notes, there are two separate steps: A train model, B predict on new features
See below for how we get to these input data, e.g. from polygons
B1: predict the class: `reduce_dimension` with
- `data`: raster data cube with features as a dimension
- `dimension`: feature dimension name
- `context`: the "model"
- `reducer`: needs to be defined: takes the model, returns the class

B2: predict the class probabilities: `apply_dimension` with `target_dimension = "class"` and
- `process`: needs to be defined: takes the model, returns the class probabilities
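Conceptually, the two callbacks marked "needs to be defined" could look like the following toy sketch (plain JavaScript, not openEO; the assumption that `model` is an array of per-tree decision functions is purely illustrative, a real saved model is an opaque artifact): the class prediction (B1) is the majority vote over the trees, the class probabilities (B2) are the vote fractions.

```js
// Toy sketch of the two prediction callbacks (illustrative only).
// model: array of "trees", each mapping a feature vector to a class.
function predictClass(model, features) {
  const votes = {};
  for (const tree of model) {
    const c = tree(features);
    votes[c] = (votes[c] || 0) + 1;
  }
  // B1: majority vote -> single class
  return Object.keys(votes).reduce((a, b) => (votes[a] >= votes[b] ? a : b));
}

function predictProbabilities(model, features) {
  const probs = {};
  for (const tree of model) {
    const c = tree(features);
    // B2: fraction of trees voting for each class
    probs[c] = (probs[c] || 0) + 1 / model.length;
  }
  return probs;
}
```

For example, with three stub trees voting maize, maize, wheat, `predictClass` returns "maize" and the class probabilities are 2/3 and 1/3.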
Typical steps needed before we can train the model (A3) are:
case A1: training data consist of polygons and their class values, where polygons are uniform in their class value. This needs a method to either:
- (A1.2) obtain the `POINT` geometries of pixel centers inside the polygon, with associated polygon ID

output of A1: point locations + labels -> go to case A2

case A2: extract features at the training point locations: we think this should happen with `aggregate_spatial` when called with `POINT` geometries (although no aggregation takes place):
- `data`: raster data cube with features
- `geometries`: `POINT` locations + labels
- `reducer`: `array_element` with index 0

case A3: train model:
Note that step A1.2 + A2, i.e. for a set of polygons and a raster (cube), return the raster pixel centers and all the associated pixel values, is a very common operation; in R it is usually called `extract`.
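A client-side `extract` (A1.2 + A2 combined) could be sketched as follows. This is a minimal sketch under simplifying assumptions that are not in the discussion above: the "polygon" is an axis-aligned rectangle (a real implementation needs a point-in-polygon test), and the raster is a single band stored row-major with a top-left origin and square pixels.

```js
// Minimal client-side sketch of "extract" (A1.2 + A2): for a polygon
// and a raster, return the pixel centers inside the polygon together
// with the associated pixel values.
// Simplifications: rectangle instead of polygon; single band,
// row-major values, top-left origin, square pixels of size res.
function extract(raster, rect) {
  const { origin, res, ncols, nrows, values } = raster;
  const points = [];
  for (let row = 0; row < nrows; row++) {
    for (let col = 0; col < ncols; col++) {
      const x = origin.x + (col + 0.5) * res; // pixel center x
      const y = origin.y - (row + 0.5) * res; // pixel center y (origin is top-left)
      if (x >= rect.xmin && x <= rect.xmax &&
          y >= rect.ymin && y <= rect.ymax) {
        points.push({ x, y, value: values[row * ncols + col] });
      }
    }
  }
  return points;
}
```

For a 2x2 raster with origin (0, 2) and resolution 1, the pixel centers are (0.5, 1.5), (1.5, 1.5), (0.5, 0.5), (1.5, 0.5); a rectangle covering only the top-left quarter selects exactly one of them.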
Nice overview! For A1.1 we will first write a script that does this client side, where we have all flexibility to do that in whatever way we like, but I'm not opposed to also defining it as a process, same for A1.2, which seems even simpler.
For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?
> For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?
Yes, that's actually what we discussed yesterday but Edzer did not mention it explicitly. So to visualize it with a bit of JS-like pseudo-code for B1:
```js
p = new ProcessBuilder()
cube = p.load_collection('S2')
model = p.load_ml_model('my_model_job')
reducer = function(data, context) {
  return this.predict_rf(data = data, model = context)
}
x = p.reduce_dimension(data = cube, reducer = reducer, dimension = 'bands', context = model)
...
```
Not fully fleshed out yet, but to give an idea...
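For reference, the pseudo-code above might roughly serialize to a process graph like the one built below as plain JS objects. The node ids (`load1`, `model1`, ...) are arbitrary, and `predict_rf` as well as the exact argument names of `load_ml_model` are assumptions, since those processes do not exist yet.

```js
// Rough sketch of the serialized process graph (hypothetical
// node ids; predict_rf and load_ml_model are not defined yet).
const reducer = {
  process_graph: {
    predict1: {
      process_id: "predict_rf",
      arguments: {
        data: { from_parameter: "data" },
        model: { from_parameter: "context" }
      },
      result: true
    }
  }
};

const processGraph = {
  load1: {
    process_id: "load_collection",
    arguments: { id: "S2" }
  },
  model1: {
    process_id: "load_ml_model",
    arguments: { id: "my_model_job" } // argument name is an assumption
  },
  reduce1: {
    process_id: "reduce_dimension",
    arguments: {
      data: { from_node: "load1" },
      dimension: "bands",
      context: { from_node: "model1" },
      reducer: reducer
    },
    result: true
  }
};
```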
We need two (or one?) new processes for Random Forest that support classification and regression.
Would training happen outside of openEO for now?
Implementations:
PS: That's a lot of parameters, wow!
-> Related: save_model / load_model with GLMLC metadata: #300