Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0

predict_class and predict_probabilities #368

Open soxofaan opened 2 years ago

soxofaan commented 2 years ago

@JeroenVerstraelen is working on an implementation of CatBoost-based ML in the VITO backend, and while discussing details a couple of things came up:

So with this background, the proposal is to introduce two generic ML prediction processes, predict_class and predict_probabilities.

Both can easily be specified based on the current predict_random_forest proposal: https://github.com/Open-EO/openeo-processes/blob/draft/proposals/predict_random_forest.json

m-mohr commented 2 years ago

Sounds reasonable. What do others think? @clausmichele @LukeWeidenwalker

LukeWeidenwalker commented 2 years ago

Some thoughts:

Sorry for the wall of text - I realize this is out of scope for this issue, feel free to redirect this discussion somewhere more suitable.

soxofaan commented 2 years ago

> predict_class() is conceptually just an argmax over the probabilities, so predict_probabilities() + something extra.

In the VITO backend we had problems getting the probabilities out of Spark's RandomForest implementation, while getting the class was straightforward. So I think it's best to keep predict_class independent from predict_probabilities. I also expect predict_class to be used more often than predict_probabilities, so I would not make the former more cumbersome to use than the latter.
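As a minimal sketch of the relation discussed here, deriving a class from probabilities could look like the NumPy snippet below. It assumes a probability array of shape (classes, y, x); the function name `class_from_probabilities` and the class labels are hypothetical and not part of any openEO process specification.

```python
import numpy as np

def class_from_probabilities(probs, class_labels):
    """Return the most probable class per pixel.

    probs: array of shape (n_classes, y, x) with class probabilities.
    class_labels: sequence of n_classes labels (hypothetical values).
    """
    probs = np.asarray(probs)
    labels = np.asarray(class_labels)
    # argmax over the class axis gives the index of the winning class
    winner = np.argmax(probs, axis=0)
    return labels[winner]

# Two classes on a 1x2 grid: pixel 0 favours class 10, pixel 1 favours class 20.
probs = np.array([[[0.8, 0.3]],
                  [[0.2, 0.7]]])
classes = class_from_probabilities(probs, class_labels=[10, 20])
```

A back-end that can return the class directly but not the probabilities cannot take this route, which is the argument for keeping the two processes independent.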

> This process only makes sense for classification models.

Indeed, this proposal is only for classification models. In that light, it might be more future-proof to use predict_class_probabilities instead of predict_probabilities.

> but I'd be reluctant to go down a path where openeo attempts to replicate significant pieces of sklearn.

At the level of openEO processes we would, at best, replicate the API of sklearn in some abstracted way, which doesn't seem like a bad thing to me.

> Without replicating/exposing large parts of already existing ML libraries ..., can openeo enable users to train quality models? E.g. what about model evaluation? Training data selection? Hyperparameter tuning?

The long-term goal is probably to do this, but at the moment we are focused on training and inference as the core ML building blocks. Everything else (evaluation, tuning, ...) is currently expected to be done client side, which gives the user the most flexibility anyway.
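To illustrate what client-side evaluation could mean in practice, here is a small sketch that computes a confusion matrix and overall accuracy from predicted class labels once they have been downloaded. The function name and the sample labels are made up for illustration; nothing here is an openEO API.

```python
import numpy as np

def confusion_matrix(reference, predicted, classes):
    """Count label pairs; rows are reference classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for ref, pred in zip(reference, predicted):
        matrix[index[ref], index[pred]] += 1
    return matrix

# Hypothetical validation samples: reference labels vs. labels from predict_class.
reference = [1, 1, 2, 2, 2]
predicted = [1, 2, 2, 2, 1]
cm = confusion_matrix(reference, predicted, classes=[1, 2])
# Overall accuracy is the share of samples on the diagonal.
overall_accuracy = np.trace(cm) / cm.sum()
```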

> I don't have the historic context on this, but is the vision of openeo really to do 1) distributed 2) machine-learning 3) on EO data? Or is that an instance of feature creep?

I'd think openEO won't stay relevant if ML/AI in some form isn't part of the offering. In SRR3 we already covered use cases with ML.

m-mohr commented 1 year ago

I'm +1 on the original proposal. What do others think? How do we proceed?

clausmichele commented 1 year ago

I also agree with the proposal. We should move forward with the ML processes to keep openEO attractive! I also partly agree with Lukas, but I would say that for more advanced users and models we could still use UDFs.

gilbertocamara commented 1 year ago

Dear @m-mohr @clausmichele @edzer @dthiex, some thoughts on ML/DL processes in openEO:

IMHO, a recommended generic set of functions for ML/DL processes in openEO is:

train (data: training set, ml_method: ML/DL algorithm with optional params) -> model_type

predict (data: data_cube, model: model_type) -> probs_cube

smooth (data: probs_cube, smooth_function: function) -> probs_cube

label (data: probs_cube) -> map

Some relevant points regarding the above:

  1. There should be a generic data type that describes a training data set, whose elements include: lat/long locations and/or polygon geometries, classes or values associated with them, and (optionally) multidimensional time series values for the location. This data type needs to support both regression and classification, both for individual images and image time series.

  2. The parameter ml_method in the train function should be an abstract data type able to accommodate different kinds of ML/DL algorithms. One way of doing this is to use a generic function that provides the functions train and predict.

  3. The model_type abstract data type should be a closure, which is an object that stores a function together with an environment. The model_type ADT will contain captured variables that allow it to predict a new value given an input. By using closures, the model_type ADT will be completely generic. In this way, it will be much easier to extend openEO to include new models.

  4. It is recommended to define the predict function to be applied to a data_cube instead of an array. Data cubes have a rigorous definition (e.g., Appel and Pebesma). When using image time series, ML/DL algorithms work best with data cubes which are regular in space and time. In most programming environments, array is a loosely defined data type. Arrays can be sparse or have missing values, whereas data cubes need to be regular and dense in space and time.

  5. The output of the predict function works best if it is a set of probability maps. The literature on EO image classification consistently recommends post-processing of ML/DL classification results. Including a predict_label function is not recommended, because in general the results will be worse than those obtained after post-processing.

  6. The smooth function, as stated before, performs an important role in removing outliers. Its input would be a data cube with probability maps and a smoothing algorithm. Its output would be a set of probability maps.

  7. The label function is the easy one. It performs an argmax over the probability maps and outputs a labelled map.
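The four functions and the closure idea from the points above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the classifier is a toy nearest-centroid model (not what SITS or any openEO back-end uses), data cubes are plain dense arrays of shape (features, y, x), and `mean_3x3` is just one possible smoothing function. The function bodies and helper names are hypothetical.

```python
import numpy as np

def train(samples, labels):
    """Fit a toy nearest-centroid classifier and return it as a closure."""
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One centroid per class; these are the variables captured by the closure.
    centroids = np.stack([samples[labels == c].mean(axis=0) for c in classes])

    def model(features):
        """Map (n_pixels, n_features) to (n_pixels, n_classes) probabilities."""
        dist = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        # Softmax over negative distances turns closeness into probabilities.
        scores = np.exp(-(dist - dist.min(axis=1, keepdims=True)))
        return scores / scores.sum(axis=1, keepdims=True)

    model.classes = classes
    return model

def predict(cube, model):
    """Apply the model to a dense cube (n_features, y, x) -> (n_classes, y, x)."""
    n_features, ny, nx = cube.shape
    pixels = cube.reshape(n_features, -1).T   # (n_pixels, n_features)
    probs = model(pixels)                     # (n_pixels, n_classes)
    return probs.T.reshape(-1, ny, nx)

def mean_3x3(band):
    """3x3 moving average with edge padding (one possible smoother)."""
    padded = np.pad(band, 1, mode="edge")
    ny, nx = band.shape
    windows = [padded[dy:dy + ny, dx:dx + nx] for dy in range(3) for dx in range(3)]
    return np.mean(windows, axis=0)

def smooth(probs_cube, smooth_function):
    """Apply a smoothing function to each class probability map."""
    smoothed = np.stack([smooth_function(band) for band in probs_cube])
    # Renormalise so the per-pixel probabilities still sum to one.
    return smoothed / smoothed.sum(axis=0, keepdims=True)

def label(probs_cube, classes):
    """Pick the most probable class per pixel."""
    return classes[np.argmax(probs_cube, axis=0)]

# Hypothetical training set: two features, two classes.
model = train(samples=[[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]],
              labels=[1, 1, 2, 2])
cube = np.zeros((2, 3, 3))
cube[:, 1, 1] = 1.0   # a single isolated pixel that resembles class 2
probs = predict(cube, model)
raw_map = label(probs, model.classes)
smooth_map = label(smooth(probs, mean_3x3), model.classes)
```

In this toy case the isolated centre pixel is labelled as class 2 in `raw_map` but is absorbed into class 1 in `smooth_map`, which is the kind of outlier removal that point 6 describes. Because `predict` only calls `model(...)`, any other algorithm that returns a closure with the same shape contract could be swapped in without changing the rest of the pipeline.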

The API proposed above would be a minimal set of ML/DL classification functions for openEO. Extensions that could be considered later include:

(a) Measuring classification uncertainty. (b) Model tuning. (c) Active learning methods.

Full disclosure: this proposal is based on our six years of experience in the development of SITS. All of the above functions (and extensions) are implemented and operational in SITS. In terms of openEO, this means that such an API could be readily implemented in a SITS back-end. Hopefully, this would motivate other openEO back-end developers to follow suit.

gilbertocamara commented 1 year ago

Dear @clausmichele

> I also agree with the proposal. We should move forward with the ML processes to keep openEO attractive! I also partly agree with Lukas, but I would say that for more advanced users and models we could still use UDFs.

Not sure I agree. UDFs are non-standard and should be discouraged, because they fragment the openEO landscape and would thus undermine the purpose of openEO, which is to have a single interface supported by different back-ends.

m-mohr commented 1 year ago

Potentially interesting for "bring your own model": https://onnx.ai/