soxofaan opened 2 years ago
Sounds reasonable. What do others think? @clausmichele @LukeWeidenwalker
Some thoughts:

- `predict_class()` is conceptually just an `argmax` over the probabilities, so `predict_probabilities()` + something extra. Would it make sense to introduce an `argmax` process instead? I don't think it's asking too much of users to understand that for classification you need probabilities for each class.
- sklearn distinguishes different kinds of models (`Regressor`, `Classifier`, `Cluster`, etc.). If we want to know exactly what the output from an inference process will be (scalar? array of probabilities? array of indices, in the case of clustering?), we need to know what kind of model we're dealing with.

Curious what others think, but I'd be reluctant to go down a path where openeo attempts to replicate significant pieces of sklearn.

Sorry for the wall of text - I realize this is out of scope for this issue; feel free to redirect this discussion somewhere more suitable.
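To make the "`predict_class()` is just an argmax" point concrete, here is a minimal sketch (hypothetical function names, plain NumPy; not part of any openEO spec):

```python
import numpy as np

def predict_probabilities(model_output):
    # Stand-in for a hypothetical predict_probabilities process:
    # returns one probability per class.
    return np.asarray(model_output)

def predict_class(model_output):
    # predict_class as "predict_probabilities + argmax": the class
    # index is simply the position of the highest probability.
    probs = predict_probabilities(model_output)
    return int(np.argmax(probs))

print(predict_class([0.1, 0.7, 0.2]))  # class index 1
```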
> `predict_class()` is conceptually just an argmax over the probabilities, so `predict_probabilities()` + something extra.
In the VITO backend we had problems getting the probabilities out of Spark's RandomForest implementation, while getting the class was straightforward. So I think it's best to have `predict_class` independent from `predict_probabilities`. I also guess that `predict_class` usage will be more common than `predict_probabilities`, so I would not make the former more cumbersome to use than the latter.
> This process only makes sense for classification models.
Indeed, this proposal is only for classification models. In that light, it might be more future-proof to use `predict_class_probabilities` instead of `predict_probabilities`.
> but I'd be reluctant to go down a path where openeo attempts to replicate significant pieces of sklearn.
At the level of openEO processes we would, at best, replicate the API of sklearn in some abstracted way, which doesn't seem like a bad thing to me.
> Without replicating/exposing large parts of already existing ML libraries ..., can openeo enable users to train quality models? E.g. what about model evaluation? Training data selection? Hyperparameter tuning?
The long-term goal is probably to do this, but at the moment we are just focused on training and inference as the core ML building blocks. All the rest (evaluation, tuning, ...) is for now expected to be done client-side, which gives the user the most flexibility anyway.
> I don't have the historic context on this, but is the vision of openeo really to do 1) distributed 2) machine-learning 3) on EO data? Or is that an instance of feature creep?
openEO won't stay relevant if ML/AI in some form isn't part of the offering I'd think. In SRR3 we already did use cases with ML.
I'm +1 on the original proposal. What do others think? How do we proceed?
I also agree with the proposal. We should move forward with the ML processes to keep openEO attractive! I also partly agree with Lukas, but I would say that for more advanced users and models we could still use UDFs.
Dear @m-mohr @clausmichele @edzer @dthiex, some thoughts on ML/DL processes in openEO:
IMHO, a recommended generic set of functions for ML/DL processes in openEO is:
```
train(data: training_set, ml_method: ML/DL algorithm with optional params) -> model_type
predict(data: data_cube, model: model_type) -> probs_cube
smooth(data: probs_cube, smooth_function: function) -> probs_cube
label(data: probs_cube) -> map
```
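As an illustration, the four proposed functions could be chained roughly like this. A toy Python sketch with plain NumPy arrays standing in for data cubes; `toy_method` and all implementations here are hypothetical:

```python
import numpy as np

def train(data, labels, ml_method):
    # ml_method fits a model and returns a per-pixel predictor.
    return ml_method(data, labels)

def predict(cube, model):
    # Apply the model to every pixel vector; the result is a cube of
    # per-class probabilities (probs_cube).
    return np.apply_along_axis(model, -1, cube)

def smooth(probs_cube, smooth_function):
    return smooth_function(probs_cube)

def label(probs_cube):
    # Most probable class per pixel.
    return np.argmax(probs_cube, axis=-1)

# Toy "model": class probability proportional to closeness to 0.0 vs 1.0.
def toy_method(data, labels):
    def model(pixel):
        d0, d1 = abs(pixel[0] - 0.0), abs(pixel[0] - 1.0)
        return np.array([d1, d0]) / (d0 + d1)
    return model

cube = np.array([[[0.1], [0.9]]])            # a 1x2 "image" with one band
model = train(cube, None, toy_method)
classes = label(smooth(predict(cube, model), lambda p: p))  # [[0, 1]]
```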
Some relevant points regarding the above:
There should be a generic data type that describes a training data set, whose elements include: lat/long locations and/or polygon geometries, the classes or values associated with them, and (optionally) multidimensional time series values for each location. This data type needs to support both regression and classification, both for individual images and for image time series.
The parameter `ml_method` in the `train_model` function should be an abstract data type able to fit different kinds of ML/DL algorithms. One way of doing this is to use a generic function that provides `train` and `predict` functions.
The `model_type` abstract data type should be a closure, which is an object that stores a function together with an environment. The `model_type` ADT will contain captured variables that allow it to predict a new value given an input. By using closures, the `model_type` ADT will be completely generic; in this way, it will be much easier to extend openEO to include new models.
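A toy sketch of the closure idea (all names hypothetical; a real back-end would capture a fitted ML model rather than a threshold):

```python
# model_type as a closure: train() captures the fitted state in the
# enclosing environment and returns a predict function.
def train_threshold_classifier(samples, labels):
    # Toy "training": use the mean of class-1 samples as a threshold.
    ones = [s for s, l in zip(samples, labels) if l == 1]
    threshold = sum(ones) / len(ones)

    def predict(value):
        # The closure captures `threshold`; callers need no knowledge of
        # the model internals, so any kind of model fits this interface.
        return 1 if value >= threshold else 0

    return predict

model = train_threshold_classifier([1.0, 2.0, 10.0, 12.0], [0, 0, 1, 1])
# threshold is 11.0 here, so model(12.5) -> 1 and model(3.0) -> 0
```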
It is recommended to define the `predict` function to be applied to a `data_cube` instead of an `array`. Data cubes have a rigorous definition (e.g., Appel and Pebesma). When using image time series, ML/DL algorithms work best with data cubes that are regular in space and time. In most programming environments, `array` is a loosely defined data type: arrays can be sparse or have missing values, whereas data cubes need to be regular and dense in space and time.
The output of the `predict` function works best if it is a set of probability maps. The literature on EO image classification is consistent in recommending post-processing of ML/DL classification results. Including a `predict_label` function is not recommended, because in general the results will be worse than those obtained after post-processing.
The `smooth` function, as stated before, performs an important role in removing outliers. Its input would be a data cube with probability maps and a smoothing algorithm; its output would be a set of probability maps.
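For illustration, a minimal probability-map smoother (a plain 3x3 moving average over NumPy arrays; the choice of filter is hypothetical, since the proposal leaves the smoothing algorithm as a parameter):

```python
import numpy as np

def mean_filter(band):
    # Minimal 3x3 moving-average smoother for one probability band.
    padded = np.pad(band, 1, mode="edge")
    rows, cols = band.shape
    return sum(padded[i:i + rows, j:j + cols]
               for i in range(3) for j in range(3)) / 9.0

def smooth(probs_cube, smooth_function):
    # Apply the smoother band-wise: one band per class probability.
    return np.stack([smooth_function(probs_cube[..., k])
                     for k in range(probs_cube.shape[-1])], axis=-1)

# An isolated outlier pixel gets pulled toward its neighbours:
probs = np.zeros((3, 3, 1))
probs[1, 1, 0] = 0.9
smoothed = smooth(probs, mean_filter)      # centre value becomes 0.1
```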
The `label` function is the easy one: it takes the most likely class for each pixel (an argmax over the probability maps) and outputs a labelled map.
The API proposed above would be a minimal set of ML/DL classification functions for openEO. Extensions that could be considered later include:
(a) Measuring classification uncertainty. (b) Model tuning. (c) Active learning methods.
Full disclosure: this proposal is based on our 6-year experience in the development of SITS. All of the above functions (and extensions) are implemented and operational in SITS. In terms of openEO, this means that such an API for openEO will be readily implemented in a SITS back-end. Hopefully, this would motivate other openEO back-end developers to follow suit.
Dear @clausmichele
> I also agree with the proposal. We should move forward with the ML processes to keep openEO attractive! I also partly agree with Lukas, but I would say that for more advanced users and models we could still use UDFs.
Not sure I agree. UDFs are non-standard and should be discouraged, because they fragment the openEO landscape and thus would undermine the purpose of openEO which is to have a single interface supported by different back-ends.
Potentially interesting for "bring your own model": https://onnx.ai/
@JeroenVerstraelen is working on an implementation of CatBoost-based ML in the VITO backend, and while discussing details a couple of things came up:

- A `predict_catboost` process would be practically identical to `predict_random_forest`, except for some textual differences in title and descriptions. It turns out that it is not really necessary to define a dedicated `predict_*` process for each kind of machine learning model: all the model details are embedded in the `ml-model` object, so you could just use a single `predict(data: array, model: ml-model)` for all kinds of ML models.
- Class prediction and probabilities prediction follow two different usage patterns: one is used in `reduce_dimension` and the other in `apply_dimension`. It felt error prone and confusing to let these two different patterns depend on a rather inconspicuous boolean parameter. It might be better to have separate processes for class prediction and probabilities prediction.

So with this background, the proposal is to introduce two generic ML prediction processes:
- `predict_class(data: array, model: ml-model) -> number`
- `predict_probabilities(data: array, model: ml-model) -> array`
Both can be easily spec'ed based on the current https://github.com/Open-EO/openeo-processes/blob/draft/proposals/predict_random_forest.json
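As a rough illustration of why a single pair of generic processes suffices, here is a Python sketch in which the model object (a stand-in for openEO's `ml-model` type; all names hypothetical) embeds its own inference logic:

```python
class MLModel:
    # Stand-in for the ml-model object: bundles the model kind with
    # everything needed for inference.
    def __init__(self, kind, predict_fn):
        self.kind = kind              # e.g. "random-forest", "catboost"
        self._predict = predict_fn

def predict_probabilities(data, model):
    # One generic process for all model kinds: dispatch lives inside
    # the ml-model object, not in per-algorithm predict_* processes.
    return model._predict(data)

def predict_class(data, model):
    # predict_class = predict_probabilities + argmax over classes.
    probs = predict_probabilities(data, model)
    return max(range(len(probs)), key=probs.__getitem__)

rf = MLModel("random-forest", lambda data: [0.2, 0.8])
cb = MLModel("catboost", lambda data: [0.9, 0.1])
# The same two processes work for both model kinds:
# predict_class(x, rf) -> 1, predict_class(x, cb) -> 0
```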