soxofaan opened 2 years ago
Sounds reasonable. What do others think? @clausmichele @LukeWeidenwalker
Some thoughts:

- `predict_class()` is conceptually just an `argmax` over the probabilities, so `predict_probabilities()` + something extra. Would it make sense to introduce an `argmax` process instead? I don't think it's asking too much of users to understand that for classification you need probabilities for each class.
- sklearn distinguishes different kinds of models (`Regressor`, `Classifier`, `Cluster`, etc.). If we want to know exactly what the output from an inference process will be (scalar? array of probabilities? array of indices, in the case of clustering?), we need to know what kind of model we're dealing with.

Curious what others think, but I'd be reluctant to go down a path where openeo attempts to replicate significant pieces of sklearn.

Sorry for the wall of text - I realize this is out of scope for this issue; feel free to redirect this discussion somewhere more suitable.
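To make the "`predict_class()` is just an argmax" point concrete, here is a minimal sketch (hypothetical function names, plain NumPy; not part of any openEO spec):

```python
import numpy as np

def predict_probabilities(model_output):
    # Stand-in for a hypothetical predict_probabilities process:
    # returns one probability per class.
    return np.asarray(model_output)

def predict_class(model_output):
    # predict_class as "predict_probabilities + argmax": the class
    # index is simply the position of the highest probability.
    probs = predict_probabilities(model_output)
    return int(np.argmax(probs))

print(predict_class([0.1, 0.7, 0.2]))  # class index 1
```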
> `predict_class()` is conceptually just an argmax over the probabilities, so `predict_probabilities()` + something extra.
In the VITO backend we had problems getting the probabilities out of Spark's RandomForest implementation, while getting the class was straightforward. So I think it's best to have `predict_class` independent from `predict_probabilities`. I also guess that `predict_class` usage will be more common than `predict_probabilities`, so I would not make the former more cumbersome to use than the latter.
> This process only makes sense for classification models.
Indeed, this proposal is only for classification models. In that light, it might be more future-proof to use `predict_class_probabilities` instead of `predict_probabilities`.
> but I'd be reluctant to go down a path where openeo attempts to replicate significant pieces of sklearn.
At the level of openEO processes we would, at best, replicate the API of sklearn in some abstracted way, which doesn't seem like a bad thing to me.
> Without replicating/exposing large parts of already existing ML libraries ..., can openeo enable users to train quality models? E.g. what about model evaluation? Training data selection? Hyperparameter tuning?
The long-term goal is probably to do this, but at the moment we are just focused on training and inference as the core ML building blocks. All the rest (evaluation, tuning, ...) is for now expected to be done client-side, which gives the user the most flexibility anyway.
> I don't have the historic context on this, but is the vision of openeo really to do 1) distributed 2) machine-learning 3) on EO data? Or is that an instance of feature creep?
openEO won't stay relevant if ML/AI in some form isn't part of the offering I'd think. In SRR3 we already did use cases with ML.
I'm +1 on the original proposal. What do others think? How do we proceed?
I also agree with the proposal. We should move forward with the ML processes to keep openEO attractive! I also partly agree with Lukas, but I would say that for more advanced users and models we could still use UDFs.
Dear @m-mohr @clausmichele @edzer @dthiex, some thoughts on ML/DL processes in openEO:
IMHO, a recommended generic set of functions for ML/DL processes in openEO is:
```
train(data: training_set, ml_method: ML/DL algorithm with optional params) -> model_type
predict(data: data_cube, model: model_type) -> probs_cube
smooth(data: probs_cube, smooth_function: function) -> probs_cube
label(data: probs_cube) -> map
```
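As an illustration, the four proposed functions could be chained roughly like this. A toy Python sketch with plain NumPy arrays standing in for data cubes; `toy_method` and all implementations here are hypothetical:

```python
import numpy as np

def train(data, labels, ml_method):
    # ml_method fits a model and returns a per-pixel predictor.
    return ml_method(data, labels)

def predict(cube, model):
    # Apply the model to every pixel vector; the result is a cube of
    # per-class probabilities (probs_cube).
    return np.apply_along_axis(model, -1, cube)

def smooth(probs_cube, smooth_function):
    return smooth_function(probs_cube)

def label(probs_cube):
    # Most probable class per pixel.
    return np.argmax(probs_cube, axis=-1)

# Toy "model": class probability proportional to closeness to 0.0 vs 1.0.
def toy_method(data, labels):
    def model(pixel):
        d0, d1 = abs(pixel[0] - 0.0), abs(pixel[0] - 1.0)
        return np.array([d1, d0]) / (d0 + d1)
    return model

cube = np.array([[[0.1], [0.9]]])            # a 1x2 "image" with one band
model = train(cube, None, toy_method)
classes = label(smooth(predict(cube, model), lambda p: p))  # [[0, 1]]
```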
Some relevant points regarding the above:
There should be a generic data type that describes a training data set, whose elements include: lat/long locations and/or polygon geometries, the classes or values associated with them, and (optionally) multidimensional time series values for each location. This data type needs to support both regression and classification, both for individual images and for image time series.
The parameter `ml_method` in the `train_model` function should be an abstract data type able to fit different kinds of ML/DL algorithms. One way of doing this is to use a generic function that provides `train` and `predict` functions.
The `model_type` abstract data type should be a closure, which is an object that stores a function together with an environment. The `model_type` ADT will contain captured variables that allow it to predict a new value given an input. By using closures, the `model_type` ADT will be completely generic; in this way, it will be much easier to extend openEO to include new models.
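A toy sketch of the closure idea (all names hypothetical; a real back-end would capture a fitted ML model rather than a threshold):

```python
# model_type as a closure: train() captures the fitted state in the
# enclosing environment and returns a predict function.
def train_threshold_classifier(samples, labels):
    # Toy "training": use the mean of class-1 samples as a threshold.
    ones = [s for s, l in zip(samples, labels) if l == 1]
    threshold = sum(ones) / len(ones)

    def predict(value):
        # The closure captures `threshold`; callers need no knowledge of
        # the model internals, so any kind of model fits this interface.
        return 1 if value >= threshold else 0

    return predict

model = train_threshold_classifier([1.0, 2.0, 10.0, 12.0], [0, 0, 1, 1])
# threshold is 11.0 here, so model(12.5) -> 1 and model(3.0) -> 0
```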
It is recommended to define the `predict` function to be applied to a `data_cube` instead of an `array`. Data cubes have a rigorous definition (e.g., Appel and Pebesma). When using image time series, ML/DL algorithms work best with data cubes that are regular in space and time. In most programming environments, `array` is a loosely defined data type: arrays can be sparse or have missing values, whereas data cubes need to be regular and dense in space and time.
The output of the `predict` function works best if it is a set of probability maps. The literature on EO image classification is consistent in recommending post-processing of ML/DL classification results. Including a `predict_label` function is not recommended, because in general the results will be worse than those obtained after post-processing.
The `smooth` function, as stated before, performs an important role in removing outliers. Its input would be a data cube with probability maps and a smoothing algorithm; its output would be a set of probability maps.
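For illustration, a minimal probability-map smoother (a plain 3x3 moving average over NumPy arrays; the choice of filter is hypothetical, since the proposal leaves the smoothing algorithm as a parameter):

```python
import numpy as np

def mean_filter(band):
    # Minimal 3x3 moving-average smoother for one probability band.
    padded = np.pad(band, 1, mode="edge")
    rows, cols = band.shape
    return sum(padded[i:i + rows, j:j + cols]
               for i in range(3) for j in range(3)) / 9.0

def smooth(probs_cube, smooth_function):
    # Apply the smoother band-wise: one band per class probability.
    return np.stack([smooth_function(probs_cube[..., k])
                     for k in range(probs_cube.shape[-1])], axis=-1)

# An isolated outlier pixel gets pulled toward its neighbours:
probs = np.zeros((3, 3, 1))
probs[1, 1, 0] = 0.9
smoothed = smooth(probs, mean_filter)      # centre value becomes 0.1
```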
The `label` function is the easy one: it takes the most likely class for each pixel (an argmax over the probability maps) and outputs a labelled map.
The API proposed above would be a minimal set of ML/DL classification functions for openEO. Extensions that could be considered later include:
(a) Measuring classification uncertainty. (b) Model tuning. (c) Active learning methods.
Full disclosure: this proposal is based on our 6-year experience in the development of SITS. All of the above functions (and extensions) are implemented and operational in SITS. In terms of openEO, this means that such an API for openEO will be readily implemented in a SITS back-end. Hopefully, this would motivate other openEO back-end developers to follow suit.
Dear @clausmichele
> I also agree with the proposal. We should move forward with the ML processes to keep openEO attractive! I also partly agree with Lukas, but I would say that for more advanced users and models we could still use UDFs.
Not sure I agree. UDFs are non-standard and should be discouraged, because they fragment the openEO landscape and thus would undermine the purpose of openEO which is to have a single interface supported by different back-ends.
Potentially interesting for "bring your own model": https://onnx.ai/
@JeroenVerstraelen is working on an implementation of CatBoost-based ML in the VITO backend, and while discussing details a couple of things came up:

- A `predict_catboost` process would be practically identical to `predict_random_forest`, except for some textual differences in title and descriptions. It turns out that it is not really necessary to define a dedicated `predict_*` process for each kind of machine learning model: all the model details are embedded in the `ml-model` object, so you could just use a single `predict(data: array, model: ml-model)` for all kinds of ML models.
- Class prediction and probabilities prediction follow two different usage patterns: one is used in `reduce_dimension` and the other in `apply_dimension`. It felt error prone and confusing to let these two different patterns depend on a rather inconspicuous boolean parameter. It might be better to have separate processes for class prediction and probabilities prediction.

So with this background, the proposal is to introduce two generic ML prediction processes:
- `predict_class(data: array, model: ml-model) -> number`
- `predict_probabilities(data: array, model: ml-model) -> array`
Both can be easily spec'ed based on the current https://github.com/Open-EO/openeo-processes/blob/draft/proposals/predict_random_forest.json
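As a rough illustration of why a single pair of generic processes suffices, here is a Python sketch in which the model object (a stand-in for openEO's `ml-model` type; all names hypothetical) embeds its own inference logic:

```python
class MLModel:
    # Stand-in for the ml-model object: bundles the model kind with
    # everything needed for inference.
    def __init__(self, kind, predict_fn):
        self.kind = kind              # e.g. "random-forest", "catboost"
        self._predict = predict_fn

def predict_probabilities(data, model):
    # One generic process for all model kinds: dispatch lives inside
    # the ml-model object, not in per-algorithm predict_* processes.
    return model._predict(data)

def predict_class(data, model):
    # predict_class = predict_probabilities + argmax over classes.
    probs = predict_probabilities(data, model)
    return max(range(len(probs)), key=probs.__getitem__)

rf = MLModel("random-forest", lambda data: [0.2, 0.8])
cb = MLModel("catboost", lambda data: [0.9, 0.1])
# The same two processes work for both model kinds:
# predict_class(x, rf) -> 1, predict_class(x, cb) -> 0
```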