Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
49 stars 14 forks source link

ML Generic API #497

Open PondiB opened 4 months ago

PondiB commented 4 months ago

I saw the need for this as an alternative for #441 , if we keep following the pattern of *_mlfit_** as #484 , #483 , #493 and so forth. We will end up with a long list of ml processes.

Getting inspiration from tools like caret in R ecosystem. We can ideally achieve a minimalistic API.

Labels can be used to identify a classification and regression problem.

Parameters for various ml models can be passed as dictionaries: e.g.

  1. Random Forest : ml_fit(predictor, label, ml_method ="random_forest", params = {"num_trees":200, "max_variables":"sqrt", "seed":42})
  2. Multi-layer Perceptron classifier ml_fit(predictor, label, ml_method = "mlp", params = {"hidden_layers": [ 512, 512, 512], "dropout_rates": [0.4, 0.3, 0.2], "epochs": 100, "activation_function": "relu", "optimizer": "adam"}

Well, there will be need a for clear documentation dedicated to ML on the openEO website.

Additionally, a need for 2 more methods on the client side list_ml_models() for listing the available models and model_param_blueprint(model_name, task_type="classification") for listing the default params of the models. The information should ideally be on the website too.

Opinions, suggestions and critics are welcomed as I am still exploring this.

soxofaan commented 4 months ago

We can ideally achieve a minimalistic API.

I'm not sure that that should be the goal. You indeed have a small API at the level of the processes (just one process ml_fit), but you pay the cost in other places: those list_ml_models() and model_param_blueprint() have to be defined, you need to provide schema definition/validation for the params of each model, all clients have to start supporting that, all back-ends have to start implementing that, ... It feels like a lot of reinventing things we already have in our current ecosystem of processes. I'm not sure there is a net win here.

We will end up with a long list of ml processes

I don't think that is a problem on its own. The list of pre-defined processes is also long, but that is fine.

PondiB commented 4 months ago

I'm not sure that that should be the goal.

Thanks for your solid comments that cannot be overlooked, it reminds me of this article ; Simple is not easy. There will be an initial Engineering overhead for sure as you have pointed out. My goal for this was more targeted to the general public/users i.e. Greater accessibility and ease of use.

I don't think that is a problem on its own. The list of pre-defined processes is also long, but that is fine.

On my end, It does not mean we should have 20+ processes for ML/DL.

soxofaan commented 4 months ago

My goal for this was more targeted to the general public/users i.e. Greater accessibility and ease of use.

Even for the user I think it will be worse.

e.g. with the python client, the current way of working would result in something like

ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

Which is easy/simple and highly accessible I would think: it just uses the basics of how functions are called, everything is documented the normal way, not only the available arguments and their defaults, but also basics of random forest, ...

the proposed approach would result in something like

ml_fit(predictors=cube1, labels=cube2, ml_method="random_forest", params = {"num_trees":200, "max_variables":"sqrt", "seed":42})

This is less easy/simple for the user: the params is now a generic and hard to document thing and heavily depends on ml_method, ml_fit has to explain all possible ML methods, there is no reasonable default for ml_method, ...

PondiB commented 4 months ago

This is less easy/simple for the user: the params is now a generic and hard to document thing and heavily depends on ml_method, ml_fit has to explain all possible ML methods, there is no reasonable default for ml_method, ...

Fair enough. I will continue with the specific APIs. Now, one last thought, for classification and regression tasks if we are limiting the number of params to tune and they are similar then there is no need to have 2 APIs for the categories. i.e.

# classification
ml_fit_class_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)
# regression
ml_fit_regr_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

the 2 above can be replaced with

# classification or regression
ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

Therefore internally it can be determined from the labels if this is a classification or a regression task. Do you foresee any bottlenecks with this?

soxofaan commented 4 months ago

Therefore internally it can be determined from the labels if this is a classification or a regression task

I'm not sure this is a safe assumption. I can imagine there are users that use e.g. integer values as classification labels.

PondiB commented 1 month ago

I'm not sure this is a safe assumption. I can imagine there are users that use e.g. integer values as classification labels.

I was on vacation for over a month during this time, so I left this discussion on hold. I'll look into your point.

Anyway, regarding the parameters, I think we should not separate the predictor and target. Instead, we can have a "training-data" parameter that includes both predictors and targets. A target variable should just specify the column name of the targets/labels within the training data.

FYI: An example training data: training-data