Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 15 forks source link

Default for `max_variables` in `fit_class_random_forest` and `fit_regr_random_forest` #365

Open soxofaan opened 2 years ago

soxofaan commented 2 years ago

I just noticed this inconsistency:

The process specs of fit_class_random_forest and fit_regr_random_forest do no have a default value for the max_variables parameter. e.g. https://github.com/Open-EO/openeo-processes/blob/589bfd29667f8bfa4dd40992ca4f5edcb2d753d2/proposals/fit_class_random_forest.json#L27-L44

In the python client we apparently have default null with behavior defined as:

Default value is null, which corresponds to the number of predictors divided by 3.

And in the VITO back-end implementation of fit_class_random_forest we even don't use the max_variables yet, and just use the default ("auto") behavior of Spark MlLib's RandomForest.

Obviously, there are a couple of things to be resolved. For the process spec of fit_class_random_forest and fit_regr_random_forest I would propose to allow a default value (null) for the max_variables parameter, with the behavior "back-end is free to chose the best strategy".

soxofaan commented 2 years ago

related tickets:

m-mohr commented 2 years ago

Just to clarify, the proposal was meant to say: Make it required (and as such have no default value, so the user needs to choose). The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.

Required and without default value is how the processes are defined right now. So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?

soxofaan commented 2 years ago

The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.

You have a reproducibility problem anyway because of these wildly different implementations. Requiring the user to specify a max_variable makes it even less reproducible because they have to change the value each time they change back-end.

I think a ML/AI user is aware of the multitude of degrees of freedom in ML tech and is not going to expect reproducible models or results when switching to a different back-end with different technology. But by requiring to set max_variables even their code/script is not reproducible (because whatever they chose might not work on some back-end). By allowing default null, they can at least make the running of their code or algorithm reproducible or interoperable.

Likewise, it's is or will be annoying for documentation/demo/tutorial purposes if we can't give a generic fit_class_random_forest example snippet that will work without an interoperability disclaimer about a detail like max_variables.

So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?

The inconsistency I want to refer to is between the spec in openeo-processes and the actual implementations as they current are.

soxofaan commented 2 years ago

it seems weird to have a default that does some kind of a black box thing.

To turn that around: each random forest implementation already has a default for their max_variables equivalent, so wouldn't it be weird that we don't?

I understand that you're worried that the behavior for max_variables=null is potentially different among back-ends, but as as user, when I "use" a default (e.g. by omitting an argument) I implicitly say "I trust the system to pick a good behavior for me and I don't care about those details". As a user I appreciate this feature not only in an actual RF model implementation (e.g. Spark MlLib or sklearn) but also on a higher or more abstract level like openEO processes.

soxofaan commented 2 years ago

My argument in short: in this case I think user friendliness and interoperability (of code) is more important than reproducibility of results (which is practically impossible here anyway I'm afraid)

m-mohr commented 2 years ago

Okay, but should we add a default value or remove it completely? #358

soxofaan commented 2 years ago

Well, I proposed to drop it in #358 to have a quick solution so that implementers could proceed in function of the SRR deadlines, with the idea to reintroduce it again at a later point once we have a better idea how to tackle the interoperability issues (string versus int, enum values, defaults, ...). While the time pressure is different now, I still think we can first drop it and reintroduce it later, unless you think that is silly or overkill.

To give an implementer's data point: in the VITO backend we practically ignore the max_variables parameter at the moment and just use Spark MlLib's default