Open soxofaan opened 2 years ago
related tickets:
max_variables parameter for now (until we find a good common ground for all implementations we want to support)

I'd propose adding ..., but to NOT provide a default value.
Just to clarify, the proposal was meant to say: Make it required (and as such have no default value, so the user needs to choose). The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.
Required and without default value is how the processes are defined right now. So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?
The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.
You have a reproducibility problem anyway because of these wildly different implementations. Requiring the user to specify a max_variables value makes it even less reproducible, because they have to change the value each time they switch back-ends.
I think an ML/AI user is aware of the multitude of degrees of freedom in ML tech and is not going to expect reproducible models or results when switching to a different back-end with different technology. But by requiring them to set max_variables, even their code/script is not reproducible (because whatever value they chose might not work on some back-end). By allowing a default of null, they can at least make the running of their code or algorithm reproducible or interoperable.
Likewise, it is or will be annoying for documentation/demo/tutorial purposes if we can't give a generic fit_class_random_forest example snippet that works without an interoperability disclaimer about a detail like max_variables.
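To illustrate that interoperability point: with a default in place, a generic process-graph node for fit_class_random_forest could simply omit max_variables and run unchanged on any back-end. A minimal sketch (the argument names besides max_variables are assumptions here, not final spec text):

```python
# Hypothetical process-graph node for fit_class_random_forest.
# Because "max_variables" is omitted, no back-end-specific tuning
# is needed; each back-end would fall back to its own default.
process_node = {
    "process_id": "fit_class_random_forest",
    "arguments": {
        "predictors": {"from_parameter": "training_cube"},
        "target": {"from_parameter": "target_labels"},
        # "max_variables" deliberately omitted: the back-end chooses.
    },
}
```

Such a snippet would work as-is in tutorials, without a per-back-end disclaimer.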
So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?
The inconsistency I want to refer to is between the spec in openeo-processes and the actual implementations as they current are.
it seems weird to have a default that does some kind of a black box thing.
To turn that around: each random forest implementation already has a default for its max_variables equivalent, so wouldn't it be weird if we didn't?
I understand that you're worried that the behavior for max_variables=null is potentially different among back-ends, but as a user, when I "use" a default (e.g. by omitting an argument) I implicitly say: "I trust the system to pick a good behavior for me and I don't care about those details."
As a user I appreciate this feature not only in an actual RF model implementation (e.g. Spark MLlib or sklearn) but also at a higher or more abstract level like openEO processes.
My argument in short: in this case I think user friendliness and interoperability (of code) is more important than reproducibility of results (which is practically impossible here anyway, I'm afraid).
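To make the "I trust the system" semantics concrete, a back-end could resolve an omitted/null max_variables to its engine's own default. A rough sketch of such a helper (the function itself is hypothetical and not any actual back-end's code; the "auto"/"sqrt"/"log2" strategy strings are real Spark MLlib featureSubsetStrategy values):

```python
def resolve_feature_subset_strategy(max_variables):
    """Hypothetical back-end helper: map the openEO max_variables
    argument to a Spark MLlib featureSubsetStrategy string."""
    if max_variables is None:
        # Proposed default semantics: back-end is free to choose;
        # here we delegate to Spark MLlib's own default ("auto").
        return "auto"
    if isinstance(max_variables, int):
        # Spark MLlib accepts an explicit variable count as a string.
        return str(max_variables)
    # Pass through recognized strategy names such as "sqrt" or "log2".
    return str(max_variables)
```

With null mapped to the engine default, the same user script runs everywhere, while explicit values are still honored.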
Okay, but should we add a default value or remove it completely? #358
Well, I proposed to drop it in #358 as a quick solution so that implementers could proceed given the SRR deadlines, with the idea of reintroducing it at a later point once we have a better idea of how to tackle the interoperability issues (string versus int, enum values, defaults, ...). While the time pressure is different now, I still think we can first drop it and reintroduce it later, unless you think that is silly or overkill.
To give an implementer's data point: in the VITO back-end we practically ignore the max_variables parameter at the moment and just use Spark MLlib's default.
I just noticed this inconsistency:
The process specs of fit_class_random_forest and fit_regr_random_forest do not have a default value for the max_variables parameter, e.g. https://github.com/Open-EO/openeo-processes/blob/589bfd29667f8bfa4dd40992ca4f5edcb2d753d2/proposals/fit_class_random_forest.json#L27-L44

In the Python client we apparently have default null, with behavior defined as:

And in the VITO back-end implementation of fit_class_random_forest we don't even use max_variables yet, and just use the default ("auto") behavior of Spark MLlib's RandomForest.
Obviously, there are a couple of things to be resolved. For the process specs of fit_class_random_forest and fit_regr_random_forest I would propose to allow a default value (null) for the max_variables parameter, with the behavior "back-end is free to choose the best strategy".
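Concretely, the proposed parameter entry could look roughly as follows, shown here as an equivalent Python dict mirroring the JSON structure used in openeo-processes (the description wording and the simplified schema are suggestions, not final spec text):

```python
# Sketch of the proposed max_variables parameter entry
# (a Python-dict rendering of the JSON spec; simplified assumption).
max_variables_param = {
    "name": "max_variables",
    "description": (
        "Specifies how many split variables will be used at a node. "
        "If null, the back-end is free to choose the best strategy."
    ),
    "optional": True,
    "default": None,  # serialized as `null` in the JSON spec
    "schema": [
        {"type": "integer", "minimum": 1},
        {"type": "null"},
    ],
}
```

The key change versus the current spec is `"optional": True` with `"default": None`, so that omitting the argument is valid and leaves the choice to the back-end.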