Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 16 forks source link

fit_class_random_forest: max_variables parameter could additionally be of type string #355

Closed JeroenVerstraelen closed 2 years ago

JeroenVerstraelen commented 2 years ago

Describe the issue: In Pyspark and Sklearn max_variables appears to coincide with these parameters: Pyspark: featureSubsetStrategy Sklearn: max_features

Spark requires one of these strings [“auto”, “all”, “sqrt”, “log2”, “onethird”], while sklearn requires one of these strings [“auto”, “sqrt”, “log2”], an int or a float.

An integer seems like the best general type to support all libraries, but it might be beneficial for the user if we also support string types. That way they don't have to calculate e.g. the sqrt or log2 themselves. Is this possible in openEO?

m-mohr commented 2 years ago

So there's no way to implement the nuermical behavior defined right now in pyspark? I guess I need to do another crosswalk and check commonly used options so that we can find a good subset that everyone can implement.

JeroenVerstraelen commented 2 years ago

As far as I can see there is no support for numerical values in pyspark. We plan to automatically convert the provided integer to one of the possible strings in pyspark. If there is no associated string then we will return an invalid parameter error.

m-mohr commented 2 years ago
Library Parameter Pre-defined string options Float for fraction Integer for number of vars
Pyspark (Py) featureSubsetStrategy auto, all, sqrt, log2, onethird No No
Sklearn (Py) max_features auto, sqrt, log2 Yes Yes
ranger (R) mtry sqrt No Yes
randomForest (R) mtry sqrt (for classification), onethird (for regression) Yes No
RandomForests (Fortran) mtry0 sqrt No Yes
Vigra (C++) features_per_node all, sqrt, log Yes Yes

In openEO we can't distinguish between integers and floats, so can't allow both float and int separately. Seeing this table above, I'd propose adding string values (all, sqrt, log2, onethird) and ints for the number of vars, but to NOT provide a default value.

m-mohr commented 2 years ago

@JeroenVerstraelen Please review PR #351