fit_class_random_forest: max_variables parameter could additionally be of type string

Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.

https://processes.openeo.org

Apache License 2.0

48 stars 16 forks source link

fit_class_random_forest: max_variables parameter could additionally be of type string #355

Closed JeroenVerstraelen closed 2 years ago

JeroenVerstraelen commented 2 years ago

Describe the issue: In Pyspark and Sklearn max_variables appears to coincide with these parameters: Pyspark: featureSubsetStrategy Sklearn: max_features

Spark requires one of these strings [“auto”, “all”, “sqrt”, “log2”, “onethird”], while sklearn requires one of these strings [“auto”, “sqrt”, “log2”], an int or a float.

An integer seems like the best general type to support all libraries, but it might be beneficial for the user if we also support string types. That way they don't have to calculate e.g. the sqrt or log2 themselves. Is this possible in openEO?

m-mohr commented 2 years ago

So there's no way to implement the nuermical behavior defined right now in pyspark? I guess I need to do another crosswalk and check commonly used options so that we can find a good subset that everyone can implement.

JeroenVerstraelen commented 2 years ago

As far as I can see there is no support for numerical values in pyspark. We plan to automatically convert the provided integer to one of the possible strings in pyspark. If there is no associated string then we will return an invalid parameter error.

m-mohr commented 2 years ago

Library	Parameter	Pre-defined string options	Float for fraction	Integer for number of vars
Pyspark (Py)	featureSubsetStrategy	auto, all, sqrt, log2, onethird	No	No
Sklearn (Py)	max_features	auto, sqrt, log2	Yes	Yes
ranger (R)	mtry	sqrt	No	Yes
randomForest (R)	mtry	sqrt (for classification), onethird (for regression)	Yes	No
RandomForests (Fortran)	mtry0	sqrt	No	Yes
Vigra (C++)	features_per_node	all, sqrt, log	Yes	Yes

In openEO we can't distinguish between integers and floats, so can't allow both float and int separately. Seeing this table above, I'd propose adding string values (all, sqrt, log2, onethird) and ints for the number of vars, but to NOT provide a default value.

m-mohr commented 2 years ago

@JeroenVerstraelen Please review PR #351