Closed: JeroenVerstraelen closed this issue 2 years ago
So there's no way to implement the numerical behavior defined right now in pyspark? I guess I need to do another crosswalk and check commonly used options so that we can find a good subset that everyone can implement.
As far as I can see there is no support for numerical values in pyspark. We plan to automatically convert the provided integer to one of the available strings in pyspark. If there is no associated string, we will return an invalid parameter error.
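The planned conversion could be sketched roughly like this. Note this is a hypothetical helper, not the actual openEO implementation: the function name, the rounding behaviour, and the check order are all assumptions for illustration.

```python
import math


def int_to_subset_strategy(max_variables: int, num_features: int) -> str:
    """Map a user-provided integer to a Spark featureSubsetStrategy string.

    Hypothetical sketch: each pre-defined string is resolved to the integer
    it would select for `num_features` features, and the first match wins.
    """
    candidates = {
        "all": num_features,
        "sqrt": round(math.sqrt(num_features)),
        "log2": round(math.log2(num_features)),
        "onethird": round(num_features / 3),
    }
    for name, value in candidates.items():
        if max_variables == value:
            return name
    # No pre-defined string selects exactly this many variables.
    raise ValueError(
        f"max_variables={max_variables} has no featureSubsetStrategy "
        f"equivalent for {num_features} features"
    )
```

One consequence of this approach is that only integers that happen to coincide with one of the strings are representable, so most integer inputs would trigger the invalid parameter error.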
| Library | Parameter | Pre-defined string options | Float for fraction | Integer for number of vars |
|---|---|---|---|---|
| Pyspark (Py) | `featureSubsetStrategy` | auto, all, sqrt, log2, onethird | No | No |
| Sklearn (Py) | `max_features` | auto, sqrt, log2 | Yes | Yes |
| ranger (R) | `mtry` | sqrt | No | Yes |
| randomForest (R) | `mtry` | sqrt (for classification), onethird (for regression) | Yes | No |
| RandomForests (Fortran) | `mtry0` | sqrt | No | Yes |
| Vigra (C++) | `features_per_node` | all, sqrt, log | Yes | Yes |
In openEO we can't distinguish between integers and floats, so we can't allow float and int as separate options. Given the table above, I'd propose allowing the string values (all, sqrt, log2, onethird) plus an integer for the number of variables, but NOT providing a default value.
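To show that the proposed subset (strings plus an integer) is implementable on the sklearn side, here is a rough translation sketch. The function name and the mapping are assumptions; in particular, sklearn has no "onethird" string, so that case is assumed to be expressed as the fraction 1/3, and "all" as `max_features=None`.

```python
def to_sklearn_max_features(max_variables):
    """Translate a proposed openEO max_variables value to sklearn's
    max_features argument (hypothetical mapping, for illustration only).

    - integers pass through unchanged
    - "all" becomes None (sklearn's way of using every feature)
    - "onethird" becomes the fraction 1/3, since sklearn lacks that string
    """
    allowed = {"all": None, "sqrt": "sqrt", "log2": "log2", "onethird": 1 / 3}
    if isinstance(max_variables, int):
        return max_variables
    if max_variables in allowed:
        return allowed[max_variables]
    raise ValueError(f"Unsupported max_variables: {max_variables!r}")
```

A back-end could feed the result straight into `RandomForestClassifier(max_features=...)`; the pyspark side would need the reverse trick of converting integers to strings, as discussed above.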
@JeroenVerstraelen Please review PR #351
Describe the issue: In Pyspark and Sklearn, `max_variables` appears to coincide with the following parameters:
- Pyspark: `featureSubsetStrategy`
- Sklearn: `max_features`
Spark requires one of the strings ["auto", "all", "sqrt", "log2", "onethird"], while sklearn accepts one of the strings ["auto", "sqrt", "log2"], an int, or a float.
An integer seems like the best general type to support all libraries, but it might be beneficial for the user if we also support string types. That way they don't have to calculate e.g. the sqrt or log2 themselves. Is this possible in openEO?
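For reference, this is roughly the calculation a user would otherwise have to do by hand if only integers were supported. The function and its rounding are illustrative assumptions; each library applies its own rounding rules internally.

```python
import math


def resolve_strategy(strategy: str, n_features: int) -> int:
    """Illustrate what each pre-defined string would resolve to for a
    dataset with `n_features` features (assumed rounding, at least 1)."""
    table = {
        "all": n_features,
        "sqrt": max(1, round(math.sqrt(n_features))),
        "log2": max(1, round(math.log2(n_features))),
        "onethird": max(1, round(n_features / 3)),
    }
    return table[strategy]
```

Supporting the strings directly in the process definition would let the back-end do this per-dataset calculation instead of the user.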