BugsBuggy closed this issue 2 years ago
IIRC, the fANOVA library only works (under the hood) with numeric values. If you are working with categorical variables, I think the best way is to encode them one way or another, e.g., by replacing each categorical value with its index.
Here is some example code showing how I worked around this: https://github.com/janvanrijn/openml-pimp/blob/master/examples/experiments/run_pimp_on_arff.py
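The index encoding described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from the linked script: the function name `encode_categorical`, the `kernel` column, and its choices are all hypothetical. The values are cast to float because the low-level library warns that it expects float input.

```python
def encode_categorical(rows, choices_by_column):
    """Replace each string category with the index of that value in its choices list.

    rows: list of dicts, one per configuration (hypothetical layout).
    choices_by_column: maps a categorical column name to its ordered choices.
    """
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column, choices in choices_by_column.items():
            # list.index raises ValueError for a value that is not a known choice
            new_row[column] = float(choices.index(row[column]))
        encoded.append(new_row)
    return encoded


# Hypothetical data: 'kernel' is categorical, 'C' is already numeric.
choices_by_column = {"kernel": ["linear", "rbf", "sigmoid"]}
rows = [
    {"kernel": "rbf", "C": 1.0},
    {"kernel": "linear", "C": 10.0},
]
encoded_rows = encode_categorical(rows, choices_by_column)
```

Keeping the order of each choices list fixed matters here, since the same order has to be used wherever the categories are declared.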
Thanks a lot, this solved the initial problem. Still, `nan` values do not seem to be accepted, as they lead to the same error as above. Converting them to integers is valid for categorical variables but leads to errors for numerical ConfigSpace variables, because the replacement value may not fall between the lower and upper bounds (e.g. for the replacement `nan` --> `-math.inf`). Any suggestions on how to deal with them?
That depends entirely on the situation (and probably goes beyond the topic of a GitHub issue).
Can you give some examples where you have missing values? Do they have to do with conditional hyperparameters?
Yes, `nan` values appear because there are conditions in the search space. For example, I train a model with different techniques, and the choice of technique is itself a categorical hyperparameter.
Let's say each training technique has different hyperparameters to set, which are only defined if that technique is used:
e.g. one training technique may require setting
samples_int = h.UniformIntegerHyperparameter('n_samples', lower=1, upper=10000, log=True)
some_float = h.UniformFloatHyperparameter('some_float', lower=1.0, upper=100.0, log=True)
`samples_int` and `some_float` will have `nan` values in some rows if another technique is used. Conditions like this may occur in several places in the search space, so we cannot just partition the data by technique.
A simple way I thought of is to extend the lower or upper bounds and set the `nan` values to one of the new limits. However, this is problematic for log-scaled values and may impact the interpretability of the fANOVA results in a negative way.
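To make the bound-extension idea concrete, here is a rough sketch in plain Python. The helper `impute_with_extended_lower` is hypothetical, and the choice of sentinel (one unit below the bound on a linear scale, one decade below on a log scale) is an arbitrary assumption, not something fANOVA or ConfigSpace prescribes.

```python
import math


def impute_with_extended_lower(values, lower, log=False):
    """Map NaN to a new, extended lower bound and return (imputed, new_lower).

    The sentinel placement is an arbitrary illustrative choice.
    """
    if log:
        # Stay positive so the value remains valid on a log scale.
        new_lower = lower / 10.0
    else:
        new_lower = lower - 1.0
    imputed = [new_lower if math.isnan(v) else v for v in values]
    return imputed, new_lower


# Hypothetical column for 'some_float' (lower=1.0, upper=100.0, log=True).
some_float_values = [5.0, float("nan"), 100.0]
imputed, new_lower = impute_with_extended_lower(some_float_values, lower=1.0, log=True)
```

The hyperparameter definition would then need the same extended lower bound. For a log-scaled integer hyperparameter with `lower=1`, this sentinel is no longer an integer, which is one instance of the problem mentioned above.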
It is not entirely clear how functional ANOVA should work with the concept of conditional hyperparameters. The original papers do not really describe this situation either. In the Hyperparameter Importance Across Datasets paper, we solved this by treating the conditional hyperparameters (such as SVM kernel) as three different algorithms. For this specific situation, that seemed like a reasonable choice. In other situations, a different choice should or could be made.
I understood that you are using functional ANOVA in the context of your Master's thesis? In that case it would probably be good to discuss this matter with your thesis advisor. There is probably a lot of room for research in finding out how functional ANOVA should handle this.
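Splitting the data by the value of a parent hyperparameter, so that each subset can be analysed separately, might look like the sketch below. It is an illustration in plain Python with hypothetical names (`split_by_parent`, `kernel`, `gamma`), not the code used for the paper.

```python
import math
from collections import defaultdict


def is_missing(value):
    """True for float NaN, which marks an inactive conditional hyperparameter."""
    return isinstance(value, float) and math.isnan(value)


def split_by_parent(rows, parent):
    """Group rows by the parent's value and drop the parent and inactive columns."""
    groups = defaultdict(list)
    for row in rows:
        active = {
            name: value
            for name, value in row.items()
            if name != parent and not is_missing(value)
        }
        groups[row[parent]].append(active)
    return dict(groups)


# Hypothetical runs: 'gamma' is only defined when kernel == 'rbf'.
rows = [
    {"kernel": "linear", "C": 1.0, "gamma": float("nan")},
    {"kernel": "rbf", "C": 2.0, "gamma": 0.1},
]
groups = split_by_parent(rows, parent="kernel")
```

Each group then contains only fully defined hyperparameters and could be passed to a separate analysis. This only works when the conditions hang off a single parent; nested or overlapping conditions would need a different treatment.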
Thanks a lot, yes you understood it correctly and this makes sense! I was missing the fact that conditional hyperparameters are not (yet) easy to handle with fANOVA.
Hello everyone, I'm trying to run fANOVA on mixed-type data with UniformInteger, Categorical and UniformFloat hyperparameters. The ConfigSpace matches the DataFrame columns, and all bounds and types of features are set correctly.
My DataFrame contains strings as categorical variables. It is not mentioned anywhere how to deal with categorical variables other than by specifying a ConfigSpace.
I get the following warning:
WARNING:root:low level library expects X argument to be float
and the following error:
The error can be traced back to the following low-level library call: https://github.com/automl/fanova/blob/7b58914b1c38ee60a173b260fc4bb269aaf8acb1/fanova/fanova.py#L171
Can the library deal with string categories in the DataFrame other than converting them to integers?
As a next step, if I convert the string categories to numerical variables, the `choices` in ConfigSpace will not match the converted variables. Do the ConfigSpace `choices` have to be adapted to the numerical categories? It would be helpful to specify things like this in the documentation.