Dealing with categorical string variables

BugsBuggy commented 2 years ago

Hello everyone, I'm trying to run fANOVA on mixed-type data with UniformInteger, Categorical and UniformFloat hyperparameters. The ConfigSpace matches the DataFrame columns, and all bounds and types of the features are set correctly.

My DataFrame contains strings as categorical variables. The documentation does not mention how to deal with categorical variables beyond specifying a ConfigSpace.

I get the following warning: WARNING:root:low level library expects X argument to be float

and the following error: fanova_error

The error can be traced back to the following low-level library call: https://github.com/automl/fanova/blob/7b58914b1c38ee60a173b260fc4bb269aaf8acb1/fanova/fanova.py#L171

Can the library deal with string categories in the DataFrame other than converting them to integers?

As a next step, if I convert the string categories to numerical variables the choices in ConfigSpace will not match the converted variables. Do the ConfigSpace choices have to be adapted to the numerical categories? It would be helpful to specify things like this in the documentation.

janvanrijn commented 2 years ago

IIRC, the fANOVA library works (under the hood) only with numeric values. If you are working with categorical variables, I think the best way is to encode them in one way or another, e.g., by putting in the index of the categorical value.

Here is some example code showing how I worked around this: https://github.com/janvanrijn/openml-pimp/blob/master/examples/experiments/run_pimp_on_arff.py
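A minimal sketch of the index encoding described above, in plain Python. The hyperparameter name `technique`, its choices, and the helper `encode_row` are hypothetical; the sketch assumes the index is taken from the order in which the choices are listed in the ConfigSpace, and it does not settle whether the ConfigSpace choices themselves also need to be changed.

```python
# Hypothetical choices, listed in the same order as in the ConfigSpace
# definition of the categorical hyperparameter (an assumption).
choices = {"technique": ["sgd", "adam", "rmsprop"]}


def encode_row(row, choices):
    """Return the row as a list of floats.

    Categorical strings are replaced by their index in the choices list;
    all other values are cast to float.
    """
    encoded = []
    for name, value in row.items():
        if name in choices:
            encoded.append(float(choices[name].index(value)))
        else:
            encoded.append(float(value))
    return encoded


rows = [
    {"technique": "adam", "n_samples": 100, "some_float": 2.5},
    {"technique": "sgd", "n_samples": 10, "some_float": 50.0},
]

# Purely numeric feature matrix, one list per configuration.
X = [encode_row(row, choices) for row in rows]
```

Casting every column to float also addresses the "expects X argument to be float" warning, since no string or integer column is left in the matrix.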

BugsBuggy commented 2 years ago

Thanks a lot, this solved the initial problem. Still, nan values do not seem to be accepted, as they lead to the same error as above. Replacing them with integers is valid for categorical variables but leads to errors for numerical ConfigSpace variables, because the replacement may not fall between the lower and upper bounds (e.g. for the replacement nan --> -math.inf). Any suggestions on how to deal with them?

janvanrijn commented 2 years ago

That depends entirely on the situation (and probably goes beyond the topic of a github issue).

Can you give some examples where you have missing values? Do they have to do with conditional hyperparameters?

BugsBuggy commented 2 years ago

Yes, nan values appear because there are conditions in the search space. For example, I train a model with different techniques, and the technique is itself a categorical hyperparameter.

Let's say each training technique has different hyperparameters, which are only defined if that technique is used. For example, one training technique may require setting:

    samples_int = h.UniformIntegerHyperparameter('n_samples', lower=1, upper=10000, log=True)
    some_float = h.UniformFloatHyperparameter('some_float', lower=1.0, upper=100.0, log=True)

samples_int and some_float will have nan values in the rows where another technique is used. Such conditions may appear in several places, so we cannot just partition by the different techniques.

A simple way I thought of is to extend the lower or upper bounds and set the nan values to one of the new limits. However, this is problematic for log-scaled values and may impact the interpretability of the fANOVA results in a negative way.

janvanrijn commented 2 years ago

It is not entirely clear how functional ANOVA should work with the concept of conditional hyperparameters. The original papers do not really describe this situation either. In the Hyperparameter Importance Across Datasets paper, we solved this by treating the conditional hyperparameters (such as SVM kernel) as three different algorithms. For this specific situation, that seemed like a reasonable choice. In other situations, a different choice should or could be made.
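A rough sketch of that partitioning idea, in plain Python: rows are grouped by the value of the parent hyperparameter, and hyperparameters that are inactive (nan) within a group are dropped, so that each group could be analysed on its own with a reduced search space. The helper names `is_missing` and `split_by_parent` and the example values are hypothetical, and this is not the code used for the paper. It also only handles a single level of conditions; nested conditions would need a different treatment.

```python
import math
from collections import defaultdict


def is_missing(value):
    """True if the value is a float nan, i.e. an inactive hyperparameter."""
    return isinstance(value, float) and math.isnan(value)


def split_by_parent(rows, parent):
    """Group rows by the parent hyperparameter and drop inactive columns.

    Returns a dict mapping each parent value to a list of rows that only
    contain the hyperparameters defined in every row of that group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[parent]].append(row)

    result = {}
    for value, members in groups.items():
        active = [
            name
            for name in members[0]
            if name != parent and not any(is_missing(r[name]) for r in members)
        ]
        result[value] = [{name: r[name] for name in active} for r in members]
    return result


nan = float("nan")
rows = [
    {"technique": "a", "n_samples": 10.0, "some_float": nan, "lr": 0.1},
    {"technique": "b", "n_samples": nan, "some_float": 2.0, "lr": 0.2},
    {"technique": "a", "n_samples": 20.0, "some_float": nan, "lr": 0.3},
]

parts = split_by_parent(rows, "technique")
```

Each entry of `parts` is free of nan values, so it could be passed to a separate analysis together with a ConfigSpace that contains only the remaining hyperparameters.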

I understand that you are using functional ANOVA in the context of your Master's thesis? In that case it would probably be good to discuss this matter with your thesis advisor. There is probably a lot of room for research in finding out how functional ANOVA should handle this.

BugsBuggy commented 2 years ago

Thanks a lot, yes you understood it correctly and this makes sense! I was missing the fact that conditional hyperparameters are not (yet) easy to handle with fANOVA.

automl / fanova

Dealing with categorical string variables #107