automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Dealing with non float features #651

Closed mmaher22 closed 4 years ago

mmaher22 commented 5 years ago

I get an error when using datasets with non-float columns, as they can't be converted from the object dtype to float. So I'm wondering: is it possible to use a dataset with non-numeric features with auto-sklearn?

If not: I read in the paper that auto-sklearn supports preprocessing algorithms like one-hot encoding, imputation of missing values, etc.

So, in this case, should I just hash each string value in a feature to a corresponding float number?

Also, what about having some missing values? How should I convert them if auto-sklearn can deal with missing attributes?

mfeurer commented 5 years ago

So, I'm wondering: is it possible to use a dataset with non-numeric features with auto-sklearn?

No, they need to be transformed as for scikit-learn.

So, in this case, should I just hash each string value in a feature to a corresponding float number?

For example. Basically, you need to replace categories with integers, and perform some encoding like bag-of-words or TF-IDF for string features.
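A minimal scikit-learn sketch of both transformations (the column names and data here are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one categorical column and one free-text column.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "description": ["fast car", "slow bike", "fast bike", "slow car"],
})

# Replace categories with integers.
color_codes = OrdinalEncoder().fit_transform(df[["color"]])  # shape (4, 1)

# Encode free-text strings with TF-IDF.
text_features = TfidfVectorizer().fit_transform(df["description"]).toarray()

# Stack everything into one all-float matrix that auto-sklearn accepts.
X = np.hstack([color_codes, text_features])
print(X.shape)  # (4, 1 + vocabulary size), here (4, 5)
```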

Also, what about having some missing values? How should I convert them if auto-sklearn can deal with missing attributes?

The pipeline contains an imputer, so you should just pass the missing values as np.NaN.
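In other words, you can leave gaps as np.NaN and pass the matrix straight to fit(). For illustration, the effect of the built-in imputer is roughly that of scikit-learn's SimpleImputer (used here only as a stand-in, not auto-sklearn's actual internals):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Leave missing entries as np.nan; auto-sklearn's pipeline imputes them
# internally. The effect is roughly that of SimpleImputer, shown here:
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])
imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(imputed)  # column means fill the gaps
```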

mmaher22 commented 5 years ago

Thanks very much

manugarri commented 5 years ago

@mfeurer correct me if I'm wrong, but the auto-sklearn documentation explicitly says that if you pass the feat_types argument, it will automatically one-hot encode the categorical features. Is that not the case?

The reason I'm asking is that I'm running the latest version of auto-sklearn (0.5.2) and I'm still getting the same issue that @mmaher22 has.

EDIT: I was checking the project's source; is it possible that by "one-hot encoding" the docs mean one-hot encoding "categorical variables encoded as numbers"?

mfeurer commented 5 years ago

EDIT: I was checking the project's source; is it possible that by "one-hot encoding" the docs mean one-hot encoding "categorical variables encoded as numbers"?

Apparently, scikit-learn by now allows using strings directly. We have not updated our code to cope with this change; therefore, your finding that auto-sklearn only works with "categorical variables encoded as numbers" is correct.

I'm reopening this issue to remind us to make auto-sklearn work with strings representing categorical data.

manugarri commented 5 years ago

Yeah, I think that since there is already a one-hot encoder automatically added for categoricals, it would be easy to replace it with either a pure scikit-learn pipeline of LabelEncoder + OneHotEncoder, or category_encoders' OneHotEncoder.
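For reference, newer scikit-learn versions let OneHotEncoder consume string columns directly, so the LabelEncoder step isn't even needed; a minimal stand-alone sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One-hot encode a string categorical column directly; categories are
# discovered from the data and sorted alphabetically (blue, green, red).
X = np.array([["red"], ["green"], ["blue"], ["red"]])
X_onehot = OneHotEncoder().fit_transform(X).toarray()
print(X_onehot)
```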

mfeurer commented 4 years ago

Okay, auto-sklearn is now able to handle non-float features as part of a pandas DataFrame, as demonstrated in the following example: https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_pandas_train_test.html#sphx-glr-examples-40-advanced-example-pandas-train-test-py
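A minimal sketch of that route (the exact column names and settings in the linked example differ; the data here is invented): string columns marked with the pandas "category" dtype are treated as categorical, and missing entries stay as np.nan.

```python
import numpy as np
import pandas as pd

# Hypothetical data: mark string columns with the 'category' dtype so
# auto-sklearn treats them as categorical; leave missing values as np.nan.
X = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "city": pd.Series(["paris", "tokyo", "paris", "lima"], dtype="category"),
})
y = pd.Series([0, 1, 0, 1])

# This frame can then be passed directly, e.g.:
#   automl = autosklearn.classification.AutoSklearnClassifier()
#   automl.fit(X, y)
print(X.dtypes["city"])  # category
```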