EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0
9.66k stars 1.56k forks

Question - Support for different types of categorical variable encoding #1237

Open SSMK-wq opened 2 years ago

SSMK-wq commented 2 years ago

Hi,

Does TPOT offer an automated way to convert categorical features into encoded variables?

Context of the issue

I have an input dataset with more than 100 variables where around 80% of the variables are categorical in nature.

Some variables, such as gender and country, can be one-hot encoded, but a few others have an inherent order in their values, such as a rating of very good, good, or bad.

Is there an approach or option in TPOT for choosing the encoding based on the variable type?

For example, I would like to pass the two lists below as arguments to TPOT:

one_hot_list = ['Gender', 'Country']  # one-hot encoding
ordinal_list = ['Feedback', 'Level_of_interest']  # ordinal encoding

Is there any option in the package that can do this for us?

Or is there another efficient way to do this, given that I have 80 categorical columns?
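
Nothing in this thread confirms a built-in TPOT option for per-column encodings. One common approach outside TPOT is to encode the columns with scikit-learn first and pass the resulting numeric array on. The sketch below assumes that approach. The toy data, the category orderings, and the names `ordinal_categories` and `encoder` are made up for illustration. Only the two column lists come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data: column names follow the question, the values are invented.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Country": ["IN", "US", "IN", "DE"],
    "Feedback": ["bad", "good", "very good", "good"],
    "Level_of_interest": ["low", "high", "medium", "low"],
})

one_hot_list = ["Gender", "Country"]               # one-hot encoding
ordinal_list = ["Feedback", "Level_of_interest"]   # ordinal encoding

# Assumed orderings (lowest to highest), one list per ordinal column,
# in the same order as ordinal_list.
ordinal_categories = [
    ["bad", "good", "very good"],
    ["low", "medium", "high"],
]

encoder = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), one_hot_list),
        ("ordinal", OrdinalEncoder(categories=ordinal_categories), ordinal_list),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

# One-hot columns come first, followed by the two ordinal columns.
X_encoded = encoder.fit_transform(df)
print(X_encoded.shape)

# X_encoded is purely numeric, so it could then be handed to TPOT, e.g.:
# from tpot import TPOTClassifier
# TPOTClassifier(generations=5, population_size=20).fit(X_encoded, y)
```

With 80 columns, the two lists can be built once (for example from a data dictionary) and reused. The ordinal orderings still have to be written by hand, because no encoder can infer that "good" ranks above "bad".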

fjpa121197 commented 2 years ago

Hi @SSMK-wq,

Did you find a workaround for this? I don't see any documentation saying that TPOT handles encoding of categorical features, or lets you choose a predefined encoding such as ordinal versus one-hot.

spenceforce commented 2 years ago

Bumping this, as it would be nice to be able to pass categorical features to TPOT. TPOT includes OneHotEncoder in its default estimator set for regression, but as it stands it is only usable with integers. I see that the fit method throws an error at an np.isnan call, though I'm sure there is more to it than changing that.
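
For context on the np.isnan error: NumPy's `isnan` is not defined for string or object arrays, so calling it on a string-valued column raises a `TypeError`. The sketch below shows that general NumPy behavior. That this is the exact code path inside TPOT's OneHotEncoder is an assumption, and the variable names are illustrative only.

```python
import numpy as np

# A string-valued categorical column, stored as an object array
# (the usual result of converting a pandas text column to NumPy).
categories = np.array(["good", "bad", "very good"], dtype=object)

try:
    np.isnan(categories)
    isnan_failed = False
except TypeError as exc:
    isnan_failed = True
    print(f"np.isnan rejected the column: {exc}")

# The same values expressed as integer codes pass the check.
codes = np.array([1, 0, 2])
has_nan = bool(np.isnan(codes).any())
print(f"integer codes contain NaN: {has_nan}")
```

This is consistent with the observation that the encoder is only usable with integers: mapping categories to integer codes first avoids the failing check.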