ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

[IDEA] Chapter 8 - Reduce dimensionality for datasets with categorical features #411

Open marcio1191 opened 3 years ago

marcio1191 commented 3 years ago

Hello, the book doesn't cover how to handle dimensionality reduction for datasets that also contain categorical features. How would you handle these situations? Thank you in advance. Regards

ageron commented 3 years ago

Hi @marcio1191, That's an interesting question! You could simply one-hot encode the categorical features and apply the dimensionality reduction algorithm after that. If you're training a neural network, you could apply the dimensionality reduction algorithm on the dataset excluding the categorical features, then add the categorical features as trainable embeddings. Hope this helps.
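Rough sketches of both suggestions follow. All column names, data, layer sizes and the choice of TruncatedSVD are illustrative assumptions, not from the book or this thread. First, one-hot encode the categorical columns, scale the numerical ones, then reduce dimensionality:

```python
# Sketch only: one-hot encode categoricals, scale numericals, then reduce.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "color": ["red", "blue", "blue", "green", "red", "green"],
    "price": [10.0, 12.5, 20.0, 11.0, 9.5, 22.0],
    "weight": [1.1, 1.3, 2.0, 1.2, 1.0, 2.1],
})

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["size", "color"]),
    ("num", StandardScaler(), ["price", "weight"]),
])

# TruncatedSVD works directly on the (possibly sparse) one-hot output.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("reduce", TruncatedSVD(n_components=2)),
])

X_reduced = pipeline.fit_transform(df)
print(X_reduced.shape)  # (6, 2)
```

And a sketch of the second suggestion: reduce the numerical features separately (e.g., with PCA), then feed the integer-encoded categorical feature through a trainable Keras `Embedding` layer:

```python
from tensorflow import keras

n_reduced = 5    # number of reduced numerical features (assumption)
vocab_size = 10  # number of distinct categories (assumption)

num_input = keras.layers.Input(shape=(n_reduced,), name="reduced_numeric")
cat_input = keras.layers.Input(shape=(1,), dtype="int32", name="category_id")

# The embedding vectors are learned jointly with the rest of the network.
embed = keras.layers.Embedding(input_dim=vocab_size, output_dim=3)(cat_input)
embed = keras.layers.Flatten()(embed)

hidden = keras.layers.Concatenate()([num_input, embed])
hidden = keras.layers.Dense(32, activation="relu")(hidden)
output = keras.layers.Dense(1)(hidden)

model = keras.Model(inputs=[num_input, cat_input], outputs=[output])
model.compile(loss="mse", optimizer="adam")
```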

marcio1191 commented 3 years ago

Thanks @ageron for answering so quickly. The problem with one-hot encoding is that, for visualization and dimensionality reduction techniques like t-SNE and PCA, there is no "distance/variance" meaning associated with that kind of encoding. The algorithm will run, but the results for those features carry no real meaning. https://stackoverflow.com/questions/40795141/pca-for-categorical-features

ageron commented 3 years ago

Thanks @marcio1191, it seems I answered too quickly! 😅 You're right: if you choose the first option (one-hot encoding followed by dimensionality reduction), you have to be careful to use a dimensionality reduction algorithm that is compatible with binary values (PCA definitely isn't). I haven't looked into this question very closely, but it seems that Multiple Correspondence Analysis (MCA) should work.
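For reference, here is what MCA on the categorical columns could look like with the third-party `prince` package. This is a sketch only; the package is not used in the book and its API may differ between versions, and the data is made up:

```python
# Sketch only: assumes `pip install prince`; API may vary by version.
import pandas as pd
import prince

df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "color": ["red", "blue", "blue", "green", "red", "green"],
})

# MCA operates on the categorical columns only.
mca = prince.MCA(n_components=2, random_state=42)
mca = mca.fit(df)
coords = mca.transform(df)  # 2-D coordinates for each row
print(coords)
```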

A couple other approaches you could try:

Hope this helps.

marcio1191 commented 3 years ago

@ageron, thank you for your help. Regards, Marcio Fernandes

memo26167 commented 2 years ago

Hello, in this case encoding methods like Target Encoding or CatBoost Encoding may help. There are multiple category encoders in http://contrib.scikit-learn.org/category_encoders/. Kind regards, Guillermo Fonseca
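For example, a quick sketch with the `category_encoders` package linked above (the data and column names are made up): each category is replaced by a smoothed target statistic, so the result is numerical and distances in PCA/t-SNE become meaningful.

```python
# Sketch only: assumes `pip install category_encoders`; data is illustrative.
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({
    "color": ["red", "blue", "blue", "green", "red", "green"],
    "size": ["S", "M", "L", "M", "S", "L"],
})
y = pd.Series([10.0, 12.5, 20.0, 11.0, 9.5, 22.0])  # target used for encoding

# Target Encoding: each category becomes a smoothed mean of the target.
target_enc = ce.TargetEncoder(cols=["color", "size"])
X_target = target_enc.fit_transform(X, y)

# CatBoost Encoding: a drop-in alternative based on ordered target statistics.
catboost_enc = ce.CatBoostEncoder(cols=["color", "size"])
X_catboost = catboost_enc.fit_transform(X, y)

print(X_target.head())
```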