ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

[IDEA] Chapter 8 - Reduce dimensionality for datasets with categorical features #411

Open marcio1191 opened 3 years ago

marcio1191 commented 3 years ago

Hello, the book doesn't cover how to handle dimensionality reduction for datasets that also contain categorical features. How would you handle these situations? Thank you in advance. Regards

ageron commented 3 years ago

Hi @marcio1191, That's an interesting question! You could simply one-hot encode the categorical features and apply the dimensionality reduction algorithm after that. If you're training a neural network, you could apply the dimensionality reduction algorithm on the dataset excluding the categorical features, then add the categorical features as trainable embeddings. Hope this helps.
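Rough sketches of both suggestions follow. All column names, data, layer sizes and the choice of TruncatedSVD are illustrative assumptions, not from the book or this thread. First, one-hot encode the categorical columns, scale the numerical ones, then reduce dimensionality:

```python
# Sketch only: one-hot encode categoricals, scale numericals, then reduce.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "color": ["red", "blue", "blue", "green", "red", "green"],
    "price": [10.0, 12.5, 20.0, 11.0, 9.5, 22.0],
    "weight": [1.1, 1.3, 2.0, 1.2, 1.0, 2.1],
})

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["size", "color"]),
    ("num", StandardScaler(), ["price", "weight"]),
])

# TruncatedSVD works directly on the (possibly sparse) one-hot output.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("reduce", TruncatedSVD(n_components=2)),
])

X_reduced = pipeline.fit_transform(df)
print(X_reduced.shape)  # (6, 2)
```

And a sketch of the second suggestion: reduce the numerical features separately (e.g., with PCA), then feed the integer-encoded categorical feature through a trainable Keras `Embedding` layer:

```python
from tensorflow import keras

n_reduced = 5    # number of reduced numerical features (assumption)
vocab_size = 10  # number of distinct categories (assumption)

num_input = keras.layers.Input(shape=(n_reduced,), name="reduced_numeric")
cat_input = keras.layers.Input(shape=(1,), dtype="int32", name="category_id")

# The embedding vectors are learned jointly with the rest of the network.
embed = keras.layers.Embedding(input_dim=vocab_size, output_dim=3)(cat_input)
embed = keras.layers.Flatten()(embed)

hidden = keras.layers.Concatenate()([num_input, embed])
hidden = keras.layers.Dense(32, activation="relu")(hidden)
output = keras.layers.Dense(1)(hidden)

model = keras.Model(inputs=[num_input, cat_input], outputs=[output])
model.compile(loss="mse", optimizer="adam")
```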

marcio1191 commented 3 years ago

Thanks @ageron for answering so quickly. The problem with one-hot encoding is that, for visualization and dimensionality reduction techniques like t-SNE and PCA, there is no "distance/variance" meaning associated with that kind of encoding. The algorithm will run, but the results for those features carry no real meaning. https://stackoverflow.com/questions/40795141/pca-for-categorical-features

ageron commented 3 years ago

Thanks @marcio1191, it seems I answered too quickly! 😅 You're right: if you choose the first option (one-hot encoding followed by dimensionality reduction), you have to be careful to use a dimensionality reduction algorithm that is compatible with binary values (PCA definitely isn't). I haven't looked into this question very closely, but it seems that Multiple Correspondence Analysis (MCA) should work.
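For reference, here is what MCA on the categorical columns could look like with the third-party `prince` package. This is a sketch only; the package is not used in the book and its API may differ between versions, and the data is made up:

```python
# Sketch only: assumes `pip install prince`; API may vary by version.
import pandas as pd
import prince

df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "color": ["red", "blue", "blue", "green", "red", "green"],
})

# MCA operates on the categorical columns only.
mca = prince.MCA(n_components=2, random_state=42)
mca = mca.fit(df)
coords = mca.transform(df)  # 2-D coordinates for each row
print(coords)
```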

A couple other approaches you could try:

Hope this helps.

marcio1191 commented 3 years ago

@ageron, thank you for your help. Regards, Marcio Fernandes

memo26167 commented 2 years ago

Hello, in this case encoding methods like Target Encoding or CatBoost Encoding may help. There are multiple category encoders in http://contrib.scikit-learn.org/category_encoders/. Kind regards, Guillermo Fonseca
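For example, a quick sketch with the `category_encoders` package linked above (the data and column names are made up): each category is replaced by a smoothed target statistic, so the result is numerical and distances in PCA/t-SNE become meaningful.

```python
# Sketch only: assumes `pip install category_encoders`; data is illustrative.
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({
    "color": ["red", "blue", "blue", "green", "red", "green"],
    "size": ["S", "M", "L", "M", "S", "L"],
})
y = pd.Series([10.0, 12.5, 20.0, 11.0, 9.5, 22.0])  # target used for encoding

# Target Encoding: each category becomes a smoothed mean of the target.
target_enc = ce.TargetEncoder(cols=["color", "size"])
X_target = target_enc.fit_transform(X, y)

# CatBoost Encoding: a drop-in alternative based on ordered target statistics.
catboost_enc = ce.CatBoostEncoder(cols=["color", "size"])
X_catboost = catboost_enc.fit_transform(X, y)

print(X_target.head())
```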