Open marcio1191 opened 3 years ago
Hi @marcio1191, That's an interesting question! You could simply one-hot encode the categorical features and apply the dimensionality reduction algorithm after that. If you're training a neural network, you could apply the dimensionality reduction algorithm on the dataset excluding the categorical features, then add the categorical features as trainable embeddings. Hope this helps.
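For reference, the first option (one-hot encoding, then dimensionality reduction) could be sketched roughly as follows. This is only a minimal illustration, assuming scikit-learn and NumPy are installed; the toy data and the variable names (`X_num`, `X_cat`, `X_reduced`) are made up for the example, and PCA is used only as a stand-in for whichever reduction algorithm you pick.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (invented for illustration): two numeric columns, one categorical column.
X_num = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0],
                  [4.0, 210.0], [5.0, 190.0], [6.0, 230.0]])
X_cat = np.array([["red"], ["blue"], ["green"], ["red"], ["blue"], ["green"]])

# One-hot encode the categorical column; the encoder returns a sparse
# matrix by default, so convert it to a dense array.
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()

# Standardize the numeric columns so they are on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_num)

# Stack numeric and one-hot columns, then reduce to 2 dimensions.
X_all = np.hstack([X_scaled, X_onehot])
X_reduced = PCA(n_components=2).fit_transform(X_all)

print(X_reduced.shape)
```

The code runs on any mix of numeric and one-hot columns, but whether the resulting components are meaningful for the categorical part is a separate question.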
Thanks @ageron for answering so quickly. The problem with one-hot encoding is that, for visualization and dimensionality reduction techniques like t-SNE and PCA, the encoded features carry no meaningful notion of distance or variance. The algorithm will run, but its output for those features is hard to interpret. https://stackoverflow.com/questions/40795141/pca-for-categorical-features
Thanks @marcio1191, it seems I answered too quickly! 😅 You're right: if you choose the first option (one-hot encoding followed by dimensionality reduction), you have to be careful to use a dimensionality reduction algorithm that is compatible with binary values (PCA definitely isn't). I haven't looked into this question very closely, but it seems that Multiple Correspondence Analysis (MCA) should work.
A couple other approaches you could try:
Hope this helps.
@ageron, Thank you for your help. Regards Marcio Fernandes
Hello, In this case, encoding methods such as target encoding or CatBoost encoding may help. Several category encoders are available at http://contrib.scikit-learn.org/category_encoders/. Kind regards, Guillermo Fonseca
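To make the idea of target encoding concrete, here is a hand-rolled sketch in plain Python. It is not the `category_encoders` API: the helper names (`fit_target_encoding`, `apply_target_encoding`), the `smoothing` parameter and the toy data are all hypothetical, and the smoothing used is a simple blend of the category mean with the global mean, which may differ from what the library does.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=1.0):
    """Learn one number per category: a smoothed mean of the target."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    # Blend each category mean with the global mean; rare categories
    # are pulled more strongly toward the global mean.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Replace each category by its learned value (global mean if unseen)."""
    return [mapping.get(cat, global_mean) for cat in categories]

# Toy data (invented): a categorical feature and a binary target.
colors = ["red", "red", "blue", "blue", "blue", "green"]
bought = [1, 0, 1, 1, 1, 0]

mapping, prior = fit_target_encoding(colors, bought, smoothing=1.0)
encoded = apply_target_encoding(colors + ["purple"], mapping, prior)
print(encoded)
```

Each category becomes a single numeric column, so the encoded feature can be fed to a dimensionality reduction algorithm alongside the other numeric features. Since the encoding uses the target, it should be fitted on the training set only to avoid leakage.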
Hello, The book doesn't cover how to handle dimensionality reduction for datasets that include categorical features. How would you handle these situations? Thank you in advance. Regards