ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0
27.99k stars 12.8k forks

Cluster Algorithms on Categorical Attributes? #111

Closed (SuperYorio closed this issue 4 years ago)

SuperYorio commented 4 years ago

Hello Geron! Thank you for the great book; it is no understatement to say that it has helped me advance my career! As a data science research intern at a Medical School, I have a question:

What is YOUR preferred clustering algorithm for categorical attributes?

I'm trying to find ways to cluster a fake but close-to-real-world patient dataset that includes Gender, Race, and Medication & Procedural Record (the meat of the data). I understand that there are algorithms such as "K-Modes", but I am just curious about what your favorite algorithm would be!

Thank you again for the great book and your time! :)

ageron commented 4 years ago

Hi @SuperYorio ,

Thanks for your kind message, I'm really happy that you found my book helpful! :)

I haven't had the chance to tackle many categorical clustering problems, so I don't really have a preferred method, sadly. Besides looking at the basic clustering algorithms (like K-Means), I would take a look at autoencoders: you can train an autoencoder (using Embedding layers for the categorical attributes), and then cluster the learned codings using any clustering algorithm.
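To make the autoencoder idea concrete, here is a minimal sketch rather than a tested recipe. It assumes each categorical attribute has already been integer-encoded, and everything specific in it is hypothetical: the column names, their cardinalities, the synthetic random data, the layer sizes, and the number of clusters. Each attribute goes through its own `Embedding` layer, the embeddings are concatenated and compressed into a small coding vector, and the decoder tries to predict each original category with a softmax output. K-Means is then run on the codings.

```python
import numpy as np
from sklearn.cluster import KMeans
from tensorflow import keras

# Hypothetical integer-encoded categorical columns and their cardinalities
cardinalities = {"gender": 2, "race": 5, "medication": 20}
names = list(cardinalities)

# Synthetic stand-in data (random, so the clusters found here are meaningless)
rng = np.random.default_rng(42)
n_samples = 500
X = [rng.integers(0, cardinalities[name], size=(n_samples, 1)) for name in names]

# Encoder: one Embedding layer per categorical attribute
inputs, embedded = [], []
for name in names:
    inp = keras.layers.Input(shape=(1,), dtype="int32", name=name)
    emb = keras.layers.Embedding(input_dim=cardinalities[name], output_dim=4)(inp)
    inputs.append(inp)
    embedded.append(keras.layers.Flatten()(emb))  # (batch, 1, 4) -> (batch, 4)

concat = keras.layers.Concatenate()(embedded)
hidden = keras.layers.Dense(16, activation="relu")(concat)
codings = keras.layers.Dense(3, name="codings")(hidden)  # the learned codings

# Decoder: one softmax output per attribute, reconstructing its category
decoded = keras.layers.Dense(16, activation="relu")(codings)
outputs = [
    keras.layers.Dense(cardinalities[name], activation="softmax",
                       name=f"{name}_out")(decoded)
    for name in names
]

autoencoder = keras.Model(inputs=inputs, outputs=outputs)
autoencoder.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# The targets are the inputs themselves: the model learns to reconstruct them
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

# Cluster the learned codings with any clustering algorithm (K-Means here)
encoder = keras.Model(inputs=inputs, outputs=codings)
Z = encoder.predict(X, verbose=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(Z)
```

On real data you would need to tune the embedding size, the coding size, and the number of clusters, and check whether the resulting clusters make sense for your domain.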

There seems to be abundant literature on categorical clustering, so perhaps look for a recent paper that summarizes the state of the art in this domain?

Hope this helps a tiny bit...