Clustering: mimic new R chapter where we tune num clusters

UBC-DSCI / introduction-to-datascience-python

Open Source Textbook for DSCI100: Introduction to Data Science in Python

https://python.datasciencebook.ca

Clustering: mimic new R chapter where we tune num clusters #213

Open trevorcampbell opened 1 year ago

trevorcampbell commented 1 year ago

Right now in the Python version of the book, we tune the number of clusters manually: we run a pipeline for each $k$, manually extract the results, and plot them. This is closer to the old version of the R book. The new version of the R book uses tidyclust, whose tuning workflow is more closely aligned with the classification/regression chapters.
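For reference, the manual approach looks roughly like the sketch below. This is not the book's actual code: the synthetic `make_blobs` data, the range of $k$ values, and the names `rows` and `elbow` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the book's data set).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=1)

rows = []
for k in range(1, 10):
    # Fit one scaling + K-means pipeline per candidate number of clusters.
    pipe = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, n_init=10, random_state=1),
    )
    pipe.fit(X)
    # inertia_ is the total within-cluster sum of squared distances (WSSD).
    rows.append({"k": k, "wssd": pipe[-1].inertia_})

# One row per k; this is the table that would be plotted as an elbow plot.
elbow = pd.DataFrame(rows)
print(elbow)
```

The loop, the manual extraction of `inertia_`, and the hand-built data frame are the parts that differ from how tuning is done in the classification/regression chapters.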

Is there a similar update we can make to the py book?

Make sure to propagate this change to the worksheets if we do this.

joelostblom commented 1 year ago

I had the same thought and looked into this briefly. Based on what I found, I don't think it is easily possible. See https://github.com/scikit-learn/scikit-learn/issues/6154 for details. There are some workarounds suggested on SO, but nothing convenient.

The clusteval package might be something to look into, though I'm not sure it works with sklearn: https://stackoverflow.com/questions/34611038/grid-search-for-hyperparameter-evaluation-of-clustering-in-scikit-learn and https://erdogant.github.io/clusteval/pages/html/index.html
https://stackoverflow.com/questions/25633383/how-can-gridsearchcv-be-used-for-clustering-meanshift-or-dbscan
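To make the kind of workaround concrete, here is a minimal sketch of one option: pass `GridSearchCV` a single "split" that trains and scores on all rows, and rely on `KMeans.score`, which returns the negative of the within-cluster sum of squared distances. The synthetic data, the grid of $k$ values, and the names `all_rows` and `results` are assumptions for illustration. It runs, but it is a hack rather than a supported clustering workflow.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the book's data set).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=1)

pipe = make_pipeline(StandardScaler(), KMeans(n_init=10, random_state=1))

# A single "fold" that fits and scores on every row, so there is no
# real cross-validation; GridSearchCV is used only to loop over k.
all_rows = np.arange(X.shape[0])
grid = GridSearchCV(
    pipe,
    param_grid={"kmeans__n_clusters": list(range(1, 10))},
    cv=[(all_rows, all_rows)],
    refit=False,  # the "best" score always favours large k, so do not refit
)
grid.fit(X)  # no y is needed for clustering

# With no scoring argument, the pipeline's own score is used, which for
# KMeans is the negative WSSD; flip the sign to recover the WSSD.
results = pd.DataFrame({
    "k": [p["kmeans__n_clusters"] for p in grid.cv_results_["params"]],
    "wssd": -grid.cv_results_["mean_test_score"],
})
print(results)
```

This gives a results table similar in shape to the grid-search output in the classification/regression chapters, but the number of clusters still has to be read off an elbow plot, because the score keeps improving as $k$ grows.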