feature enhancement: add new clustering algorithms

ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data

http://hypertools.readthedocs.io/en/latest/

MIT License


feature enhancement: add new clustering algorithms #146

Open jeremymanning opened 7 years ago

jeremymanning commented 7 years ago

DBSCAN: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Gaussian mixture model: http://scikit-learn.org/stable/modules/mixture.html

Possibly others in: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster

alokkumary2j commented 6 years ago

@jeremymanning I am interested in working on this issue. Please let me know whether it is still open, relevant, and a high priority.

jeremymanning commented 6 years ago

@alokkumary2j this would be great! Let me or @andyheusser know if you need help getting started.

alokkumary2j commented 6 years ago

Thanks, @jeremymanning, for the confirmation. Do you or @andrewheusser have any particular thoughts on how to incorporate these changes? I would like to understand your thinking before I start. For example, the current definition of cluster supports n_clusters as a parameter, which the scikit-learn DBSCAN implementation does not support directly. Also, what would be the best medium for discussing the relevant issues in detail?
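To make the n_clusters mismatch concrete, here is a minimal sketch that calls the two scikit-learn estimators directly. It assumes scikit-learn and NumPy are installed; the toy data and the parameter values are made up for illustration and are not taken from hypertools.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# Toy data (an assumption for illustration): two tight, well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(20, 2)),
])

# DBSCAN has no n_clusters argument: it is tuned through eps and min_samples
# and discovers the number of clusters itself (label -1 marks noise points).
db_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)
n_found = len(set(db_labels.tolist()) - {-1})

# GaussianMixture does take a cluster count, but calls it n_components.
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

print("DBSCAN found", n_found, "clusters")
print("GaussianMixture used", len(set(gmm_labels.tolist())), "components")
```

The point of the comparison is only the call signatures: one estimator accepts a cluster count under a different name, the other does not accept one at all.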

andrewheusser commented 6 years ago

My opinion would be to keep the hypertools cluster API as consistent as possible, even if the scikit-learn API is different. For the Gaussian mixture model, I would map n_components -> n_clusters (that is, the user passes n_clusters and we hand it to scikit-learn as n_components), while leaving all of the other arguments the same. Since DBSCAN discovers the number of clusters on its own, if the user passes n_clusters together with cluster='DBSCAN', I would just raise a warning that says n_clusters is being ignored.
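A minimal sketch of that mapping, using only the standard library. The helper name translate_cluster_args is hypothetical and this is not the hypertools implementation; it only shows how the keyword arguments could be translated before being handed to the scikit-learn estimator.

```python
import warnings


def translate_cluster_args(cluster, n_clusters=None, **kwargs):
    """Hypothetical helper: build estimator kwargs from hypertools-style args.

    `cluster` is the model name as a string; all other keyword arguments
    are passed through unchanged.
    """
    params = dict(kwargs)
    if n_clusters is None:
        return params

    if cluster == 'GaussianMixture':
        # scikit-learn calls this parameter n_components.
        params['n_components'] = n_clusters
    elif cluster == 'DBSCAN':
        # DBSCAN discovers the number of clusters, so the value is dropped.
        warnings.warn(
            "n_clusters is ignored when cluster='DBSCAN'; "
            "DBSCAN determines the number of clusters from the data.",
            UserWarning,
        )
    else:
        # Models such as KMeans accept n_clusters directly.
        params['n_clusters'] = n_clusters
    return params


print(translate_cluster_args('GaussianMixture', n_clusters=3))
print(translate_cluster_args('KMeans', n_clusters=4))
```

With this shape, the user-facing signature stays the same for every model, and only the translation step knows about scikit-learn's naming differences.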

As for other algorithms, we've already added a few: KMeans, MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration, SpectralClustering

If there are more you have in mind, we would be open to discussing more! Either here or on our gitter: https://gitter.im/hypertools/Lobby

@jeremymanning - could we get your opinion on this as well?

alokkumary2j commented 6 years ago

Since the Gaussian mixture model was relatively straightforward, I completed the work on it and ran the pytest suite, with all 97 test cases passing. @jeremymanning and @andrewheusser, could you please have a look and share your feedback?

Additionally, I would like your feedback on the points below:

- We can probably provide custom messages to users when wrong parameters or bad input are passed.
- If we plan to support more and more clustering algorithms (of a heterogeneous nature), we might want to refactor our code. If we don't, I fear we will end up with many "if/else" branches in the cluster method.
- I observed the plot function taking a long time when given custom labels.
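One common way to avoid a growing if/else chain is a table-driven lookup. The sketch below is only an illustration of that idea: CLUSTER_SPECS and resolve_params are hypothetical names, the table lists just three models, and none of this is taken from the hypertools code base. It also shows where a custom error message for bad input could live.

```python
import warnings

# Hypothetical registry: for each supported model, the name of the estimator
# argument that receives the cluster count, or None if the model has none.
CLUSTER_SPECS = {
    'KMeans': 'n_clusters',
    'GaussianMixture': 'n_components',
    'DBSCAN': None,
}


def resolve_params(cluster, n_clusters=None, **kwargs):
    """Translate hypertools-style arguments using the registry above."""
    if cluster not in CLUSTER_SPECS:
        # Custom message for a bad model name instead of a bare KeyError.
        raise ValueError(
            f"Unknown cluster model {cluster!r}; "
            f"expected one of {sorted(CLUSTER_SPECS)}"
        )

    params = dict(kwargs)
    count_arg = CLUSTER_SPECS[cluster]
    if n_clusters is not None:
        if count_arg is None:
            warnings.warn(
                f"n_clusters is ignored when cluster={cluster!r}.",
                UserWarning,
            )
        else:
            params[count_arg] = n_clusters
    return params


print(resolve_params('GaussianMixture', n_clusters=3))
```

Supporting another algorithm then means adding one entry to the table rather than another branch in the function.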

Please share your thoughts so that I can continue my work accordingly.

andrewheusser commented 6 years ago

@alokkumary2j awesome! thanks for your work on this! We'll take a look and approve or make suggestions.

To address your questions:

> We can probably provide custom messages to users when wrong parameters or bad input are passed.
Yes, this would be great in the case where a user passes n_clusters and uses DBSCAN. Did you have another case in mind?

> If we plan to support more and more clustering algorithms (of a heterogeneous nature), we might want to refactor our code. If we don't, I fear we will end up with many "if/else" branches in the cluster method.
Potentially. I'll take a look at your code and see if it makes sense to refactor.

> I observed the plot function taking a long time when given custom labels.
Can you submit an issue with some code to replicate this? That would be the cleanest way to figure out what's going on.

thanks!