Open jeremymanning opened 7 years ago
@jeremymanning I am interested in working on this issue. Please let me know if it is still open, relevant, and high priority.
@alokkumary2j this would be great! Let me or @andyheusser know if you need help getting started.
Thanks @jeremymanning for the confirmation. Do you or @andrewheusser have any particular thoughts on how to incorporate these changes? I would like to understand your views before I start. For example, the current definition of cluster supports n_clusters as a parameter, which is not directly supported by the scikit-learn DBSCAN implementation. Also, what would be the best medium for discussing the relevant issues in detail?
My opinion would be to keep the hypertools cluster API as consistent as possible, even if the scikit-learn API is different. For the Gaussian mixture model, I would map n_components -> n_clusters, while leaving all of the rest of the arguments the same. Since DBSCAN discovers clusters, if the user inputs n_clusters and cluster='DBSCAN', I would just throw a warning that says n_clusters is being ignored.
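A rough sketch of the mapping described above, in case it helps the discussion. The helper name `resolve_cluster_params` and its signature are assumptions made for illustration and are not part of hypertools; it only builds the keyword arguments that would be handed to the scikit-learn estimator, so it runs without scikit-learn installed.

```python
import warnings


def resolve_cluster_params(cluster, n_clusters=None, **params):
    """Hypothetical helper (not hypertools code): translate the public
    `n_clusters` argument into the keyword the chosen estimator expects."""
    params = dict(params)  # copy so the caller's dict is not mutated

    if cluster == 'DBSCAN':
        # DBSCAN discovers the number of clusters itself, so warn and drop it.
        if n_clusters is not None:
            warnings.warn("n_clusters is ignored when cluster='DBSCAN'",
                          UserWarning)
        return params

    if n_clusters is not None:
        if cluster == 'GaussianMixture':
            # scikit-learn's GaussianMixture calls this n_components.
            params['n_components'] = n_clusters
        else:
            params['n_clusters'] = n_clusters
    return params
```

With this shape, the user-facing argument stays `n_clusters` for every algorithm, and only the helper knows about the per-estimator differences.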
As for other algorithms, we've already added a few: KMeans, MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration, SpectralClustering
If there are more you have in mind, we would be open to discussing more! Either here or on our gitter: https://gitter.im/hypertools/Lobby
@jeremymanning - could we get your opinion on this as well?
Since the Gaussian Mixture Model was relatively straightforward, I completed the work on it and ran the pytest suite, with all 97 test cases passing. Could you (@jeremymanning and @andrewheusser) have a look and share your feedback?
Additionally, I would like your feedback on the points below:
Please share your thoughts, based on which I will continue my work.
@alokkumary2j awesome! thanks for your work on this! We'll take a look and approve or make suggestions.
To address your questions:
"We can probably provide custom messages to the users, in case wrong parameters or bad input is passed."
Yes, this would be great in the case where a user passes n_clusters and uses DBSCAN. Did you have another case in mind?
"Also, if we plan to support more and more clustering algorithms (of heterogeneous nature), we might want to refactor our code. If we don't do that, I fear to see many 'if/else' scenarios in the cluster method."
Potentially. I'll take a look at your code and see if it makes sense to refactor.
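One possible direction for such a refactor, sketched under assumptions: a lookup table of small per-algorithm adapters instead of a growing if/else chain. All names here (`ADAPTERS`, `build_params`, the `_adapt_*` functions) are hypothetical and not taken from the hypertools code base; the sketch only prepares keyword arguments and raises a custom message for unknown algorithm names.

```python
import warnings


def _adapt_default(n_clusters, params):
    # Estimators that accept n_clusters directly (e.g. KMeans, Birch).
    if n_clusters is not None:
        params['n_clusters'] = n_clusters
    return params


def _adapt_gmm(n_clusters, params):
    # GaussianMixture names the same concept n_components.
    if n_clusters is not None:
        params['n_components'] = n_clusters
    return params


def _adapt_dbscan(n_clusters, params):
    # DBSCAN has no cluster-count argument; warn instead of failing.
    if n_clusters is not None:
        warnings.warn("n_clusters is ignored when cluster='DBSCAN'",
                      UserWarning)
    return params


# Hypothetical registry: supporting a new algorithm means adding one entry.
ADAPTERS = {
    'KMeans': _adapt_default,
    'MiniBatchKMeans': _adapt_default,
    'AgglomerativeClustering': _adapt_default,
    'Birch': _adapt_default,
    'FeatureAgglomeration': _adapt_default,
    'SpectralClustering': _adapt_default,
    'GaussianMixture': _adapt_gmm,
    'DBSCAN': _adapt_dbscan,
}


def build_params(cluster, n_clusters=None, **params):
    """Return the keyword arguments for the requested algorithm."""
    try:
        adapter = ADAPTERS[cluster]
    except KeyError:
        supported = ', '.join(sorted(ADAPTERS))
        raise ValueError(
            f"Unknown cluster algorithm {cluster!r}; supported: {supported}"
        ) from None
    return adapter(n_clusters, dict(params))
```

The trade-off is a little indirection in exchange for keeping the cluster method itself free of per-algorithm branches.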
"I observed the plot function taking a long time when provided custom labels."
Can you submit an issue with some code to replicate this? That would be the cleanest way to figure out what's going on.
Thanks!
Possibly others in: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster