Closed: lmcinnes closed this 6 years ago
Awesome! Thanks so much for the contribution. I think these are worth including, but I'll defer to @jeremymanning for the decision on that. It looks like the code is currently crashing because if the library is not found, the umap/hdbscan variable is undeclared. Perhaps declare it as None if the import statement fails? (See the sketch below.)
Also, are these models included in sklearn, or will they be in the future? If so, we could leave the packages as optional for now, and then include them once they are merged into sklearn. Another question: do they follow the sklearn API (i.e., fit, transform, fit_transform)? The code will not work unless they follow that pattern.
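For reference, a minimal sketch of the optional-import pattern being suggested here; the function name `reduce_with_umap` and the error message are illustrative, not hypertools' actual code:

```python
# Declare the optional modules as None when the import fails, so later
# checks don't raise NameError.
try:
    import umap  # provided by the umap-learn package
except ImportError:
    umap = None

try:
    import hdbscan
except ImportError:
    hdbscan = None

def reduce_with_umap(data, **kwargs):
    # Hypothetical wrapper: fail with a clear message if the optional
    # dependency is missing, otherwise use the sklearn-style API.
    if umap is None:
        raise ImportError('UMAP is not installed; install it with '
                          '`pip install umap-learn`.')
    return umap.UMAP(**kwargs).fit_transform(data)
```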
Hi, thanks for catching the declaration issue. Hopefully that is now fixed.
Both packages follow the sklearn API (and inherit from the appropriate sklearn base classes). In the case of HDBSCAN, inclusion in sklearn is something that may happen in a release or two. UMAP is newer and will have a longer path to inclusion in sklearn proper, but the speed improvements over t-SNE are quite significant, especially for high-dimensional data (and non-standard metrics), so I am hopeful it will get there in due course.
Thanks for your contribution!
Overall this looks great and I'm excited to support these new algorithms. Some suggested changes:

- Add tests for the new models (e.g., calling `hyp.reduce` with them) and verify that they return the expected answers. (You just need to test `hyp.reduce`, since `plot` and `cluster` both call `reduce` internally.)

What about the following (see the sketch after this list):

- By default, `cluster` is `None` and `n_clusters` is `None`. The status of `cluster` is checked to see if the user wants to cluster. (If `cluster == None`, don't cluster; if not `None`, cluster using whatever model is specified.)
- The `cluster` argument specifies the clustering model and, optionally, its parameters (e.g., `n_clusters` for `KMeans`, etc.). If the user passes a string to specify the clustering model, use the default parameters. If they pass a dictionary of parameters, use whatever parameters they specify and fill in the rest with the default values for that algorithm.
- If `cluster` is `None` but `n_clusters` is not `None`, use KMeans clustering with K = `n_clusters`. In other words, `n_clusters` is a shortcut way of applying KMeans clustering.
- If both `cluster` and `n_clusters` aren't `None`, and the specified algorithm doesn't have an `n_clusters` parameter (or the equivalent), output a warning (e.g., "The ___ clustering algorithm automatically determines the number of clusters; ignoring n_clusters argument") and then proceed (but ignore the `n_clusters` parameter).

I think this will work, thanks!
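A hedged sketch of that dispatch logic; the function name, the dictionary format for `cluster`, and the model registry below are illustrative assumptions, not the actual hypertools implementation:

```python
import warnings
from sklearn.cluster import KMeans

def resolve_cluster_model(cluster=None, n_clusters=None):
    # Both None: the user does not want clustering.
    if cluster is None and n_clusters is None:
        return None
    # n_clusters alone is shorthand for KMeans with K = n_clusters.
    if cluster is None:
        return KMeans(n_clusters=n_clusters)
    # cluster may be a string (use defaults) or, by assumption here,
    # a dict like {'model': 'KMeans', 'params': {...}}.
    if isinstance(cluster, str):
        name, params = cluster, {}
    else:
        name, params = cluster['model'], dict(cluster.get('params', {}))
    # A real registry would list every supported clustering model.
    model_cls = {'KMeans': KMeans}[name]
    if n_clusters is not None:
        if 'n_clusters' in model_cls().get_params():
            params.setdefault('n_clusters', n_clusters)
        else:
            warnings.warn('The %s clustering algorithm automatically '
                          'determines the number of clusters; ignoring '
                          'n_clusters argument' % name)
    return model_cls(**params)
```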
That sounds like a good and fairly comprehensive approach, and it should leave room for other alternative clustering algorithms later if desired (which is good). I have implemented this in what I believe is a relatively clear way, but please review 42369ed to verify it does what you want.
Thanks for the clean up!
No problem! Just made a few quick edits, including a new function to retrieve/update default model params. In a separate PR, I'll do the same for the other functions (reduce, align, ...).
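Presumably something along these lines; the function name and the defaults table below are hypothetical, not the actual code from the PR:

```python
# Hypothetical defaults table keyed by model name; real defaults would
# mirror each algorithm's own documented defaults.
DEFAULT_PARAMS = {
    'KMeans': {'n_clusters': 8},
    'HDBSCAN': {'min_cluster_size': 5},
}

def get_model_params(model_name, user_params=None):
    """Start from the algorithm's defaults, then overlay user-supplied values."""
    params = dict(DEFAULT_PARAMS.get(model_name, {}))
    params.update(user_params or {})
    return params
```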
Hmm, the test build seems to be crashing on the llvmlite installation (requirement of HDBSCAN?) on Python 3.4 but not on the other versions (2.7, 3.5, 3.6). @lmcinnes any ideas why?
It is UMAP that is pulling in llvmlite. I honestly don't know what's up with the python 3.4 version there. I admit that this is definitely not my area of expertise and I am very much in the dark as to how or why this should fail for this one case.
All good! We'll figure it out. Looks like it's crashing because umap-learn requires scipy>0.19. I don't see any problem with updating hypertools to require scipy>0.19.
@lmcinnes looks like upgrading scipy to version 1.0.0 fixed it. @jeremymanning - I think this is ready to merge!
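For the record, the dependency bump would amount to something like this in setup.py (illustrative only; hypertools' actual requirements list is longer and may pin differently):

```python
# Raise the scipy floor so umap-learn's requirement (scipy>0.19) is met;
# 1.0.0 is the version that fixed the Python 3.4 build here.
install_requires = [
    'scipy>=1.0.0',
    'scikit-learn',
]
```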
@andrewheusser want to do the honors and merge once the tests finish (assuming they pass)?
Thanks for your contribution, @lmcinnes!
A quick proposal to add HDBSCAN for clustering and UMAP for reduction.
HDBSCAN is a hierarchical density-based clustering approach similar to DBSCAN. Like DBSCAN, it labels some points as "noise"; that may or may not play nicely with the rest of the code.
UMAP is a dimension reduction technique with similar output to t-SNE, but it can run much faster and scale to larger datasets.
I left the packages as optional for now and simply added a warning if they are unavailable. If you have a preferred way of handling that, let me know.
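A short usage sketch (not from the PR itself) showing that both packages follow the sklearn fit/transform pattern, and that HDBSCAN marks noise points with the label -1:

```python
import numpy as np
import umap
import hdbscan

data = np.random.rand(100, 50)  # toy high-dimensional data

# UMAP follows the sklearn transformer API.
embedding = umap.UMAP(n_components=2).fit_transform(data)

# HDBSCAN follows the sklearn clusterer API; noise points get label -1.
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embedding)

print(embedding.shape)   # (100, 2)
print(np.unique(labels)) # any noise points appear as -1
```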