Closed: lmcinnes closed this 6 years ago
Awesome! Thanks so much for the contribution. I think these are worth including, but I'll defer to @jeremymanning for the decision on that. It looks like the code is currently crashing because if the library is not found, the umap/hdbscan variable is undeclared. Perhaps declare it as None if the import statement fails? (See the sketch below.)
Also, are these models included in sklearn, or will they be in the future? If so, we could leave the packages as optional for now, and then include them once they are merged into sklearn. Another question: do they follow the sklearn API (i.e., fit, transform, fit_transform)? The code will not work unless they follow that pattern.
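For reference, a minimal sketch of the optional-import pattern being suggested here; the function name `reduce_with_umap` and the error message are illustrative, not hypertools' actual code:

```python
# Declare the optional modules as None when the import fails, so later
# checks don't raise NameError.
try:
    import umap  # provided by the umap-learn package
except ImportError:
    umap = None

try:
    import hdbscan
except ImportError:
    hdbscan = None

def reduce_with_umap(data, **kwargs):
    # Hypothetical wrapper: fail with a clear message if the optional
    # dependency is missing, otherwise use the sklearn-style API.
    if umap is None:
        raise ImportError('UMAP is not installed; install it with '
                          '`pip install umap-learn`.')
    return umap.UMAP(**kwargs).fit_transform(data)
```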
Hi, thanks for catching the declaration issue. Hopefully that is now fixed.
Both packages follow the sklearn API (and inherit from the appropriate sklearn base classes). In the case of HDBSCAN, inclusion in sklearn is something that may happen in a release or two. UMAP is newer and will have a longer path to inclusion in sklearn proper, but the speed improvements over t-SNE are quite significant, especially for high-dimensional data (and non-standard metrics), so I am hopeful it will get there in due course.
Thanks for your contribution!
Overall this looks great and I'm excited to support these new algorithms. Some suggested changes:

- Add tests for the new models (e.g., calling `hyp.reduce` with them) and verify that they return the expected answers. (You just need to test `hyp.reduce`, since `plot` and `cluster` both call `reduce` internally.)

What about the following (see the sketch after this list):

- By default, `cluster` is `None` and `n_clusters` is `None`. The status of `cluster` is checked to see if the user wants to cluster. (If `cluster == None`, don't cluster; if not `None`, cluster using whatever model is specified.)
- The `cluster` argument specifies the clustering model and, optionally, its parameters (e.g., `n_clusters` for `KMeans`, etc.). If the user passes a string to specify the clustering model, use the default parameters. If they pass a dictionary of parameters, use whatever parameters they specify and fill in the rest with the default values for that algorithm.
- If `cluster` is `None` but `n_clusters` is not `None`, use KMeans clustering with K = `n_clusters`. In other words, `n_clusters` is a shortcut way of applying KMeans clustering.
- If both `cluster` and `n_clusters` aren't `None`, and the specified algorithm doesn't have an `n_clusters` parameter (or the equivalent), output a warning (e.g., "The ___ clustering algorithm automatically determines the number of clusters; ignoring n_clusters argument") and then proceed (but ignore the `n_clusters` parameter).

I think this will work, thanks!
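A hedged sketch of that dispatch logic; the function name, the dictionary format for `cluster`, and the model registry below are illustrative assumptions, not the actual hypertools implementation:

```python
import warnings
from sklearn.cluster import KMeans

def resolve_cluster_model(cluster=None, n_clusters=None):
    # Both None: the user does not want clustering.
    if cluster is None and n_clusters is None:
        return None
    # n_clusters alone is shorthand for KMeans with K = n_clusters.
    if cluster is None:
        return KMeans(n_clusters=n_clusters)
    # cluster may be a string (use defaults) or, by assumption here,
    # a dict like {'model': 'KMeans', 'params': {...}}.
    if isinstance(cluster, str):
        name, params = cluster, {}
    else:
        name, params = cluster['model'], dict(cluster.get('params', {}))
    # A real registry would list every supported clustering model.
    model_cls = {'KMeans': KMeans}[name]
    if n_clusters is not None:
        if 'n_clusters' in model_cls().get_params():
            params.setdefault('n_clusters', n_clusters)
        else:
            warnings.warn('The %s clustering algorithm automatically '
                          'determines the number of clusters; ignoring '
                          'n_clusters argument' % name)
    return model_cls(**params)
```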
That sounds like a good and fairly comprehensive approach, and it should leave room for other alternative clustering algorithms later if desired (which is good). I have implemented this in what I believe is a relatively clear way, but please review 42369ed to verify it does what you want.
Thanks for the clean up!
No problem! Just made a few quick edits, including a new function to retrieve/update default model params. In a separate PR, I'll do the same for the other functions (reduce, align, ...).
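Presumably something along these lines; the function name and the defaults table below are hypothetical, not the actual code from the PR:

```python
# Hypothetical defaults table keyed by model name; real defaults would
# mirror each algorithm's own documented defaults.
DEFAULT_PARAMS = {
    'KMeans': {'n_clusters': 8},
    'HDBSCAN': {'min_cluster_size': 5},
}

def get_model_params(model_name, user_params=None):
    """Start from the algorithm's defaults, then overlay user-supplied values."""
    params = dict(DEFAULT_PARAMS.get(model_name, {}))
    params.update(user_params or {})
    return params
```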
Hmm, the test build seems to be crashing on the llvmlite installation (requirement of HDBSCAN?) on Python 3.4 but not on the other versions (2.7, 3.5, 3.6). @lmcinnes any ideas why?
It is UMAP that is pulling in llvmlite. I honestly don't know what's up with the python 3.4 version there. I admit that this is definitely not my area of expertise and I am very much in the dark as to how or why this should fail for this one case.
All good! We'll figure it out. Looks like it's crashing because umap-learn requires scipy>0.19. I don't see any problem with updating hypertools to require scipy>0.19.
@lmcinnes looks like upgrading scipy to version 1.0.0 fixed it. @jeremymanning - I think this is ready to merge!
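For the record, the dependency bump would amount to something like this in setup.py (illustrative only; hypertools' actual requirements list is longer and may pin differently):

```python
# Raise the scipy floor so umap-learn's requirement (scipy>0.19) is met;
# 1.0.0 is the version that fixed the Python 3.4 build here.
install_requires = [
    'scipy>=1.0.0',
    'scikit-learn',
]
```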
@andrewheusser want to do the honors and merge once the tests finish (assuming they pass)?
Thanks for your contribution, @lmcinnes!
A quick proposal to add HDBSCAN for clustering and UMAP for reduction.
HDBSCAN is a hierarchical density-based clustering approach similar to DBSCAN. Like DBSCAN, it labels some points as "noise"; that may or may not play nicely with the rest of the code.
UMAP is a dimension reduction technique with similar output to t-SNE, but it can run much faster and scale to larger datasets.
I left the packages as optional for now and simply added a warning if they are unavailable. If you have a preferred way of handling that, let me know.
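A short usage sketch (not from the PR itself) showing that both packages follow the sklearn fit/transform pattern, and that HDBSCAN marks noise points with the label -1:

```python
import numpy as np
import umap
import hdbscan

data = np.random.rand(100, 50)  # toy high-dimensional data

# UMAP follows the sklearn transformer API.
embedding = umap.UMAP(n_components=2).fit_transform(data)

# HDBSCAN follows the sklearn clusterer API; noise points get label -1.
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embedding)

print(embedding.shape)   # (100, 2)
print(np.unique(labels)) # any noise points appear as -1
```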