ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

Add HDBSCAN and UMAP as options for clustering and reducing. #180

Closed lmcinnes closed 6 years ago

lmcinnes commented 6 years ago

A quick proposal to add HDBSCAN for clustering and UMAP for reduction.

HDBSCAN is a hierarchical density-based clustering approach similar to DBSCAN. Like DBSCAN, it labels some points as "noise"; that may or may not play nicely with the rest of the code.
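
For concreteness, a minimal sketch of the noise behavior (assuming the standalone hdbscan package; noise points get the label -1, as in sklearn's DBSCAN):

```python
import numpy as np
import hdbscan

X = np.random.rand(100, 3)  # toy data

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(X)

# points HDBSCAN could not assign to any cluster are labeled -1 ("noise")
noise = (labels == -1).sum()
print('%d of %d points labeled as noise' % (noise, len(labels)))
```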

UMAP is a dimension reduction technique that produces output similar to t-SNE's but can run much faster and scales to larger datasets.
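
And a similarly minimal UMAP sketch (assuming the umap-learn package; the call pattern mirrors sklearn's TSNE):

```python
import numpy as np
import umap

X = np.random.rand(100, 50)  # toy high-dimensional data

# reduce to 2 dimensions via the standard fit_transform call
embedding = umap.UMAP(n_components=2).fit_transform(X)
print(embedding.shape)  # (100, 2)
```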

I left the packages as optional for now and simply added a warning if they are unavailable. If you have a preferred way of handling that, let me know.

andrewheusser commented 6 years ago

Awesome! Thanks so much for the contribution. I think these are worth including, but I'll defer to @jeremymanning for the decision on that. It looks like the code is currently crashing because, if the library is not found, the umap/hdbscan variable is undeclared. Perhaps declare it as None if the import statement fails?
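
Something like this guarded-import pattern would avoid the NameError while still warning (a sketch, not the exact hypertools code):

```python
import warnings

try:
    import hdbscan
except ImportError:
    hdbscan = None
    warnings.warn('hdbscan not installed; HDBSCAN clustering will be unavailable')

try:
    import umap
except ImportError:
    umap = None
    warnings.warn('umap-learn not installed; UMAP reduction will be unavailable')

# downstream code can now safely test, e.g., `if hdbscan is None: ...`
```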

andrewheusser commented 6 years ago

Also - are these models included in sklearn, or will they be in the future? If so, we could leave the packages as optional for now and then include them once they are merged into sklearn. Another question - do they follow the sklearn API (i.e., fit, transform, fit_transform)? The code will not work unless they follow that pattern.
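
A quick duck-typing check is enough to confirm compatibility, since hypertools only needs the standard sklearn methods (sketch for illustration):

```python
import hdbscan
import umap

# clusterers should expose fit / fit_predict; reducers should
# expose fit / transform / fit_transform
for model in (hdbscan.HDBSCAN(), umap.UMAP()):
    methods = [m for m in ('fit', 'fit_predict', 'transform', 'fit_transform')
               if hasattr(model, m)]
    print(type(model).__name__, methods)
```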

lmcinnes commented 6 years ago

Hi, thanks for catching the declaration issue. Hopefully that is now fixed.

Both packages follow the sklearn API (and inherit from the appropriate sklearn base classes). In the case of HDBSCAN, inclusion in sklearn is something that may happen in a release or two. UMAP is newer and will have a longer path to inclusion in sklearn proper, but the speed improvements over t-SNE are quite significant, especially for high-dimensional data (and non-standard metrics), so I am hopeful it will get there in due course.
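
One practical consequence of subclassing BaseEstimator is that sklearn-style parameter introspection works out of the box, e.g.:

```python
from sklearn.base import BaseEstimator
import hdbscan

# get_params / set_params behave just like any sklearn estimator's
model = hdbscan.HDBSCAN(min_cluster_size=10)
print(isinstance(model, BaseEstimator))        # True
print(model.get_params()['min_cluster_size'])  # 10
```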

jeremymanning commented 6 years ago

Thanks for your contribution!

Overall this looks great and I'm excited to support these new algorithms. Some suggested changes:

jeremymanning commented 6 years ago

What about the following:

andrewheusser commented 6 years ago

I think this will work, thanks!

lmcinnes commented 6 years ago

That sounds like a good and fairly comprehensive approach, and it should leave room for other alternative clustering algorithms later if desired (which is good). I have implemented this in what I believe is a relatively clear way, but please review 42369ed to verify it does what you wanted.

lmcinnes commented 6 years ago

Thanks for the clean up!

andrewheusser commented 6 years ago

No problem! Just made a few quick edits, including a new function to retrieve/update default model params. In a separate PR, I'll do the same for the other functions (reduce, align, ...).
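
A minimal sketch of what such a helper might look like (the names here are illustrative, not the actual hypertools function):

```python
def get_default_params(model_name, user_params=None):
    """Return default params for a named model, updated with user overrides."""
    defaults = {
        'HDBSCAN': {'min_cluster_size': 5},
        'KMeans': {'n_clusters': 8},
    }
    params = dict(defaults.get(model_name, {}))
    params.update(user_params or {})
    return params

# user-specified values override the defaults
print(get_default_params('HDBSCAN', {'min_cluster_size': 15}))
# -> {'min_cluster_size': 15}
```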

Hmm, the test build seems to be crashing on the llvmlite installation (a requirement of HDBSCAN?) on Python 3.4 but not on the other versions (2.7, 3.5, 3.6). @lmcinnes, any ideas why?

lmcinnes commented 6 years ago

It is UMAP that is pulling in llvmlite. I honestly don't know what's up with the Python 3.4 version there. I admit this is definitely not my area of expertise, and I am very much in the dark as to how or why this should fail for this one case.

andrewheusser commented 6 years ago

All good! We'll figure it out. Looks like it's crashing because umap-learn requires scipy>0.19. I don't see any problem with updating hypertools to require scipy>0.19.

andrewheusser commented 6 years ago

@lmcinnes looks like upgrading scipy to version 1.0.0 fixed it. @jeremymanning - I think this is ready to merge!
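
For anyone following along, the fix amounts to bumping the scipy pin in setup.py, along these lines (an illustrative excerpt, not the actual diff):

```python
from setuptools import setup

setup(
    name='hypertools',
    install_requires=[
        'scipy>=1.0.0',  # umap-learn needs scipy>0.19; 1.0.0 fixed the 3.4 build
        'umap-learn',
        'hdbscan',
    ],
)
```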

jeremymanning commented 6 years ago

@andrewheusser want to do the honors and merge once the tests finish (assuming they pass)?

jeremymanning commented 6 years ago

Thanks for your contribution, @lmcinnes!