TutteInstitute / fast_hdbscan

A fast multi-core implementation of HDBSCAN for low dimensional Euclidean spaces
BSD 2-Clause "Simplified" License

Feature request: Support for metric='precomputed' #5

Open RichieHakim opened 1 year ago

RichieHakim commented 1 year ago

I'm currently using vanilla HDBSCAN to cluster a precomputed sparse distance matrix, passed in as a scipy.sparse.csr_matrix object. I'm very eager to use fast_hdbscan, primarily due to its easier compilation requirements, as I'm attempting to ship a tool that uses hdbscan as a step in a pipeline.

Currently, I believe clustering on precomputed sparse distance matrices is not supported in fast_hdbscan. I think it would require porting some of the following functions:

  • hdbscan_._hdbscan_sparse_distance_matrix
  • _hdbscan_reachability.sparse_mutual_reachability
  • _hdbscan_linkage.label

Unfortunately, I don't think I'm able to implement this myself, though I'm happy to help test any PRs with a basic implementation. Thank you for the great package; I really hope I'll be able to use it soon!
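
For reference, here is a minimal sketch of the kind of call I'm currently making with vanilla hdbscan (the toy distance matrix below is just illustrative, not from my actual pipeline):

```python
import numpy as np
import scipy.sparse
import hdbscan

# Toy symmetric sparse distance matrix (5 samples). Unstored entries are
# treated as unknown / effectively infinite distances; in my experience each
# point needs at least min_samples stored neighbours and the graph should be
# connected for the sparse code path to behave.
rows = np.array([0, 1, 0, 2, 1, 2, 2, 3, 2, 4, 3, 4])
cols = np.array([1, 0, 2, 0, 2, 1, 3, 2, 4, 2, 4, 3])
vals = np.array([0.10, 0.10, 0.20, 0.20, 0.15, 0.15,
                 0.90, 0.90, 0.80, 0.80, 0.10, 0.10])
distance_matrix = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(5, 5))

# This is the call I'd like to be able to make against fast_hdbscan as well.
clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
labels = clusterer.fit_predict(distance_matrix)
```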

lmcinnes commented 1 year ago

There are some quirks with sparse precomputed distance matrices that can make corner cases a little tricky. I'll see if I can get something done, though I can't promise any timeframes.

RichieHakim commented 1 year ago

Thank you so much for looking into this. I am very motivated to help if you think it's possible to delegate anything. For what it's worth, this is how hdbscan is being used in the project I'm working on: https://github.com/RichieHakim/ROICaT/blob/dev/roicat/tracking/clustering.py#L420

Perhaps the tricks/hacks being used to get the desired behavior are of interest:

  1. I'm using a very custom sparse distance matrix as input.
  2. Since the graph has multiple disconnected components, I need to add a fully connected node before clustering (sketched below).
  3. Since some sample pairs are known to be disconnected a priori, clusters containing these pairs ('pair violations') are split up by walking down the cutting distance until the pair violations are gone.

Playing with max_dist doesn't help much here. Single linkage is a blessing and a curse, it seems. If there were a way for the MST to be blind to any sample that would cause a violation as the tree is built up, that would be of significant utility for tracking software.
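
To make hack (2) above concrete, here is a rough sketch of the "fully connected node" padding; the helper name and hub_distance are purely illustrative on my end, not anything provided by hdbscan or fast_hdbscan:

```python
import numpy as np
import scipy.sparse

def add_fully_connected_node(D, hub_distance):
    """Append a 'hub' sample connected to every point of the symmetric sparse
    distance matrix D at a large distance, so the graph has one component."""
    n = D.shape[0]
    hub_row = scipy.sparse.csr_matrix(np.full((1, n), hub_distance))
    padded = scipy.sparse.vstack([
        scipy.sparse.hstack([D, hub_row.T]),
        scipy.sparse.hstack([hub_row, scipy.sparse.csr_matrix((1, 1))]),
    ])
    return padded.tocsr()
```

After clustering the padded matrix, the last label (the appended hub's) simply gets discarded on my end.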

Thanks again, I'm a big fan of all your projects.

RichieHakim commented 4 weeks ago

@lmcinnes bumping this based on this conversation: https://github.com/scikit-learn-contrib/hdbscan/issues/299.

I will look into existing semi-supervised methods for vanilla HDBSCAN, and into approaches for recovering embedding vectors from sparse distance matrices so that we can try fast_hdbscan. If there is a way to achieve both in one library, we would be very interested. Please let me know if either would benefit from further conversation or resources. Thanks again for these amazing resources.
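
For the embedding-recovery idea, this is roughly what I have in mind; SpectralEmbedding and the exp(-d/scale) affinity are just one possible choice on my part, not anything prescribed by either library:

```python
import numpy as np
import scipy.sparse
from sklearn.manifold import SpectralEmbedding
import fast_hdbscan

def cluster_via_embedding(D, n_components=2, min_cluster_size=5, scale=1.0):
    """D: symmetric scipy.sparse distance matrix; unstored entries = no edge."""
    # Turn stored distances into affinities; unstored entries stay 0 (no edge).
    A = D.copy().tocsr()
    A.data = np.exp(-A.data / scale)

    # Recover low-dimensional coordinates from the affinity graph.
    coords = SpectralEmbedding(n_components=n_components,
                               affinity='precomputed').fit_transform(A)

    # fast_hdbscan expects low-dimensional Euclidean vectors, which coords now are.
    clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    clusterer.fit(coords)
    return clusterer.labels_
```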