Open RichieHakim opened 1 year ago
Sparse precomputed distance matrices have some quirks that make corner cases a little tricky. I'll see if I can get something done, though I can't promise any timeframes.
On Sun, May 7, 2023 at 11:27 AM Richard Hakim @.***> wrote:
I'm currently using vanilla HDBSCAN to cluster a precomputed sparse distance matrix passed in as a scipy.sparse.csr_matrix object. I'm very eager to use fast_hdbscan, primarily because of its easier compilation requirements: I'm trying to ship a tool that uses hdbscan as one step in a pipeline.
As far as I can tell, clustering on precomputed sparse distance matrices is not currently supported in fast_hdbscan. I think supporting it would require porting some of the following functions from hdbscan:
- hdbscan_._hdbscan_sparse_distance_matrix
- _hdbscan_reachability.sparse_mutual_reachability
- _hdbscan_linkage.label
Unfortunately, I don't think I can figure out how to implement this myself, but I'm happy to help test any PRs with basic implementations. Thank you for a great package; I really hope I'll be able to use it soon!
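To make the scope of such a port a bit more concrete, here is a minimal sketch of the mutual-reachability step on a sparse input. It is not the hdbscan implementation. It works directly on the three CSR arrays (data, indices, indptr, which scipy.sparse.csr_matrix exposes as attributes), and the function names, the choice of the min_samples-th smallest stored distance as the core distance, and the use of infinity for rows with too few stored entries are all assumptions made for illustration.

```python
import math


def core_distances_csr(data, indptr, min_samples):
    """Core distance per row: the min_samples-th smallest stored distance.

    Assumption: self-distances (zeros) are not stored, so only explicit
    neighbours count. Rows with too few stored entries get infinity.
    """
    core = []
    for i in range(len(indptr) - 1):
        row = sorted(data[indptr[i]:indptr[i + 1]])
        core.append(row[min_samples - 1] if len(row) >= min_samples else math.inf)
    return core


def mutual_reachability_csr(data, indices, indptr, min_samples=2):
    """Return mutual-reachability values aligned with the CSR `data` array.

    For a stored entry (i, j) with distance d, the value is
    max(d, core[i], core[j]). The sparsity pattern is left unchanged.
    """
    core = core_distances_csr(data, indptr, min_samples)
    out = list(data)
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            out[p] = max(data[p], core[i], core[j])
    return out, core


# Symmetric 3-point example: d(0,1)=1, d(0,2)=3, d(1,2)=2.
data = [1.0, 3.0, 1.0, 2.0, 3.0, 2.0]
indices = [1, 2, 0, 2, 0, 1]
indptr = [0, 2, 4, 6]
mr, core = mutual_reachability_csr(data, indices, indptr, min_samples=2)
```

Rows with fewer than min_samples stored distances end up with infinite values in this sketch, which is plausibly one of the corner cases a real port would have to decide how to handle.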
Thank you so much for looking into this. I am very motivated to help if you think it's possible to delegate anything. For what it's worth, this is how hdbscan is being used in the project I'm working on: https://github.com/RichieHakim/ROICaT/blob/dev/roicat/tracking/clustering.py#L420
It may be of interest to describe the tricks/hacks I'm using to get the desired behavior:
1) I'm using a very custom sparse distance matrix as input.
2) Since the graph has multiple disconnected components, I need to add a node that is connected to every other node before clustering.
3) Since some sample pairs are known a priori to be disconnected, clusters containing these pairs ('pair violations') are split up by walking down the cutting distance until the pair violations are gone.
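For anyone trying to picture tricks 2 and 3, here is a rough, dependency-free sketch. It is not the ROICaT code: it uses a plain edge list of (i, j, weight) tuples instead of a sparse matrix, and the helper names (add_bridge_node, components_at_cut, split_until_valid) and the bridge weight are hypothetical choices made only for illustration.

```python
def add_bridge_node(n, edges, weight):
    """Trick 2: append node `n`, linked to every existing node at `weight`.

    Assumption: `weight` is larger than any real distance, so the extra
    node only serves to make the graph connected.
    """
    return list(edges) + [(i, n, weight) for i in range(n)]


def components_at_cut(n, edges, cut):
    """Single-linkage components using only edges with weight <= cut."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, w in edges:
        if w <= cut:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def split_until_valid(n, edges, cannot_link):
    """Trick 3: lower the cut until no cannot-link pair shares a component.

    Returns (cut, labels). If every candidate cut still leaves a violation,
    returns (None, singleton labels).
    """
    for cut in sorted({w for _, _, w in edges}, reverse=True):
        labels = components_at_cut(n, edges, cut)
        if all(labels[a] != labels[b] for a, b in cannot_link):
            return cut, labels
    return None, list(range(n))


# Chain 0-1-2-3 where samples 0 and 3 are known to be different objects.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.5)]
cut, labels = split_until_valid(4, edges, cannot_link=[(0, 3)])
```

In this toy chain the walk stops at the first cut that drops the 1-2 edge, leaving {0, 1} and {2, 3} as separate groups.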
Playing with max_dist doesn't help much here. Single linkage is a blessing and a curse, it seems. If there were a way for the MST to be blind to any sample that would cause a violation as the tree is built up, that would be of significant utility for tracking software.
Thanks again, I'm a big fan of all your projects.
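The idea of an MST that is "blind" to violating samples could look roughly like the sketch below: Kruskal's algorithm with a union-find, where an edge is skipped if merging its two components would put a cannot-link pair in the same component. This is a hypothetical illustration, not something hdbscan or fast_hdbscan provides. The function name and the greedy skip rule are assumptions, and the result is a spanning forest rather than a tree.

```python
def constrained_kruskal(n, edges, cannot_link):
    """Greedy Kruskal that refuses merges violating cannot-link pairs.

    edges: iterable of (i, j, weight); cannot_link: iterable of (a, b).
    Returns (forest_edges, labels), where labels[i] is the component root.
    """
    parent = list(range(n))
    members = {i: {i} for i in range(n)}       # points in each component
    forbidden = {i: set() for i in range(n)}   # points barred from each component
    for a, b in cannot_link:
        forbidden[a].add(b)
        forbidden[b].add(a)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for i, j, w in sorted(edges, key=lambda e: e[2]):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already connected
        if forbidden[ri] & members[rj]:
            continue  # merge would create a pair violation: skip this edge
        if len(members[ri]) < len(members[rj]):
            ri, rj = rj, ri  # merge the smaller component into the larger
        parent[rj] = ri
        members[ri] |= members.pop(rj)
        forbidden[ri] |= forbidden.pop(rj)
        forest.append((i, j, w))
    return forest, [find(i) for i in range(n)]


# Same chain as a tracking toy case: 0 and 3 must never share a cluster.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.5)]
forest, labels = constrained_kruskal(4, edges, cannot_link=[(0, 3)])
```

Because the skip rule is greedy, the outcome depends on the order in which edges are considered, so whether the resulting forest yields a sensible HDBSCAN hierarchy is an open question rather than a settled design.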
@lmcinnes bumping this based on this conversation: https://github.com/scikit-learn-contrib/hdbscan/issues/299.
I will look into existing semi-supervised methods for vanilla HDBSCAN, and I will look into approaches to recover / convert to embedding vectors from sparse distance matrices so that we can try fast_hdbscan. If there is a way to achieve both in one library, we are very interested. Please let me know if either would benefit from further conversation or resources. Thanks again for these amazing resources.