TutteInstitute / fast_hdbscan

A fast multi-core implementation of HDBSCAN for low dimensional Euclidean spaces
BSD 2-Clause "Simplified" License
95 stars, 8 forks

Distance to clusters #22

Open firmai opened 3 months ago

firmai commented 3 months ago

Is there any way to obtain this directly from the library? It would be an extremely valuable piece of information for me. Given the speed of the algorithm, it could be a nice addition.

hamelin commented 3 months ago

One key question that your suggestion leaves unclear is: which distance? :-) A cluster is a cloud of points, and your question does not make it clear whether you would like to know the distance between a certain point and the various clusters, or the distance between the clusters themselves. In both cases there is yet more to ask.

  1. Distance between point $P$ and clusters: do you mean the distance to the centroid (vector mean) of each cluster? To the medoid (the cluster member with the smallest total distance to the other members)? To the nearest or furthest point belonging to the cluster?
  2. Distance between clusters: do you mean the distance between single representative points (again, centroid, medoid, and so on), or a collective similarity such as Wasserstein distance?

Some of these use cases are easily covered by some simple NumPy/scikit-learn-fu; others are very tricky. Let us know what would help you best.
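To make option (1) concrete, here is a minimal NumPy sketch of two of the point-to-cluster distances listed above: distance to each cluster's centroid and distance to its nearest member. It is not part of fast_hdbscan; the helper name `point_to_cluster_distances` and the toy data are hypothetical. It only assumes you already have the clustered points and a label array in which `-1` marks noise, as HDBSCAN implementations conventionally report it.

```python
import numpy as np

def point_to_cluster_distances(points, data, labels):
    """Distances from query points to each cluster (hypothetical helper).

    points: (m, d) query points
    data:   (n, d) points that were clustered
    labels: (n,) cluster labels, with -1 assumed to mark noise
    Returns (cluster_ids, to_centroid, to_nearest), both distance
    arrays having shape (m, number_of_clusters).
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)

    # Ignore noise points when listing clusters.
    cluster_ids = np.unique(labels[labels >= 0])
    to_centroid = np.empty((points.shape[0], cluster_ids.size))
    to_nearest = np.empty_like(to_centroid)

    for j, c in enumerate(cluster_ids):
        members = data[labels == c]
        centroid = members.mean(axis=0)
        # Euclidean distance from every query point to the centroid.
        to_centroid[:, j] = np.linalg.norm(points - centroid, axis=1)
        # All query-to-member distances, then the minimum per query point.
        diffs = points[:, None, :] - members[None, :, :]
        to_nearest[:, j] = np.linalg.norm(diffs, axis=2).min(axis=1)

    return cluster_ids, to_centroid, to_nearest

# Toy data: two small clusters and one noise point.
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0], [50.0, 50.0]])
labels = np.array([0, 0, 1, 1, -1])
ids, d_centroid, d_nearest = point_to_cluster_distances([[3.0, 1.0]], data, labels)
```

The nearest-member computation builds an m × k × d array per cluster, which is fine for small inputs; for large clusters a pairwise-distance routine that works in chunks, or a spatial index, would be a better fit.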

firmai commented 3 months ago

What would be great is some functionality to get access to the vectors that would allow me to calculate (1) and (2). My specific use case is looking at belonging, that is, the distance from a single point to every cluster centre (mean, median, etc.). I am interested in tracking movements from one cluster to another. Would the implementation allow for that?

Specifically, I am looking at stocks and have realised that the cluster label flips every so often; it would be good to track that transition as a continuous value.
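One possible way to turn centroid distances into a continuous "belonging" value is sketched below. It applies a softmax to the negative distances, which is an assumption made here for illustration and not something fast_hdbscan provides or the thread settles on. The function name `soft_membership`, the `temperature` parameter, and the toy centroids are all hypothetical.

```python
import numpy as np

def soft_membership(centroid_distances, temperature=1.0):
    """Map (m, k) centroid distances to rows of weights that sum to 1.

    Smaller distance gives larger weight. `temperature` (hypothetical
    tuning knob) controls how sharply the weights favour the closest
    cluster.
    """
    d = np.asarray(centroid_distances, dtype=float)
    scores = -d / temperature
    # Subtract the row maximum for numerical stability.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy example: one point drifting from cluster 0 towards cluster 1.
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
path = np.array([[1.0, 1.0], [5.0, 1.0], [9.0, 1.0]])
dists = np.linalg.norm(path[:, None, :] - centroids[None, :, :], axis=2)
membership = soft_membership(dists, temperature=2.0)
```

Each row of `membership` is one time step. A hard label flip then shows up as the weights crossing over gradually, with equal weights at the midpoint between the two centroids.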

firmai commented 3 months ago

Let me know what you think @hamelin ?

hamelin commented 3 months ago

Sorry I've been slow to respond; I'm on vacation. Let me get back to you after I've checked a few things.