enjalot / latent-scope

A scientific instrument for investigating latent spaces
MIT License
569 stars, 19 forks

Compare two Clusterings interactively #61

Open enjalot opened 1 month ago

enjalot commented 1 month ago

It would be great to have a page dedicated to comparing the results of two clusterings of the same data.

This StackExchange post has many useful pointers for potential techniques to enable, as well as some of the challenges to consider.

enjalot commented 6 days ago

I see that MTEB evaluates the clustering capability of embedding models using V-measure, comparing a k-means clustering against ground-truth labels: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/AbsTaskClusteringFast.py

V-measure is a metric that evaluates the quality of clustering by comparing the cluster assignments to the true labels. It's the harmonic mean of two other metrics: homogeneity and completeness.

Homogeneity: Measures whether each cluster contains only members of a single class.

Completeness: Measures whether all members of a given class are assigned to the same cluster.

The V-measure ranges from 0 to 1, where 1 indicates perfect clustering.
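To make the definition concrete, here is a minimal from-scratch sketch of the three quantities using the usual entropy-based formulation: homogeneity is 1 - H(C|K)/H(C), completeness is 1 - H(K|C)/H(K), and V-measure is their harmonic mean. The function names (`entropy`, `conditional_entropy`, `v_measure`) are mine for illustration and are not taken from MTEB or latent-scope.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) of a label sequence (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, v_measure) for true classes vs. clusters."""
    h_classes, h_clusters = entropy(classes), entropy(clusters)
    # By convention, a zero-entropy labelling is trivially homogeneous/complete.
    homogeneity = 1.0 if h_classes == 0 else 1.0 - conditional_entropy(classes, clusters) / h_classes
    completeness = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(clusters, classes) / h_clusters
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Same partition with renamed cluster ids: the score does not depend on label names.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
# Everything in one cluster: complete, but not homogeneous at all.
print(v_measure([0, 0, 1, 1], [0, 0, 0, 0]))
```

In practice scikit-learn's `v_measure_score` computes the same quantity, so this is only meant to show what the number is made of.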

Some ideas from Claude for comparing clusterings with different numbers of clusters:

Calculate V-measure: We can still calculate the V-measure between the HDBSCAN clusters and the true labels. The interpretation would be slightly different:

If HDBSCAN finds fewer clusters than true labels, a high V-measure would indicate that the embeddings are grouping semantically similar categories together. In that case completeness can stay high while homogeneity drops.

If HDBSCAN finds more clusters than true labels, a high V-measure would suggest that the embeddings are capturing fine-grained semantic distinctions within categories. In that case homogeneity can stay high while completeness drops.
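A small sketch of the two cases, using scikit-learn's `homogeneity_completeness_v_measure` on hand-made toy labels (the label arrays are invented for illustration, not HDBSCAN output). It suggests that reporting homogeneity and completeness alongside V-measure shows which direction the mismatch goes, since V-measure alone drops in both cases.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Case 1: fewer clusters than labels. Four true classes merged pairwise into two clusters.
true_merged = [0, 0, 1, 1, 2, 2, 3, 3]
pred_merged = [0, 0, 0, 0, 1, 1, 1, 1]

# Case 2: more clusters than labels. Two true classes each split into two clusters.
true_split = [0, 0, 0, 0, 1, 1, 1, 1]
pred_split = [0, 0, 1, 1, 2, 2, 3, 3]

for name, y_true, y_pred in [
    ("merged", true_merged, pred_merged),
    ("split", true_split, pred_split),
]:
    h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
    print(f"{name}: homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")
```

On these toy labels the merged case keeps completeness at 1 with homogeneity at 0.5, and the split case is the mirror image, so the pair of component scores could be a useful thing to surface in a comparison page.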

Additional metrics: We could introduce additional metrics to complement the V-measure:

Adjusted Rand Index (ARI) or Adjusted Mutual Information (AMI), which are also suitable for comparing clusterings with different numbers of clusters

Silhouette score to measure how well-separated the HDBSCAN clusters are

A measure of how close the number of HDBSCAN clusters is to the number of true labels
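A rough sketch of how these complementary metrics might be gathered in one place with scikit-learn. The helper `compare_to_labels` and the toy data are hypothetical, not part of latent-scope. It assumes the HDBSCAN convention of labelling noise points as -1; here noise is excluded from the silhouette score and the cluster count, but left in for ARI and AMI, where it acts as one extra group. That treatment of noise is a design choice that would need deciding.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    silhouette_score,
)


def compare_to_labels(X, cluster_labels, true_labels):
    """Hypothetical helper: summarize how a clustering relates to true labels."""
    X = np.asarray(X)
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)

    clustered = cluster_labels != -1  # mask of non-noise points
    n_clusters = len(set(cluster_labels[clustered]))

    report = {
        "ari": adjusted_rand_score(true_labels, cluster_labels),
        "ami": adjusted_mutual_info_score(true_labels, cluster_labels),
        "n_clusters": n_clusters,
        "n_true_labels": len(set(true_labels)),
        "noise_fraction": float(1 - clustered.mean()),
    }
    # Silhouette is only defined for 2 <= n_clusters <= n_samples - 1.
    if 2 <= n_clusters < clustered.sum():
        report["silhouette"] = float(
            silhouette_score(X[clustered], cluster_labels[clustered])
        )
    return report


# Toy 2D points: two tight groups plus one outlier marked as noise (-1).
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]], dtype=float)
cluster_labels = [0, 0, 0, 1, 1, 1, -1]
true_labels = [0, 0, 0, 1, 1, 1, 1]

report = compare_to_labels(X, cluster_labels, true_labels)
for key, value in report.items():
    print(key, value)
```

Note that the silhouette score needs the embedding vectors (or a distance matrix) and not just the labels, so it describes one clustering on its own, whereas ARI and AMI compare two labelings directly.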