diffeo / kodama

Fast hierarchical agglomerative clustering in Rust.
MIT License
90 stars 11 forks source link

Add Medoid linkage #12

Open kno10 opened 4 months ago

kno10 commented 4 months ago

Medoid linkage (there are actually multiple with this name, unfortunately) is inspired by the well-known k-meodids clustering concept (Partitioning Around Medoids, PAM), but in a hierarchical way instead of fixing the number of clusters beforehand.

Every cluster is represented by a medoid, which minimizes the distance sum. $$\min{c\in A\cup B} \sum{o\in A\cup B} d(c,o)$$

Two versions are proposed in the article below, the second uses the change, similar to how Ward linkage uses the change in sum of squares. $$\min{c\in A\cup B} \sum{o\in A\cup B} d(c,o) - \min{c\in A} \sum{o\in A} d(c,o) - \min{c\in B} \sum{o\in B} d(c,o)$$

Schubert, Erich (2021).
HACAM: Hierarchical Agglomerative Clustering Around Medoids – and its Limitations.
LWDA’21: Lernen, Wissen, Daten, Analysen September 01–03, 2021, Munich, Germany.

Another "medoid" linkage was proposed earlier based on median/centroid linkage, as the distance of the medoids of the clusters, $d(m_A, m_B)$. In my opinion this is not well defined, because the medoids $m_A$, $m_B$ are not uniquely determined. Simply consider clusters of two points, then either point can be the medoid. Given the integers 1,2,3,4, clustered into (1,2) and (3,4), the distance of the medoids can be 1,2, or 3. A more stable, but also more costly, computation would be the minimum over the pairwise distances of all medoid candidates. Above version does not have this issue, because it does not need the medoid $c$ to be unique.

Miyamoto, Sadaaki; Kaizu, Yousuke; Endo, Yasunori (2016).
Hierarchical and Non-Hierarchical Medoid Clustering Using Asymmetric Similarity Measures.
2016 Joint 8th International Conference on Soft Computing and Intelligent Systems (SCIS) and 17th International Symposium on Advanced Intelligent Systems (ISIS). pp. 400–403

Herr, Dominik; Han, Qi; Lohmann, Steffen; Ertl, Thomas (2016).
Visual Clutter Reduction through Hierarchy-based Projection of High-dimensional Labeled Data.
Graphics Interface. Graphics Interface.

BurntSushi commented 4 months ago

Hiya! I'm not sure this library will be receiving new features. If you'd like to see new linkages added (including the others you've requested), I'd suggest forking the library.