gagolews / genieclust

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://genieclust.gagolewski.com

Semi-metric dissimilarities in gclust and differing branch lengths #81

Open mike-kratz opened 1 year ago

mike-kratz commented 1 year ago

Hi there!

I work with ecological data, in particular microbial ecology, and we cannot use Euclidean distance for comparing community dissimilarities (whether using cluster analysis, PCoA, or NMDS), since Euclidean dissimilarities perform poorly when datasets have many zeroes, which is almost always the case with microbial sequencing data. We tend to use the Bray-Curtis dissimilarity (also known as percentage similarity), which is semi-metric and does not obey the triangle inequality. Would genieclust not work for this type of dissimilarity matrix?

Also, when I used genieclust on my environmental data, for which Euclidean distances are fine since it does not have double zeroes, the branch heights were very different from the original Euclidean pairwise distances shown in the output matrix. That is, it showed the groups as having more Euclidean similarity than the original input matrix indicates, while hierarchical clustering with "average" linkage tended to reflect the original values more accurately. See below:

[Image: genieclust dendrogram]

[Image: standard hierarchical clustering with average linkage]

[Image: snapshot of the original Euclidean dissimilarity matrix] (Notice that most pairwise dissimilarities are greater than 1, but the Genie dendrogram shows most of the branch lengths at around 1.)

Thank you for your help,

Mike

gagolews commented 1 year ago

I work with ecological data, in particular microbial ecology, and we cannot use Euclidean distance for comparing community dissimilarities (whether using cluster analysis, PCoA, or NMDS), since Euclidean dissimilarities perform poorly when datasets have many zeroes, which is almost always the case with microbial sequencing data. We tend to use the Bray-Curtis dissimilarity (also known as percentage similarity), which is semi-metric and does not obey the triangle inequality. Would genieclust not work for this type of dissimilarity matrix?

It will work: any symmetric dissimilarity matrix will do the trick - the triangle inequality is not necessary.

You can pass affinity="precomputed" and a distance matrix to Genie (Python) or an object of S3 class dist to gclust (R).
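To make the "semi-metric" point concrete, here is a minimal sketch in plain Python that computes Bray-Curtis dissimilarities for three made-up samples and checks that the triangle inequality fails. The sample values and the helper name `bray_curtis` are illustrative assumptions, not part of genieclust. A symmetric matrix like `D` is the kind of input that, per the reply above, would be passed to Genie with affinity="precomputed" (or converted to a dist object for gclust in R); check the documentation of your installed version for the exact argument names.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


# Three hypothetical samples (counts of two taxa), invented for illustration.
samples = {"x": [1, 0], "y": [0, 1], "z": [1, 1]}
names = list(samples)

# Full symmetric dissimilarity matrix with a zero diagonal.
D = [[bray_curtis(samples[a], samples[b]) for b in names] for a in names]

d_xy, d_xz, d_zy = D[0][1], D[0][2], D[2][1]

# A metric would require d(x, y) <= d(x, z) + d(z, y); here that fails.
violates_triangle = d_xy > d_xz + d_zy
print(d_xy, d_xz, d_zy, violates_triangle)
```

The matrix is still symmetric, which is the only property the reply says is required.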

Also, when I used genieclust on my environmental data, for which Euclidean distances are fine since it does not have double zeroes, the branch heights were very different from the original Euclidean pairwise distances shown in the output matrix. That is, it showed the groups as having more Euclidean similarity than the original input matrix indicates, while hierarchical clustering with "average" linkage tended to reflect the original values more accurately. See below:

Genie doesn't merge clusters in increasing order (with respect to the distance metric) - sometimes it combines small clusters that are farther away from each other. Plotted with the raw merge distances, the dendrogram would be a mess, so I needed to adjust the heights heuristically. This is known as the lack of the ultrametricity property and is actually not specific to Genie; e.g., centroid-based linkage in R's built-in hclust also produces degenerate dendrograms.

The left part of the dendrogram still makes sense (the last, large groups - the coarsest level of granularity), and I would stick to that.
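The non-monotone merge heights mentioned above are easy to reproduce by hand for centroid linkage. The sketch below uses three made-up points (an assumption for illustration only) and plain Python rather than hclust or genieclust: the second merge happens at a smaller height than the first, which is exactly the kind of inversion that makes a dendrogram hard to draw.

```python
import math

# Three hypothetical points forming a nearly equilateral triangle.
A, B, C = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: the closest pair is (A, B), so they merge at their distance.
h1 = min(math.dist(A, B), math.dist(A, C), math.dist(B, C))

# Step 2: centroid linkage measures the distance from C
# to the centroid of the merged cluster {A, B}.
centroid_AB = ((A[0] + B[0]) / 2, (A[1] + B[1]) / 2)
h2 = math.dist(centroid_AB, C)

heights = [h1, h2]
is_monotone = all(a <= b for a, b in zip(heights, heights[1:]))
print(heights, is_monotone)
```

Because the later merge is lower than the earlier one, branch heights in such a tree cannot all be read as the original pairwise distances.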

mike-kratz commented 1 year ago

@gagolews Thank you for addressing those issues, that makes complete sense! I remember seeing that you compared standard validity measures in a paper, but I have not had a chance to read it. Are any of them worthwhile for comparing gclust vs hclust performance? Right now my interpretation is based only on my prior knowledge of the sites.

gagolews commented 1 year ago

People tend to use the Silhouette, the Caliński-Harabasz, and the Dunn index (most often), but, from the perspective of the paper you mentioned [DOI:10.1016/j.ins.2021.10.004] [preprint] I wouldn't recommend relying on any of these measures. 😬 The clusterings they promote are not necessarily valid...

mike-kratz commented 1 year ago

@gagolews Would using something like the cophenetic correlation, plus a scatterplot to visualize the relationship, be a reasonable way to measure how well a chosen linkage/method represents the dissimilarity matrix? I read about that last week and it seems to be a reasonable method.

gagolews commented 1 year ago

It's definitely worth a try!

HtheChemist commented 10 months ago

@mike-kratz I may be late, but have you checked SIMPROF? I believe it could be used to drill down the dendrogram and check whether each cluster has multivariate structure or not.