MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

Silhouette index from bootstrapCluster() #58

Closed iwillham closed 4 years ago

iwillham commented 4 years ago

Hi there, I appreciate all of the hard work on this wonderful package.

I'm interested in generating a summary measure of cluster stability for a given clustering method with a set of parameters. Is there a way you would recommend summarizing the results of bootstrapCluster() in a single numerical measure?

I was thinking about using the coassignment probability matrix generated by bootstrapCluster() to calculate a pseudo silhouette index for each cluster and then averaging them across all clusters. Is this a reasonable approach?

Thanks!

LTLA commented 4 years ago

Is there a way you would recommend summarizing the results of bootstrapCluster() in a single numerical measure?

A single value does not seem particularly helpful to me, but if you insist, then the "best" way of doing this would be to compute the Rand index between the bootstraps and the original clustering:

# Not tested, you'll have to fiddle with this.
# 'x' is the matrix with cells as columns, 'FUN' is the clustering function,
# and 'original' holds the cluster labels from clustering the full 'x'.
out <- numeric(iterations)
for (i in seq_len(iterations)) {
    chosen <- sample(ncol(x), ncol(x), replace = TRUE)
    resampled <- x[, chosen, drop = FALSE]
    reclusters <- FUN(resampled)
    # Compare against the original labels of the *resampled* cells,
    # so that the two label vectors refer to the same cells.
    out[i] <- fossil::rand.index(as.integer(original[chosen]),
                                 as.integer(reclusters)) # or adj.rand.index
}

This will give you a vector of Rand indices that can be averaged to give you a single number. From a theoretical perspective, the Rand index would probably be what I would end up with if I was forced to condense the coassignment probabilities into a single value. Having said that, I wouldn't be able to tell you what value for the Rand index would represent a "stable" clustering.

I was thinking about using the coassignment probability matrix generated by bootstrapCluster() to calculate a pseudo silhouette index for each cluster and then averaging them across all clusters.

I struggle to imagine how you would be able to do that. The silhouette width is an entirely different concept from the coassignment probabilities, given that the former is computed on a per-cell basis while the latter is computed per pair of clusters.

iwillham commented 4 years ago

@LTLA I really appreciate the detailed response.

The reason I am interested in a single value is because I'd like to determine the effects of varying SNN graph construction parameters (k, weighting methods) and clustering methods on cluster stability. It seemed that having a single value describing cluster stability would allow me to search and compare this parameter space in a more high throughput manner than looking at coassignment probability heatmaps for each individual graph/clustering combination.
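To give a sense of the sweep I have in mind, something roughly like this (a sketch only; the grid values and the stability() summary function are placeholders, and scran::buildSNNGraph / igraph::cluster_walktrap are just one possible graph/clustering combination):

```r
library(scran)
library(igraph)

# Hypothetical parameter sweep; 'sce' is a SingleCellExperiment and
# 'stability' is whatever single-number summary we settle on.
params <- expand.grid(k = c(5, 10, 20),
                      type = c("rank", "number", "jaccard"),
                      stringsAsFactors = FALSE)
results <- numeric(nrow(params))
for (p in seq_len(nrow(params))) {
    g <- buildSNNGraph(sce, k = params$k[p], type = params$type[p])
    clusters <- cluster_walktrap(g)$membership
    results[p] <- stability(sce, clusters)   # placeholder summary function
}
```

A single value per row of the grid would then let me rank parameter combinations directly instead of eyeballing one heatmap per combination.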

I struggle to imagine how you would be able to do that. The silhouette width is an entirely different concept from the coassignment probabilities, given that the former is computed on a per-cell basis while the latter is computed per pair of clusters.

My apologies, I did not explain myself well and butchered the silhouette index in the process. The idea was to calculate the strength of the diagonal on the coassignment probability matrix and use that as an index of overall cluster stability. To do this, I would use a form of the silhouette width calculation on a per-cluster (rather than per-cell) basis. According to https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/silhouette this would essentially give me a scaled difference in coassignment probabilities between a) self and b) the cluster with the next highest coassignment probability. Something like this:

a <- coassign[i, i]                              # self-coassignment probability
b <- sort(coassign[i, ], decreasing = TRUE)[2]   # next-highest coassignment probability
(a - b) / max(a, b)                              # pseudo-silhouette index for cluster i

I would then repeat this for each cluster and average across all clusters to get an index of cluster stability. Thank you for alerting me to the Rand index, it seems like this would be a more statistically appropriate measure of clustering stability for my purpose.
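Repeating that for every cluster and averaging would look something like this (untested sketch, assuming coassign is the full square coassignment probability matrix from bootstrapCluster() with self-probabilities on the diagonal):

```r
# 'coassign' is the square coassignment probability matrix from bootstrapCluster().
pseudo_sil <- vapply(seq_len(nrow(coassign)), function(i) {
    a <- coassign[i, i]                             # self-coassignment probability
    b <- sort(coassign[i, ], decreasing = TRUE)[2]  # next-highest probability
    (a - b) / max(a, b)
}, numeric(1))

mean(pseudo_sil)   # single stability value, averaged across clusters
```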

LTLA commented 4 years ago

The main issue with the Rand index (and other metrics like the cluster modularity) is that it is weighted towards the behavior of clusters with many cells. In the case of the Rand index, if the large clusters are well-preserved, it doesn't matter what the remaining clusters do. Join together, fragment into many tiny clusters, whatever; they just won't have a major impact on the value of the index.

Now, maybe that's a feature, but most likely it's undesirable if you really want to catch differences. A more stringent measure of stability that is more aware of small clusters may be worth considering: for each cluster, take the difference between its self coassignment probability and its highest coassignment probability with any other cluster, and then take the minimum of these differences across all clusters.

If your smallest difference is big, then the chosen clustering parameters do well at yielding stable clusters that are preserved by bootstrapping. If your smallest difference is small (or negative), then at least one cluster fails to reproduce well, and that's enough to sink it. Quite aggressive, but it allows you to make strong statements about any set of clustering parameters that yield a large difference.
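A sketch of that per-cluster self-minus-other minimum (untested; assuming coassign is the full square coassignment matrix with self-probabilities on the diagonal):

```r
# 'coassign' is the coassignment probability matrix from bootstrapCluster().
self <- diag(coassign)
other <- vapply(seq_len(nrow(coassign)), function(i) {
    max(coassign[i, -i])   # highest coassignment with any *other* cluster
}, numeric(1))

min(self - other)          # stability is set by the worst-behaved cluster
```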

iwillham commented 4 years ago

You're only as good as your least stable cluster... I like it.

I noticed that, in the summarized per-cluster coassignment probabilities, self and other sometimes sum to > 1. In trying to wrap my head around this, I came to the conclusion that the self coassignment probability describes the propensity of cells to stay in that cluster in the bootstrap replicate: a high self probability indicates that cells are unlikely to leave that cluster during bootstrapping. On the other hand, the other coassignment probability describes the propensity of cells from other clusters to enter the cluster during bootstrapping. Since entering and leaving are independent events, the probabilities don't necessarily need to sum to 1.

Is this the right (albeit highly simplified) way to think about it?

LTLA commented 4 years ago

Close enough. The self probability doesn't even know about the other (original) clusters; it only cares about whether the cells from the same original cluster stick together in the bootstrap clusters. As such, there's no obvious relationship between the self and other probabilities - the latter considers the behavior of other clusters, the former doesn't, so they're looking at different events. For example, if two original clusters always merge into a single bootstrap cluster, each has a self probability of 1 (its cells stick together) while the other probability between them is also 1 (their cells are always coassigned).
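A toy sketch of that situation (my own illustration, not scran output), treating coassignment of two cells as simply sharing a bootstrap cluster:

```r
# Coassignment of two cells: do they share a bootstrap cluster?
coassigned <- function(i, j, boot) boot[i] == boot[j]

original  <- c(1, 1, 2, 2)   # two original clusters of two cells each
bootstrap <- c(1, 1, 1, 1)   # a bootstrap replicate that merges everything

# Self probability for original cluster 1: its cells (1 and 2) stick together.
coassigned(1, 2, bootstrap)  # TRUE, so self = 1
# Other probability between clusters 1 and 2: e.g., cells 1 and 3 coassigned.
coassigned(1, 3, bootstrap)  # TRUE, so other = 1 as well; self + other = 2
```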

You may have to tinker with (i) the manner of the difference between self and other per cluster (e.g., subtraction or ratio?) and (ii) whether you use some kind of quantile on the differences as a "robust minimum", e.g., picking the 20th percentile of the differences to reflect an expectation that 80% of the clusters are reproducible. Otherwise, if a clustering method successfully reproduces 99% of its clusters, judging it by the one cluster that doesn't reproduce is pretty harsh.
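The quantile version of the "robust minimum" is a one-liner (sketch; diffs stands for the per-cluster self-minus-other differences):

```r
# 'diffs' holds the per-cluster (self - other) differences.
diffs <- c(0.9, 0.85, 0.8, 0.75, -0.1)   # toy values: one badly unstable cluster

min(diffs)             # strict version: the single bad cluster sinks everything
quantile(diffs, 0.2)   # 20th percentile: tolerates instability in up to 20% of clusters
```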

iwillham commented 4 years ago

Thank you for the intuitive explanation, this is super helpful.

I think subtraction will be a more robust measure of the difference between self and other, because very low values of other, which are quite common, can blow up the ratio. The quantile approach seems reasonable, although I'm trying to understand why using a robust minimum (e.g., the 20th percentile) would be preferable to, say, the median. Is it because it captures more of the data? I.e., 80% of the clusters have higher stability values than this, vs. only 50% of clusters having higher stability than the median.

I was also considering weighting the self-other difference by the number of cells in the original cluster, although there doesn't seem to be any obvious relationship between cluster size and stability, at least in this data set.

LTLA commented 4 years ago

FWIW the rand branch now implements a Rand index calculation in clusterRand. This can be used in place of the coassignment probabilities in bootstrapCluster, but do read the documentation. You'll also have to install the scuttle package (see https://github.com/LTLA/scuttle) to use this branch.

I.e. 80% of the clusters have higher stability values than this vs. 50% of clusters have higher stability than this (median).

Yes. Or whatever value you perceive to be "reasonable", e.g., 90% or 95%.

iwillham commented 4 years ago

Sounds good, I'll check it out. Thanks for the help!