LTLA / bluster

Clone of the Bioconductor repository for the bluster package.
https://bioconductor.org/packages/devel/bioc/html/bluster.html

Robustness to Null Dataset Problem #20

Closed DarioS closed 6 months ago

DarioS commented 6 months ago

Is there any summary statistic in bluster that is known to be robust to the null dataset problem?

While sub-clustering cell-populations has become popular in single cell-omics, negative controls for this process are lacking. Popular feature-selection and clustering algorithms fail the null-dataset problem, allowing erroneous subdivisions of homogenous clusters until nearly each cell is its own cluster. Using real and synthetic datasets, anti-correlated gene selection is found to reduce or eliminate erroneous subdivisions, increases marker-gene selection efficacy, and efficiently scales to millions of cells.

Source: Anti-correlated Feature Selection Prevents False Discovery of Subpopulations in scRNA-seq, Nature Communications

Perhaps it would be nice to have a SingleCellExperiment-friendly version of this algorithm somewhere in Bioconductor.

LTLA commented 6 months ago

Many papers have been written about detection of overclustering, so it's a pretty well-studied problem. I daresay that most of these papers miss the mark, though, because they don't consider the real scientific question.

tl;dr A homogeneous cell type can still have interesting subclusters.

Consider a cell type whose members follow a multivariate normal (MVN) distribution in the expression space (or PC space, or whatever space you care to think of). I think we could both agree that this could be described as "homogeneous": there aren't any clear subclusters and it's a smooth gradient of density in any direction of travel. However, I would argue that the structure inside this cluster could very well be biologically interesting if, say, an axis of significant variation was associated with some relevant pathway. In such cases, it would make sense to at least try to subcluster and see what you find. If you stop at "oh it's homogeneous", you would never be able to interrogate the heterogeneity within each cell type.

(One could say that it would be better to use trajectory inference for these continuous changes. This is fair enough but it's sometimes hard to figure out when to switch from clusters to trajectories if you don't already know it's continuous. So you usually need at least one subclustering step before you decide that it's continuous enough to switch.)
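To make the MVN example concrete, here is a minimal sketch in plain Python (not bluster code; the data, the `two_means` helper and the choice of a 3:1 spread are all assumptions made up for illustration). It draws a single homogeneous 2-D Gaussian "cell type" whose first axis varies more than the second, then runs an ordinary 2-means split on it.

```python
import random

def two_means(points, iters=100):
    """Plain Lloyd's algorithm with k=2 on a list of (x, y) tuples."""
    # Deterministic start: the points with the smallest and largest x.
    centers = [min(points), max(points)]
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda j: (p[0] - centers[j][0]) ** 2
                                      + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        # Recompute each center as the mean of its members.
        updated = []
        for j in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:  # keep the old center if a cluster empties
                updated.append(centers[j])
                continue
            updated.append((sum(m[0] for m in members) / len(members),
                            sum(m[1] for m in members) / len(members)))
        if updated == centers:
            break
        centers = updated
    return centers, labels

# One homogeneous population: no subclusters, just more spread along x than y.
rng = random.Random(0)
cells = [(rng.gauss(0, 3), rng.gauss(0, 1)) for _ in range(1000)]

centers, labels = two_means(cells)
sizes = [labels.count(0), labels.count(1)]
print("cluster sizes:", sizes)
print("centers:", centers)
```

The split comes out roughly balanced and the two centers differ mainly along the high-variance axis. Whether that partition is an "erroneous subdivision" or a useful handle on the dominant axis of variation depends on what that axis means biologically, which the clustering itself cannot tell you.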

A long time ago, I tried using some metrics (WCSS, Rand index, modularity ratios) to see if I could automatically determine the appropriate number of clusters. I don't remember the exact results but I do remember being disappointed because I ended up with too-broad clusters, as that was the only thing that the various methods were confident in. Moreover, each of the methods had its own tunable parameters and thresholds, so in the end I was just trading one parameter (the number of clusters) for some other parameters without any clear benefit in interpretation.
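As a rough illustration of the "trading one parameter for another" point, the sketch below computes WCSS for k = 1 to 5 on a single 1-D Gaussian population. This is a toy in plain Python, not the analysis described above; `wcss_for_k`, the quantile initialization and the simulated data are assumptions for the example.

```python
import random

def wcss_for_k(values, k, iters=100):
    """Run 1-D Lloyd's k-means and return the within-cluster sum of squares."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: centers at evenly spaced quantiles.
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: (v - centers[j]) ** 2)
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    # Squared distance from every value to its nearest final center.
    return sum(min((v - c) ** 2 for c in centers) for v in ordered)

# A null dataset: one population, no real clusters.
rng = random.Random(1)
null_cells = [rng.gauss(0, 1) for _ in range(1000)]

wcss = {k: wcss_for_k(null_cells, k) for k in range(1, 6)}
for k in sorted(wcss):
    print(k, round(wcss[k], 1))
```

WCSS keeps dropping as k grows even though there is nothing to find, so picking k from this curve requires deciding how much of a drop counts as "enough". That cutoff is just another tunable threshold.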

I think the fundamental issue is that there isn't a clean mathematical way of expressing that some level of heterogeneity is biologically uninteresting in order to stop the subclustering. I might stop if my subclusters are all related to cell cycle, or metabolic activity, or some other boring thing, but others might get very excited by those same partitions, so who am I to say whether they should use those subclusters? A true "hard limit" of overclustering is when you start dropping below technical variation (e.g., the Poisson noise from sequencing), at which point you can confidently say that you've jumped the shark. But it takes a lot, like a lot, of overclustering to get to that point, so it's mostly a useless threshold.
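One way to picture that hard limit, as a sketch only: for pure Poisson noise the variance of a gene's counts equals its mean, so the variance-to-mean ratio within a cluster sits near 1 once no biological variation is left. The check below is an assumption for illustration, not something bluster provides; the `poisson` sampler, the `variance_to_mean` helper and the rates are made up.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def variance_to_mean(counts):
    """Sample variance divided by sample mean; about 1 for pure Poisson noise."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return var / mean

rng = random.Random(2)

# One gene in a cluster where every cell has the same underlying rate:
# only technical (Poisson) noise remains.
pure = [poisson(rng, 5.0) for _ in range(5000)]

# The same gene in a cluster that still mixes two expression levels.
mixed = ([poisson(rng, 2.0) for _ in range(2500)]
         + [poisson(rng, 8.0) for _ in range(2500)])

print("pure noise ratio:", round(variance_to_mean(pure), 2))
print("mixed ratio:", round(variance_to_mean(mixed), 2))
```

A ratio close to 1 across genes would suggest there is nothing left to split, but as noted above, clusters have to become very fine before they reach that point.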

In practice, people will always overcluster to see if there's anything interesting as they keep digging. Which is fine, it's all exploratory anyway, no one's really making quantitative claims here. Nonetheless, if you want to implement this method, I'd suggest making your own package; it seems pretty involved and I don't want to be on the hook to maintain it.

DarioS commented 6 months ago

Interesting perspective.