MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

FDR values show huge difference between `multiBlockVar()` and `modelGeneVar()` #46

Closed ahy1221 closed 4 years ago

ahy1221 commented 4 years ago

Hi, I'm trying to update my workflow for the new version of scran (v1.13.30). In the feature selection step, I found that the new function modelGeneVar() produces a very different hypothesis test result compared to multiBlockVar(). multiBlockVar() gives me ~19000 genes below FDR < 0.01, while modelGeneVar() only gives ~300 genes.

library(scran)
# Old approach: per-donor blocking via multiBlockVar(), with the trend fitted to endogenous genes.
out <- multiBlockVar(sce, block = sce$donor, trend.args = list(min.mean = 0.5, use.spikes = FALSE))
# New approach: modelGeneVar() with the same blocking.
out.new <- modelGeneVar(sce, block = sce$donor)
# Compare the two FDR distributions side by side.
par(mfrow = c(1, 2))
hist(out$FDR)
hist(out.new$FDR)

The decomposed biological components are quite consistent between the two, with only slight differences: [image]

I was wondering what the difference between the two functions is. Did you change the hypothesis test?

Thanks,

Yao He

LTLA commented 4 years ago

That's correct, I did change the hypothesis test.

To recap the old behavior: it performed a chi-squared test on the ratio of the total variance to the trend. The assumption here is that the log-values are normally distributed, and the null hypothesis is that an uninteresting gene has variance equal to the trend. Both the assumption and the null are problematic in practice.

For these reasons, I have not trusted the p-values out of any of the trend-fitting functions in scran for quite some time (see comments in https://github.com/LTLA/HVGDetection2018 for details).
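To make the old test concrete, here is a minimal sketch with made-up numbers for a single gene (a hypothetical total variance, trend value and residual degrees of freedom; this is not the exact scran implementation):

# Hypothetical values: one gene's total variance of the log-expression,
# the fitted trend at its mean abundance, and the residual d.f.
total <- 2.0
trend <- 1.2
df <- 99   # e.g. 100 cells, one d.f. used for the mean

# If the log-values were normal and the true variance equalled the trend,
# df * total / trend would follow a chi-squared distribution on df d.f.
p <- pchisq(df * total / trend, df = df, lower.tail = FALSE)
p

With enough cells, df gets large and even modest ratios above 1 give tiny p-values, which is part of the problem described above.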

In the new behavior, we perform a z-test using the location of the trend as the mean and the spread of variances around the trend (i.e., the variance of the variances) as the variance. The z-test is loosely justified by the central limit theorem; recall that the variance is estimated as a sum, so the distribution of each estimate (conditional on its true value) should approach normality as the number of cells increases. This allows us to avoid any distributional assumptions about the log-expression values.

The null hypothesis has also changed slightly; while the mean is the same, the variance of variances is now considered, so we can account for uninteresting effects that introduce more scatter around the mean-variance trend. We're assuming that most genes are not highly variable, but we would have to do that anyway to fit the trend in the first place. We also assume that the true variances are normally distributed around the trend with variance proportional to the mean, but this is mostly tolerable.
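Below is a minimal sketch of a test in this spirit, using simulated total variances, a known trend and a constant spread around the trend; the real modelGeneVar() estimates the trend from the data and lets the spread vary with abundance, so treat this only as an illustration of the idea:

set.seed(100)
# Simulated trend values at each gene's abundance, plus total variances that
# mostly scatter around the trend, with a handful of genuinely variable genes.
tech <- runif(1000, 0.5, 2)
total <- tech + rnorm(1000, sd = 0.2)
total[1:50] <- total[1:50] + 1

# Spread of variances around the trend, assuming most genes are uninteresting.
resid.sd <- sd(total - tech)

# One-sided z-test: is each gene's variance above the trend by more than
# expected from that spread?
z <- (total - tech) / resid.sd
p <- pnorm(z, lower.tail = FALSE)
fdr <- p.adjust(p, method = "BH")
sum(fdr < 0.05)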

These changes are entirely consistent with your observations. With the old behavior, given enough cells, you would detect every gene above the trend as an HVG, which is technically correct but not very useful. With the new behavior, a gene is only detected as an HVG if its variance is significantly higher than the distribution of variances for the rest of the genes, not just higher than the mean of that distribution. This gives you a more focused set that is probably more useful for validation, though note my comments here.

In addition:

ahy1221 commented 4 years ago

Hi Aaron, thank you for the detailed explanation! So what metric would you most recommend now for selecting HVGs before running PCA?

LTLA commented 4 years ago

Check out https://osca.bioconductor.org/feature-selection.html.
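For reference, a sketch of the kind of workflow described there, assuming sce is your SingleCellExperiment from above (the choice of n or of an FDR threshold is a judgement call for your dataset):

library(scran)
library(scater)

dec <- modelGeneVar(sce, block = sce$donor)

# Take the genes with the largest biological components, e.g. the top 2000...
top.hvgs <- getTopHVGs(dec, n = 2000)
# ...or, alternatively, everything significant at a chosen FDR threshold.
# top.hvgs <- getTopHVGs(dec, fdr.threshold = 0.05)

# Restrict the PCA to the selected genes.
sce <- runPCA(sce, subset_row = top.hvgs)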

LTLA commented 4 years ago

I'm going to assume that this issue was resolved.