MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

FDR values show huge difference between `multiBlockVar()` and `modelGeneVar()` #46

Closed ahy1221 closed 4 years ago

ahy1221 commented 4 years ago

Hi, I'm trying to update my workflow for the new version of scran (v1.13.30). In the feature selection step, I found that the new function modelGeneVar() produces a very different hypothesis test result compared to multiBlockVar(). multiBlockVar() gives me ~19000 genes below FDR < 0.01, while modelGeneVar() only gives ~300 genes.

library(scran)
# Old approach: per-donor blocking via multiBlockVar(), with the trend fitted to endogenous genes.
out <- multiBlockVar(sce, block = sce$donor, trend.args = list(min.mean = 0.5, use.spikes = FALSE))
# New approach: modelGeneVar() with the same blocking.
out.new <- modelGeneVar(sce, block = sce$donor)
# Compare the two FDR distributions side by side.
par(mfrow = c(1, 2))
hist(out$FDR)
hist(out.new$FDR)

The decomposed biological components are quite consistent between the two, with only slight differences: [image]

I was wondering what the difference between the two functions is. Did you change the hypothesis test?

Thanks,

Yao He

LTLA commented 4 years ago

That's correct, I did change the hypothesis test.

To recap the old behavior: it performed a chi-squared test on the ratio of the total variance to the trend. The assumption here is that the log-values are normally distributed, and the null hypothesis is that an uninteresting gene has variance equal to the trend. Both the assumption and the null are problematic in practice.

For these reasons, I have not trusted the p-values out of any of the trend-fitting functions in scran for quite some time (see comments in https://github.com/LTLA/HVGDetection2018 for details).
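To make the old test concrete, here is a minimal sketch with made-up numbers for a single gene (a hypothetical total variance, trend value and residual degrees of freedom; this is not the exact scran implementation):

# Hypothetical values: one gene's total variance of the log-expression,
# the fitted trend at its mean abundance, and the residual d.f.
total <- 2.0
trend <- 1.2
df <- 99   # e.g. 100 cells, one d.f. used for the mean

# If the log-values were normal and the true variance equalled the trend,
# df * total / trend would follow a chi-squared distribution on df d.f.
p <- pchisq(df * total / trend, df = df, lower.tail = FALSE)
p

With enough cells, df gets large and even modest ratios above 1 give tiny p-values, which is part of the problem described above.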

In the new behavior, we perform a z-test using the location of the trend as the mean and the spread of variances around the trend (i.e., the variance of the variances) as the variance. The z-test is loosely justified by the central limit theorem; recall that the variance is estimated as a sum, so the distribution of each estimate (conditional on its true value) should approach normality as the number of cells increases. This allows us to avoid any distributional assumptions about the log-expression values.

The null hypothesis has also changed slightly; while the mean is the same, the variance of variances is now considered, so we can account for uninteresting effects that introduce more scatter around the mean-variance trend. We're assuming that most genes are not highly variable, but we would have to do that anyway to fit the trend in the first place. We also assume that the true variances are normally distributed around the trend with variance proportional to the mean, but this is mostly tolerable.
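Below is a minimal sketch of a test in this spirit, using simulated total variances, a known trend and a constant spread around the trend; the real modelGeneVar() estimates the trend from the data and lets the spread vary with abundance, so treat this only as an illustration of the idea:

set.seed(100)
# Simulated trend values at each gene's abundance, plus total variances that
# mostly scatter around the trend, with a handful of genuinely variable genes.
tech <- runif(1000, 0.5, 2)
total <- tech + rnorm(1000, sd = 0.2)
total[1:50] <- total[1:50] + 1

# Spread of variances around the trend, assuming most genes are uninteresting.
resid.sd <- sd(total - tech)

# One-sided z-test: is each gene's variance above the trend by more than
# expected from that spread?
z <- (total - tech) / resid.sd
p <- pnorm(z, lower.tail = FALSE)
fdr <- p.adjust(p, method = "BH")
sum(fdr < 0.05)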

These changes are entirely consistent with your observations. With the old behavior, given enough cells, you would detect every gene above the trend as an HVG, which is technically correct but not very useful. With the new behavior, a gene is only detected as an HVG if its variance is significantly higher than the distribution of variances for the rest of the genes, not just higher than the mean of that distribution. This gives you a more focused set that is probably more useful for validation, though note my comments here.

In addition:

ahy1221 commented 4 years ago

Hi Aaron, thank you for the detailed explanation! So what metric would you most recommend now for selecting HVGs before running PCA?

LTLA commented 4 years ago

Check out https://osca.bioconductor.org/feature-selection.html.
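For reference, a sketch of the kind of workflow described there, assuming sce is your SingleCellExperiment from above (the choice of n or of an FDR threshold is a judgement call for your dataset):

library(scran)
library(scater)

dec <- modelGeneVar(sce, block = sce$donor)

# Take the genes with the largest biological components, e.g. the top 2000...
top.hvgs <- getTopHVGs(dec, n = 2000)
# ...or, alternatively, everything significant at a chosen FDR threshold.
# top.hvgs <- getTopHVGs(dec, fdr.threshold = 0.05)

# Restrict the PCA to the selected genes.
sce <- runPCA(sce, subset_row = top.hvgs)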

LTLA commented 4 years ago

I'm going to assume that this issue was resolved.