P-values returned for low read counts sometimes seem too significant

mschubert commented 3 years ago

Hi again,

I'm using glmGamPoi_1.3.3 via DESeq2_1.31.3 for 10x sc-RNA-seq data. For one comparison, I have 4096 control cells and 6 cells in my condition (the imbalance is, unfortunately, unavoidable). For a couple of genes where my "condition" cells have no reads, I find the p-values returned are very low. For instance, my gene1 below:

# DESeq2+glmGamPoi p=1e-137
table(counts(eset2)["gene1",-which(cur$is_smp)])
#    0    1    2    3    4    5    6    7
# 1901 1346  553  202   64   20    7    3
table(counts(eset2)["gene1",which(cur$is_smp)])
# 0
# 6

Naively, I would expect that the probability of picking 6 cells with 0 reads out of the distribution of my control cells would be the fraction of zero-read cells to the power to cells, i.e.:

1901/(1901+1346+553+202+64+20+7+3)
# 0.4641113
0.4641113^6
# 0.009993854 # not significant after multiple testing correction

Now, this is of course a probability and not a p-value (although they coincide in this case), and a binomial distribution instead of a gamma mixture of poissons.

But should the significance measure be that different?

const-ae commented 3 years ago

Hey, that is an interesting example. I produced a reprex to make it a bit clearer what is happening:

library(glmGamPoi)
y_ctrl <- c(rep(0, 1901), rep(1, 1346), rep(2, 553), rep(3, 202), rep(4, 64), rep(5, 20), rep(6, 7), rep(7, 3))
y_trt <- rep(0, 6)

label <- c(rep("ctrl", length(y_ctrl)), rep("trt", length(y_trt)))
y <- c(y_ctrl, y_trt)

fit <- glm_gp(y ~ label - 1)
test_de(fit, labelctrl - labeltrt)
#>    name        pval    adj_pval f_statistic df1 df2       lfc
#> 1 row_1 0.002529886 0.002529886    9.118852   1 Inf 144269504

^{Created on 2020-11-24 by the reprex package (v0.3.0.9001)}

As you can see, if you do the test only on this particular gene, the p-value is actually roughly in line with your binomial-distribution based calculation.

The interesting question now is, why is the estimate so different in your case:

The size_factor can have a large impact on the result and change it by some magnitudes:

sf <- c(rep(1, length(y_ctrl)), rep(10, length(y_trt)))
fit2 <- glm_gp(y ~ label - 1, size_factors = sf)
test_de(fit2, labelctrl - labeltrt)
#>    name         pval     adj_pval f_statistic df1 df2       lfc
#> 1 row_1 6.350626e-13 6.350626e-13    51.73543   1 Inf 144269504

The overdispersion shrinkage might influence the result, but shouldn't play such a big role
There was some convergence problem. However, you mentioned earlier that you had updated to the latest version, so this should be (hopefully) no problem
Something else... I think to really understand what is happening there, I would need a reproducible example that does lead to an extreme p-value

mschubert commented 3 years ago

Thanks for the pointers! I'll check size_factor and see how much I can simplify and still reproduce this

const-ae commented 3 years ago

And a more general background, why I am confident that the method works in general, is the comparison with the p-values from edgeR, which uses the same statistical test. In Suppl. Figure S5 and S6 of the glmGamPoi Paper, I observe fairly consistent results between glmGamPoi and edgeR.

However, that of course does not mean, that there aren't cases where glmGamPoi might do something weird. So I very much appreciate your effort in trying to track down where this inconsistency is coming from :)

mschubert commented 3 years ago

As it turns out, this is still a normMatrix issue and I will continue at https://github.com/mikelove/DESeq2/issues/29

const-ae / glmGamPoi

P-values returned for low read counts sometimes seem too significant #13