Closed mschubert closed 3 years ago
Hey, that is an interesting example. I produced a reprex to make it a bit clearer what is happening:
library(glmGamPoi)
y_ctrl <- c(rep(0, 1901), rep(1, 1346), rep(2, 553), rep(3, 202), rep(4, 64), rep(5, 20), rep(6, 7), rep(7, 3))
y_trt <- rep(0, 6)
label <- c(rep("ctrl", length(y_ctrl)), rep("trt", length(y_trt)))
y <- c(y_ctrl, y_trt)
fit <- glm_gp(y ~ label - 1)
test_de(fit, labelctrl - labeltrt)
#> name pval adj_pval f_statistic df1 df2 lfc
#> 1 row_1 0.002529886 0.002529886 9.118852 1 Inf 144269504
Created on 2020-11-24 by the reprex package (v0.3.0.9001)
As you can see, if you do the test only on this particular gene, the p-value is actually roughly in line with your binomial-distribution based calculation.
The interesting question now is, why is the estimate so different in your case:
size_factor
can have a large impact on the result and change it by some magnitudes:
sf <- c(rep(1, length(y_ctrl)), rep(10, length(y_trt)))
fit2 <- glm_gp(y ~ label - 1, size_factors = sf)
test_de(fit2, labelctrl - labeltrt)
#> name pval adj_pval f_statistic df1 df2 lfc
#> 1 row_1 6.350626e-13 6.350626e-13 51.73543 1 Inf 144269504
Thanks for the pointers! I'll check size_factor
and see how much I can simplify and still reproduce this
And a more general background, why I am confident that the method works in general, is the comparison with the p-values from edgeR
, which uses the same statistical test. In Suppl. Figure S5 and S6 of the glmGamPoi Paper, I observe fairly consistent results between glmGamPoi
and edgeR
.
However, that of course does not mean, that there aren't cases where glmGamPoi
might do something weird. So I very much appreciate your effort in trying to track down where this inconsistency is coming from :)
As it turns out, this is still a normMatrix
issue and I will continue at https://github.com/mikelove/DESeq2/issues/29
Hi again,
I'm using
glmGamPoi_1.3.3
viaDESeq2_1.31.3
for 10x sc-RNA-seq data. For one comparison, I have 4096 control cells and 6 cells in my condition (the imbalance is, unfortunately, unavoidable). For a couple of genes where my "condition" cells have no reads, I find the p-values returned are very low. For instance, mygene1
below:Naively, I would expect that the probability of picking 6 cells with 0 reads out of the distribution of my control cells would be the fraction of zero-read cells to the power to cells, i.e.:
Now, this is of course a probability and not a p-value (although they coincide in this case), and a binomial distribution instead of a gamma mixture of poissons.
But should the significance measure be that different?