Accio / BioQC

Detect tissue heterogeneity in gene expression data with BioQC
http://accio.github.io/BioQC/
GNU General Public License v3.0
5 stars 8 forks source link

SignedGenesets gives `NaN`s for `valType %in% c("r", "f")` #31

Open idavydov opened 3 years ago

idavydov commented 3 years ago

Hi @Accio ,

I noticed that wmwTest() on SignedGenesets returns NaNs for valTypes "r" and "f". Not sure if that is intentional.

m <- matrix(c(
  0, 0, 0, 0, 0, 5,
  0, 0, 0, 0, 5, 5,
  0, 0, 0, 5, 5, 5,
  5, 5, 0, 0, 0, 0,
  5, 0, 0, 0, 0, 0
), byrow=TRUE, nrow=5)

rownames(m) <- paste0("g", 1:5)
colnames(m) <- paste0("s", 1:6)

sign <- BioQC::SignedGenesets(list(
  list(name="grows left to right", positive = c("g1", "g2", "g3"), negative = c()),
  list(name="grows left to right with neg", positive = c("g1", "g2", "g3"), negative = c("g4", "g5"))
))

BioQC::wmwTest(m, sign, valType="U1")
#>                              s1   s2 s3 s4 s5 s6
#> grows left to right           0  1.5  3  4  5  6
#> grows left to right with neg -6 -3.0  0  2  4  6
BioQC::wmwTest(m, sign, valType="r")
#>                                s1   s2  s3        s4        s5  s6
#> grows left to right            -1 -0.5   0 0.3333333 0.6666667   1
#> grows left to right with neg -Inf -Inf NaN       Inf       Inf Inf
BioQC::wmwTest(m, sign, valType="f")
#>                                s1   s2  s3        s4        s5  s6
#> grows left to right             0 0.25 0.5 0.6666667 0.8333333   1
#> grows left to right with neg -Inf -Inf NaN       Inf       Inf Inf

Created on 2021-08-12 by the reprex package (v2.0.1)

idavydov commented 3 years ago

Interestingly, valType="p.greater" always returns either 1 or 0. Not sure if that's correct here.

idavydov commented 3 years ago

Ok, I think I got it. nBg = nTotal-nInds, in grows left to right with neg all genes are included, so nBg is zero.

Then:

res = u1 / nInds / nBg;

Which leads to Inf. Ok; maybe fair enough for "r" and "f".

But for p-values this leads to overly optimistic/pessimistic estimates.Basically, p-values will be always 0 or 1.

This probably means that p-values are "too extreme" in all the cases.

Maybe in this case the correct way would be to compute two different p-values and use something like Fisher's method to aggregate them?

idavydov commented 3 years ago

Here's an example. With totally random signature covering 95 genes out of 100, we always get a skewed p-value distibution:

set.seed(42)
ngenes <- 100
nsamples <- 1000
m <- matrix(rnorm(ngenes * nsamples), ncol=nsamples)
rownames(m) <- paste0("g", seq_len(nrow(m)))
colnames(m) <- paste0("s", seq_len(ncol(m)))

# using all the genes except for the last five
big_signature <- sample(c("pos", "neg"), ngenes - 5, replace = TRUE)

random_sign <- BioQC::SignedGenesets(list(
  list(
    name = "random signature",
    pos = paste0("g", which(big_signature == "pos")),
    neg = paste0("g", which(big_signature == "neg"))
  )
))

hist(BioQC::wmwTest(m, random_sign, valType="p.less"))

Created on 2021-08-12 by the reprex package (v2.0.1)

idavydov commented 3 years ago

I think one way to resolve this problem is the following:

  1. We compute U for positive genes, and -U for negative genes.
  2. Then we convert both values to z-scores (z_pos and z_neg). z = (U - m_u) / sigma_U. Usually m_u = n1 * n2, but perhaps we would want to exclude negative class when computing z_pos and vice versa. Same goes for n1 and n2 in sigma computations with tie correction.
  3. We can compute z = (z_pos + z_neg) / sqrt(2), since sum of two normal distribution is a normal distribution.
  4. Finally, we can compute a p-value.