cbg-ethz / dce

Finding the causality in biological pathways
https://cbg-ethz.github.io/dce/

Simulating negative binomial read counts on DAGs #11

Closed kpj closed 4 years ago

kpj commented 4 years ago

Idea 1

beta > 0 describes the relative change. 0.5 corresponds to halving and 2 to doubling the expression levels. This is problematic because it requires a transformation of causal effects which is non-trivial (but possibly somehow doable?).
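One way to read Idea 1 is that beta scales a baseline mean of the child node, which is an additive shift of log(beta) on the log scale. The sketch below illustrates that reading only; `mu_baseline` and the named beta values are hypothetical and not taken from the thread.

```r
# Hypothetical baseline mean of the child node (assumption, not from the thread).
mu_baseline <- 10
beta <- c(half = 0.5, none = 1, double = 2)

# A multiplicative effect on the count scale ...
mu_child <- mu_baseline * beta

# ... is an additive effect on the log scale: log(0.5) = -log(2).
log_effect <- log(beta)

# Empirical means of simulated counts track the scaled means.
set.seed(42)
empirical_mean <- sapply(mu_child, function(mu) {
  mean(rnbinom(10000, size = 10, mu = mu))
})
```

Halving and doubling are asymmetric on the count scale (0.5 vs. 2) but symmetric on the log scale, which is part of why converting such effects into additive causal effects takes some care.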

Idea 2

beta can be both positive and negative. Counts are propagated by multiplying beta with the (optionally mean-centered) parent counts and adding noise. This is problematic because centering might introduce artefacts and can lead to mu < 0 (which yields NaN counts).

beta <- -1.2

set.seed(42)
A.nb <- rnbinom(1000, size=10, mu=10)

B.nb <- beta * A.nb + rnbinom(1000, size=10, mu=10) # leads to negative counts
B.nb <- rnbinom(1000, size=10, mu=mean(A.nb) + beta * A.nb) # leads to negative mu, thus NA counts
B.nb <- rnbinom(1000, size=10, mu=10) + beta * scale(A.nb, scale=FALSE) # leads to negative counts
B.nb <- rnbinom(1000, size=10, mu=mean(A.nb) + beta * scale(A.nb, scale=FALSE)) # leads to negative mu, thus NA counts

MASS::glm.nb(B.nb ~ A.nb, link="identity")
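To get a feeling for how often these failure modes occur, the following sketch re-runs two of the variants above and counts the problematic entries. The variable names (`B.additive`, `B.mean`, `frac_*`) are introduced here for illustration only.

```r
set.seed(42)
beta <- -1.2
n <- 1000
A.nb <- rnbinom(n, size = 10, mu = 10)

# Additive noise on the count scale: values can drop below zero.
B.additive <- beta * A.nb + rnbinom(n, size = 10, mu = 10)

# Centered parent counts enter the mean directly: mu can become negative.
mu <- mean(A.nb) + beta * (A.nb - mean(A.nb))
# rnbinom warns and returns missing values wherever mu < 0.
B.mean <- suppressWarnings(rnbinom(n, size = 10, mu = mu))

frac_negative_counts <- mean(B.additive < 0)
frac_negative_mu <- mean(mu < 0)
frac_missing <- mean(is.na(B.mean))
```

Every entry with a negative mean turns into a missing count, so `frac_missing` equals `frac_negative_mu`.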

Idea 3

Pass the linear predictor through a mean function (inverse link) before using it as mu in rnbinom, so that mu stays positive. This requires fitting with the matching link function during the regression.

beta <- -1.2

set.seed(42)
A.nb <- rnbinom(1000, size=10, mu=10)

B.nb <- rnbinom(1000, size=10, mu=exp(log(10) + beta * (A.nb - mean(A.nb)))) # link function keeps mu positive, exp can lead to extreme values

MASS::glm.nb(B.nb ~ A.nb, link="log")
glm(B.nb ~ A.nb, family=MASS::negative.binomial(theta=10, link="log"))
glm2::glm2(B.nb ~ A.nb, family=MASS::negative.binomial(theta=10, link="log"))
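As a quick sanity check that the log link recovers the simulated coefficient, here is a minimal sketch. It deliberately uses a milder beta than the -1.2 above (an assumption made for this example, to avoid the extreme means mentioned in the comment) and fixes theta at the simulated size.

```r
set.seed(42)
n <- 1000
beta <- -0.1  # milder than -1.2; chosen for this sketch only

A.nb <- rnbinom(n, size = 10, mu = 10)
mu <- exp(log(10) + beta * (A.nb - mean(A.nb)))  # log link keeps mu > 0
B.nb <- rnbinom(n, size = 10, mu = mu)

# Negative binomial GLM with known theta and log link.
fit <- glm(
  B.nb ~ A.nb,
  family = MASS::negative.binomial(theta = 10, link = "log")
)
beta_hat <- unname(coef(fit)["A.nb"])
```

Because the data are generated from the same model that is fitted, `beta_hat` should land close to the simulated `beta`; the intercept absorbs the centering of `A.nb`.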

kpj commented 4 years ago

Identity link

Problem: negative mu. Solution: dynamic mean for the response (+ response-independent offset).

Problem: count distribution unrealistic. Solution: initialize source nodes with high dispersion.

Problem: does not generalize to real data. Solution: 😕
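For reference, in the mu/size parametrization of rnbinom the variance is mu + mu^2 / size, so "high dispersion" corresponds to a small size. The sketch below compares the size = 10 used earlier in the thread with size = 1, which is an illustrative value and not one proposed in the discussion.

```r
set.seed(42)
n <- 100000
mu <- 10

# Variance of the negative binomial in the mu/size parametrization.
theoretical_var <- function(mu, size) mu + mu^2 / size

low_dispersion  <- rnbinom(n, size = 10, mu = mu)  # variance 10 + 100/10 = 20
high_dispersion <- rnbinom(n, size = 1,  mu = mu)  # variance 10 + 100/1  = 110

empirical_var <- c(low = var(low_dispersion), high = var(high_dispersion))
expected_var <- c(low = theoretical_var(mu, 10), high = theoretical_var(mu, 1))
```

Both settings share the same mean, so lowering size only widens the spread of the source-node counts.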

kpj commented 4 years ago

Log link

Problem: can lead to extreme responses.

Problem: mean response depends on coupling to parents.

Advantage: count distribution similar to real data.

Problem: source nodes have unrealistic count distribution (investigate real data). Solution: introduce artificial source node connected to original sources.
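The first two problems can be seen without fitting anything. Since exp is convex, the average of exp(beta * centered counts) is at least 1 and grows with |beta|, so the child's mean moves away from the baseline of 10 as the coupling gets stronger. A minimal sketch, reusing the mean function from Idea 3 (the helper `mean_mu` and the grid of beta values are introduced here for illustration):

```r
set.seed(42)
A.nb <- rnbinom(1000, size = 10, mu = 10)
centered <- A.nb - mean(A.nb)

# Average of the log-link mean over all samples, for a given coupling beta.
mean_mu <- function(beta) mean(exp(log(10) + beta * centered))

betas <- c(0, 0.1, 0.2, -1.2)
mean_response <- sapply(betas, mean_mu)
names(mean_response) <- betas

# Largest single mean under the strong coupling used in the thread.
max_mu <- max(exp(log(10) - 1.2 * centered))
```

With beta = 0 the mean response is exactly the baseline; any non-zero coupling inflates it, and beta = -1.2 produces individual means that are orders of magnitude above the baseline.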

MartinFXP commented 4 years ago

How to compare simulated and real data:

https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa105/5739438

MartinFXP commented 4 years ago

Current approach:

Simulate with the identity link and push the minimum count to 1.

Possible adjustment:

Only push the minimum to 1 if it is less than zero (this would be in agreement with the current solver).
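The two variants could be sketched as follows. This assumes that "push" means shifting all values by a constant and that the shift is applied to the mean before sampling; the thread does not spell out either detail, and the helper names `push_always` and `push_if_negative` are hypothetical.

```r
set.seed(42)
beta <- -1.2
n <- 1000
A.nb <- rnbinom(n, size = 10, mu = 10)

# Identity link: the raw mean can become negative.
mu_raw <- mean(A.nb) + beta * (A.nb - mean(A.nb))

# Current approach (as read here): always shift so that the minimum is 1.
push_always <- function(mu) mu - min(mu) + 1

# Possible adjustment: shift only if the minimum is below zero.
push_if_negative <- function(mu) {
  if (min(mu) < 0) mu - min(mu) + 1 else mu
}

B.always <- rnbinom(n, size = 10, mu = push_always(mu_raw))
B.adjusted <- rnbinom(n, size = 10, mu = push_if_negative(mu_raw))
```

The two functions agree whenever the raw minimum is negative and differ only for inputs that are already non-negative, which the adjusted version leaves untouched.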

kpj commented 4 years ago

Adjusted identity link function ...