PoonLab / Kaphi

Kernel-embedded ABC-SMC for phylodynamic inference
GNU Affero General Public License v3.0

Validate ABC-SMC estimation for compartmental (SIR) models #42

Closed ArtPoon closed 6 years ago

ArtPoon commented 7 years ago

Just like kernel-ABC validation

gtng92 commented 7 years ago

This is more of a minor annoyance, but I'm unsure about the format of the variable sampleStates. I created an example using an SIR non-dynamic compartmental model in pkg/examples/example-compartmental.R, and this warning shows on the console:

Warning message:  
In if (!is.na(sampleStates)) 
    if (!is.matrix(sampleStates)) stop("sampleStates must be a matrix (not a data.frame)") :  
      the condition has length > 1 and only the first element will be used

Looking into compartmental-models.R, I've set sampleStates as:

# sample states
sampleStates <- matrix(1, nrow=tips, ncol=length(demes))
colnames(sampleStates) <- demes
rownames(sampleStates) <- 1:tips

This code was derived from kamphir/simulate.SI.R. It creates a matrix with one column and many rows, all with the same value of 1.

In rcolgem.R (Erik Volz's package), I traced this warning back to line 119. The warning appears regardless, even though sampleStates is a matrix.

I figure that when we run the compartmental.model function, we don't want this warning popping up every single time it simulates a tree.
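
For reference, the warning comes from R's vectorized is.na(): applied to a matrix it returns a logical matrix, so the if() condition has length greater than 1. A minimal demo of the behaviour:

m <- matrix(1, nrow=3, ncol=1)
is.na(m)          # a 3x1 logical matrix, all FALSE
if (!is.na(m)) {  # condition has length 3: warns in R < 4.2 (an error in R >= 4.2)
  print("only the first element was used")
}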

ArtPoon commented 7 years ago

This is a bug in rcolgem that I fixed a while back, but the fix didn't make it into this release for some reason. We need to replace:

if (!is.na(sampleStates))

with

if (all(!is.na(sampleStates)))

This should result in a condition of length 1. We could make this change to our fork of rcolgem and direct people to that version. The alternative is to use the R function suppressWarnings in our wrapper function so that we don't get this warning message (a sketch follows). I think the best approach is to make our fork of rcolgem a submodule of Kaphi and edit this line, since other issues will arise.
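
For the wrapper route, a minimal sketch of where suppressWarnings could go; the wrapped call is only illustrative, standing in for wherever our wrapper invokes rcolgem:

result <- suppressWarnings(
  .call.rcolgem(x0, t0, t.end, sampleTimes, sampleStates, births,
                migrations, deaths, nonDemeDynamics, parms,
                fgyResolution, integrationMethod)
)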

gtng92 commented 7 years ago

I've simulated a target tree and was able to calculate kernel distances for a varying parameter. I'm also able to initialize the workspace, but when I call run.smc I keep getting the following error:

Error in sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) : 
  negative probability 

The traceback:

11.sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) at rcolgem.R#1473
10.simulate.binary.dated.tree.fgy(fgy[[1]], fgy[[2]], fgy[[3]], 
    fgy[[4]], sampleTimes, sampleStates, integrationMethod = integrationMethod) at compartmental-models.R#38
9..call.rcolgem(x0, t0, t.end, sampleTimes, sampleStates, births, 
    migrations, deaths, nonDemeDynamics, parms, fgyResolution, 
    integrationMethod) 
8.FUN(X[[i]], ...) 
7.lapply(X = X, FUN = FUN, ...) 
6.sapply(integer(n), eval.parent(substitute(function(...) expr)), 
    simplify = simplify) 
5.replicate(nsim, .call.rcolgem(x0, t0, t.end, sampleTimes, sampleStates, 
    births, migrations, deaths, nonDemeDynamics, parms, fgyResolution, 
    integrationMethod), simplify = FALSE) at compartmental-models.R#211
4.config$model(theta = theta, nsim = config$nsample, tips = workspace$tip.heights, 
    model = model, seed = seed, labels = workspace$tip.labels, 
    ...) 
3.simulate.trees(ws, ws$particles[i, ], model = model, ...) 
2.initialize.smc(ws, model) 
1.run.smc(ws, trace.file = "pkg/examples/example-compartmental.tsv", 
    model = "sir.dynamic") 

I noticed that when I calculated kernel distances for varying t.end, the same error is thrown if:

  1. I first set t.end = 30.*52 in theta for the target tree, and
  2. then, as I vary t.end, a value exceeds 30.*52.

One possible reason may be how I've set my prior distribution for t.end. Currently it's a normal distribution with mean=100 and sd=5. N is set as lognormal, and the rest of the parameters have gamma priors. I will check whether this error shows up with those as well.
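
For context, a sketch of these priors in the string format that config$priors uses elsewhere in this thread (the exact setting mechanism is glossed over, and the lognormal parameters are placeholders):

config$priors$t.end <- "rnorm(n=1,mean=100,sd=5)"
config$priors$N     <- "rlnorm(n=1,meanlog=log(5000),sdlog=0.5)"  # lognormal; parameters assumed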

gtng92 commented 7 years ago

rcolgem.R has a clause in function make.fgy that assigns the following:

demeNames <- rownames(births)
m <- nrow(births)
nonDemeNames <- names(nonDemeDynamics)
mm <- length(nonDemeNames)

and then checks

if (length(x0)!=m + mm) stop('initial conditons incorrect dimension', x0, m, mm) 

births = (ODE expr), therefore m = 1
nonDemeNames = ('S'), therefore mm = 1
x0 = (S=S, I=I), therefore length(x0) = 2

This is okay for the sir.nondynamic model but not for the rest, since I'm including compartment R in the ODE expressions as well as compartment E for the seir model. This changes length(x0) to either 3 or 4. Do I then put the recovered and exposed expressions together with the births expression? I would keep the part of the expression that is strictly "deaths" in the deaths compartment.

Before:

births <- rbind(c('parms$beta * S * I / (S+I) - (parms$gamma + parms$mu) * I'))
migrations <- rbind(c('0'))
deaths <- rbind(c('parms$gamma * I - parms$mu * R'))
nonDemeDynamics <- rbind(c('parms$mu * (I+R) - parms$beta * S * I / (S+I)'))

to:

births <- cbind(c('parms$beta * S * I / (S+I) - (parms$gamma + parms$mu) * I ', '- parms$mu * R'))
migrations <- rbind(c('0'))
deaths <- rbind(c('parms$gamma * I '))
nonDemeDynamics <- rbind(c('parms$mu * (I+R) - parms$beta * S * I / (S+I)'))

I would also have to change births to cbind instead of rbind so that rcolgem's nrow(births) registers the extra expression. Alternatively, I could change the specifications in rcolgem.R itself, but I don't know how much I should mess around with it.

ArtPoon commented 7 years ago

We don't want to incorporate the recovered compartment in the rcolgem expressions, so change this:

deaths <- rbind(c('parms$gamma * I - parms$mu * R'))

to this:

deaths <- rbind(c('parms$gamma * I'))

and exclude R from the initial conditions vector x0.
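
A minimal sketch of the change to the initial conditions (the initial counts here are placeholders):

x0 <- c(S=999, I=1, R=0)  # before: length(x0) = 3, but m + mm = 2, so make.fgy stops
x0 <- c(S=999, I=1)       # after: R is not tracked in the rcolgem state vector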

gtng92 commented 7 years ago

Update on the following error:

Error in sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) : 
  negative probability 

With parameter t.end = 30.*52, make.fgy returns:

( 
  fgy[[1]],  # time axis of ODE solution
  fgy[[2]],  # births
  fgy[[3]],  # migrations
  fgy[[4]],  # deme sizes
  fgy[[5]]   # ODE solution
)
[Plot with x = fgy[[2]] (births) and y = fgy[[1]] (time axis)]

I changed t.end to 1500 and it works OK for a single tree. Later, when I run.smc, the value needs to be kept below about 1300 so you won't get the negative probability error.

gtng92 commented 7 years ago

Individual validation of each parameter: Target tree parameters (true values):

theta <- c(t.end=50, N=5000, beta=0.1, gamma=1/520, mu=1/3640, alpha=0)

Configuration settings of priors:

> config$priors
$t.end
[1] "rnorm(n=1,mean=50,sd=5)"

$N
[1] "rnorm(n=1,mean=5000,sd=100)"

$beta
[1] "rgamma(n=1,shape=50,rate=1)"

$gamma
[1] "rgamma(n=1,shape=20,rate=1)"

$mu
[1] "rgamma(n=1,shape=20,rate=1)"

$alpha
[1] "rnorm(n=1,mean=1,sd=0.1)"

Kernel distances for varying N, using seq(100, 10100, 1000)  # (from, to, step)
[plot]

Kernel distances for varying beta, using seq(0.01, 0.36, 0.025)
[plot]

Kernel distances for varying gamma, using seq(0.001, 0.08, 0.0025)
[plot]

Kernel distances for varying mu, using seq(0.0001, 0.01, 0.0005)
[plot]

ArtPoon commented 7 years ago

There's something funny going on with the kernel scores here. Can you please check whether they're getting normalized correctly? I've run into a similar problem before; I'll dig around and make an issue.
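
For reference, a common normalization for tree kernels divides the score by the self-kernel scores, so that a tree compared against itself scores 1. A minimal sketch, assuming a kernel function k:

normalized.kernel <- function(t1, t2, k) {
  k(t1, t2) / sqrt(k(t1, t1) * k(t2, t2))
}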

ArtPoon commented 7 years ago

Okay, now that I understand these are kernel distances, I think we need to redo the analyses for mu and gamma to account for their low target values (0.00192 and 0.000274, respectively). Might also be worth applying a log-transform on the x-axis for these figures.
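
For the log-transform, something like the following, with hypothetical vectors of the gamma values tried and their kernel distances:

plot(gamma.vals, kdists, log='x', xlab='gamma (log scale)', ylab='kernel distance')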

I also expect some of these parameters to be confounded, like beta and N, and mu and gamma. Eventually it will be informative to vary these pairs jointly and then look at the distribution of kernel distances: see Rosemary's paper on contact networks and SMC-ABC for reference.

gtng92 commented 7 years ago

I've set gamma to 1/520 and mu to 1/3640. Should I switch the values? norm.mode is set to 'NONE'.

ArtPoon commented 7 years ago

Those target values are fine, but I would refine the range for simulations to evaluate against the target.

gtng92 commented 7 years ago

Kernel distances for varying N, for nsim=100 and seq(1000, 10000, 1000):
[plot]

In Rosemary's paper, it says that parameters N and m had little to no identifiability with ABC. Is this for a different kind of scenario?

Kernel distances for beta, for nsim=100 and seq(0.01, 0.28, 0.025):
[plot]

Kernel distances for gamma, for nsim=100 and seq(0.0005, 0.03, 0.0015):
[plot]

ArtPoon commented 7 years ago

Rosemary's paper focuses on a very different kind of model (a network-based epidemic). However, there are certainly similarities - it is a susceptible-infected process on a network. Actually, her model found some signal for N, except that I (the number of infected) and N are confounded. At this point I think we should switch our emphasis to bringing in other measures of tree shape, to see if we can improve the performance of ABC-SMC. Let's plan it out at our meeting Monday.

ArtPoon commented 7 years ago

By the way, the plots are much improved - thanks for refining the range on beta and gamma.
Until Monday, can you proceed with running ABC-SMC to assess performance and also to identify usability problems?

gtng92 commented 7 years ago

Yep, I'll look for a theta that reproduces the negative probability error.

ArtPoon commented 7 years ago

Try using the following approach to examine the distribution of kernel score over two model parameters:

plot(x, y, cex=sqrt(z))  # where x and y are two parameters and z is kernel score/distance

You can also use the rgb() function to map kernel distance onto the red, green or blue colour channel (or two simultaneously).
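
A sketch combining both suggestions; res is a hypothetical data frame with one row per simulation and a dist column of kernel distances (filled with random placeholders here):

res <- expand.grid(beta=seq(0.01, 0.28, 0.025), N=seq(1000, 10000, 1000))
res$dist <- runif(nrow(res))  # placeholder distances for illustration

# point size scaled by kernel distance
plot(res$beta, res$N, cex=sqrt(res$dist), xlab='beta', ylab='N')

# kernel distance mapped onto the red channel (rescaled to [0, 1])
z <- (res$dist - min(res$dist)) / diff(range(res$dist))
plot(res$beta, res$N, pch=19, col=rgb(z, 0, 1 - z), xlab='beta', ylab='N')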

gtng92 commented 7 years ago

Am I doing this correctly?

[plots of kernel distances for varying beta and varying N]

ArtPoon commented 7 years ago

Well, the trick is that you need to vary both beta and N - I suggest doing a "grid search" where you examine all pairwise combinations of values {beta} x {N}.
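
A sketch of that grid search; kernel.dist.at is a hypothetical helper standing in for simulating trees at a given theta and returning the mean kernel distance to the target:

grid <- expand.grid(beta=seq(0.01, 0.28, 0.025), N=seq(1000, 10000, 1000))
grid$dist <- mapply(function(b, n) {
  theta <- c(t.end=50, N=n, beta=b, gamma=1/520, mu=1/3640, alpha=0)
  kernel.dist.at(theta)  # hypothetical: simulate nsim trees, average distances
}, grid$beta, grid$N)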

MathiasRenaud commented 7 years ago

I have started a grid search script in pkg/examples/example-bd.R for birth-death model, which can be adapted to other models. Still working on the plot and testing.

gtng92 commented 7 years ago

This is the heatmap for N and beta with the different kernel distances. I can create a more specific one as well. True value for beta is 0.1, and true value for N is 5000.

[heatmap of kernel distances over beta and N]

gtng92 commented 7 years ago

Heatmap for pairwise comparisons of gamma and mu. True value for gamma is 0.00192, and true value for mu is 0.00027. It looks like the gamma values are consistently overestimated.

[heatmap of kernel distances over gamma and mu]

ArtPoon commented 7 years ago

There are a few things to try here:

  1. Increase the resolution of the numerical approximation for the ODE (the fgy.resol parameter) to reduce the chance of numerical error causing a negative probability.
  2. Ignore simulations where the negative probability error arises and re-attempt the simulation (a minimal sketch follows).
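
A minimal sketch of the second option, assuming the rcolgem call signature from the traceback above; the retry limit is an arbitrary choice:

# catch the "negative probability" error and re-attempt the simulation
sim.with.retry <- function(max.tries=10, ...) {
  for (i in 1:max.tries) {
    result <- tryCatch(.call.rcolgem(...), error=function(e) NULL)
    if (!is.null(result)) return(result)
  }
  stop("simulation still failing after ", max.tries, " attempts")
}
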
ArtPoon commented 7 years ago

There's a bug in compartmental-models.R: the deaths term should not appear in the births expression. For example:

    births <- rbind(c('parms$beta * S * I / (S+I) - parms$gamma * I'))

should be

    births <- rbind(c('parms$beta * S * I / (S+I)'))

The only place that parms$gamma * I should appear is in the subsequent deaths expression.

ArtPoon commented 7 years ago

Correcting the births matrix specification did not resolve the problem. Currently when the pseudorandom number generator is fixed with a given seed, the sixth particle being updated in SMC runs into a "negative probability" error (presumably associated with coalescent simulation).

gtng92 commented 7 years ago

Some of the parameter values actually didn't make sense. In a previous comment, I had set the target tree with a theta of:

theta <- c(t.end=50, N=5000, beta=0.1, gamma=1/520, mu=1/3640, alpha=0)

The configuration settings for the priors on beta, gamma, and mu in particular were previously:

> config$priors
$beta
[1] "rgamma(n=1,shape=50,rate=1)"
$gamma
[1] "rgamma(n=1,shape=20,rate=1)"
$mu
[1] "rgamma(n=1,shape=20,rate=1)"

And now that I've changed them to:

> config$priors
$beta
[1] "rgamma(n=1,shape=1,rate=5)"
$gamma
[1] "rgamma(n=1,shape=1,rate=10)"
$mu
[1] "rgamma(n=1,shape=0.2,rate=50)"

I can now initialize.smc with 100 particles (as far as I've tested). Thanks!
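
For reference, the mean of a Gamma(shape, rate) distribution is shape/rate, which shows why the old priors were centred far from the targets:

50 / 1    # old beta prior mean = 50,   vs. target beta  = 0.1
1 / 5     # new beta prior mean = 0.2
20 / 1    # old gamma prior mean = 20,  vs. target gamma = 1/520  ~ 0.0019
1 / 10    # new gamma prior mean = 0.1
0.2 / 50  # new mu prior mean = 0.004,  vs. target mu    = 1/3640 ~ 0.00027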

gtng92 commented 7 years ago

Ran with the parameters in commit 4d37ca4, executed with 500 particles; finished in approximately 12.39 hours.

Step 110 epsilon: 0.1023741 ESS: 187.2476 accept: 0.2204082 elapsed: 44595.8 s

Trajectories of the mean estimates of the parameters in theta can be viewed in this file: example-compartmental.pdf

In the plots it looks like there's still room to converge for parameters beta, gamma, and mu, but if I lower my acceptance rate the runtime would be considerably longer.

ArtPoon commented 7 years ago

Check whether collecting samples with invalid values of t.end is skewing the other parameter estimates: fix t.end to the actual value and then compare to the current set of results. (An invalid t.end can result in negative probabilities in the coalescent simulation, but that only throws a warning, not an exception.)

ArtPoon commented 7 years ago

@gtng92 observed that fixing t.end to its actual value still produces the negative probability error, so let's try using a uniform distribution with a lower bound at the actual t.end and an upper bound a small amount above this value, varying these bounds until we no longer get the error.
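
A sketch of that prior, in the string format shown earlier in this thread (the setting mechanism is glossed over, and the bounds are placeholders to be tuned):

config$priors$t.end <- "runif(n=1,min=50,max=55)"  # lower bound at the actual t.end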

gtng92 commented 6 years ago

Following Issue #120, I decided to implement the idea of incorporating a dummy tree when an error is encountered in rcolgem. From there I was able to run a grid search over a wide range of values for all params in theta (except alpha, which isn't used in an sir.dynamic model; I'm still working on stripping that part out when it's irrelevant to the model).

True Params:

theta <- c(t.end=200, N=10000, beta=0.1, gamma=0.2, mu=0.01, alpha=5)

Grid-search Params:

t.end <- seq(50, 400, 50)
N <- seq(7000, 14000, 1000)
beta <- seq(0.01, 0.36, 0.05)
gamma <- seq(0.01, 0.36, 0.05)
mu <- seq(0.001, 0.071, 0.01)
alpha <- 5

Average of top 10 thetas with lowest kernel distances:

avg.theta <- c(t.end=215, N=10900, beta=0.12, gamma=0.235, mu=0.03451, alpha=5)

Average of top 20 thetas:

avg.theta <- c(t.end=217.5, N=10277.78, beta=0.115, gamma=0.23, mu=0.0365, alpha=5)
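
A sketch of how these averages could be computed, assuming the grid-search results sit in a data frame grid with one row per theta and a dist column of kernel distances (names hypothetical):

ord <- order(grid$dist)
avg.theta <- colMeans(grid[ord[1:10], c('t.end', 'N', 'beta', 'gamma', 'mu')])  # top 10
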
ArtPoon commented 6 years ago

Ok, careful about this though -- if there was no real difference among trees with respect to kernel score, one could arbitrarily pick 10 or 20 trees and the mean parameter estimates would still be around these values based on the way you've designed the grid. It will be more convincing when we see the actual distribution of distances over the grid. Thanks!

gtng92 commented 6 years ago

From viewing the distributions, varying t.end and N seems to have little to no effect on the kernel distance. beta and gamma in particular seemed the most estimable (which agrees with what was estimable before).

[scatterplot of kernel distance over beta and gamma]

It's a little hard to see the distribution in this plot, but I can show the distributions at the dev meeting with a 3D scatterplot that is rotatable.

gtng92 commented 6 years ago

Trajectory of the mean t.end seems to be improving, although admittedly my priors were relatively narrow. It does look better than the t.end trajectory in the PDF file linked in the comment from Aug 17, 2017. Parameter beta still has a strange spike up to 40 iterations.

[plot: trajectory of mean t.end]

ArtPoon commented 6 years ago

I think we're in danger of fishing for results here. Let's see what distances are like using other metrics. Thanks!

gtng92 commented 6 years ago

To check, I widened the prior to a normal distribution with a mean of 300 and a standard deviation of 100. The true value was 200.

[plot: trajectory of mean t.end under the wider prior]

ArtPoon commented 6 years ago

Deprecated by #123