Closed ArtPoon closed 6 years ago
This is more of a minor annoyance, but I'm unsure about the format of variable sampleStates
.
I created an example using of SIR non-dynamic compartmental model in pkg/examples/example-compartmental.R
and this warning shows on the console:
Warning message:
In if (!is.na(sampleStates))
if (!is.matrix(sampleStates)) stop("sampleStates must be a matrix (not a data.frame)") :
the condition has length > 1 and only the first element will be used
Looking into compartmental-models.R
, I've set sampleStates
as:
# sample states
sampleStates <- matrix(1, nrow=tips, ncol=length(demes))
colnames(sampleStates) <- demes
rownames(sampleStates) <- 1:tips
This code was derived from kamphir/simulate.SI.R
. It creates a matrix with one column and many rows, all with the same value of 1.
In rcolgem.R
(Erik Volz's package), I traced this warning back to line 119. It displays regardless, even though sampleStates
is a matrix.
I figured when we run the compartmental.model
function, we don't want this warning to pop up every single time it simulates a tree.
This is a bug in rcolgem that I fixed a while back but didn't make it into this release for some reason. We need to replace:
if (!is.na(sampleStates))
with
if (all(!is.na(sampleStates))
This should result in a condition of length 1. We could make this change to our fork of rcolgem and direct people to that version.
The alternative is to to use the R function suppressWarnings
so that we don't get this warning message in our wrapper function.
I think the best approach is to make our fork of rcolgem a submodule of Kaphi and edit this line, since there will be other issues that arise.
I've simulated a target tree, and was able to calculate kernel distances for a varying parameter.
I'm able to initialize workspace. When I run.smc
I keep getting the following error:
Error in sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) :
negative probability
The traceback error:
11.sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) at rcolgem.R#1473
10.simulate.binary.dated.tree.fgy(fgy[[1]], fgy[[2]], fgy[[3]],
fgy[[4]], sampleTimes, sampleStates, integrationMethod = integrationMethod) at compartmental-models.R#38
9..call.rcolgem(x0, t0, t.end, sampleTimes, sampleStates, births,
migrations, deaths, nonDemeDynamics, parms, fgyResolution,
integrationMethod)
8.FUN(X[[i]], ...)
7.lapply(X = X, FUN = FUN, ...)
6.sapply(integer(n), eval.parent(substitute(function(...) expr)),
simplify = simplify)
5.replicate(nsim, .call.rcolgem(x0, t0, t.end, sampleTimes, sampleStates,
births, migrations, deaths, nonDemeDynamics, parms, fgyResolution,
integrationMethod), simplify = FALSE) at compartmental-models.R#211
4.config$model(theta = theta, nsim = config$nsample, tips = workspace$tip.heights,
model = model, seed = seed, labels = workspace$tip.labels,
...)
3.simulate.trees(ws, ws$particles[i, ], model = model, ...)
2.initialize.smc(ws, model)
1.run.smc(ws, trace.file = "pkg/examples/example-compartmental.tsv",
model = "sir.dynamic")
I noticed that when I calculated kernel distances for varying t.end
, the same error is thrown if:
t.end = 30.*52
in theta
for the target treet.end
, the t.end
value exceeds the value 30.*52One possible reason for this may be how I've set my prior distribution for t.end
. Currently it's a random normal distribution with mean=100 and sd=5.
N is set as lognormal, and the rest of the parameters are gamma distributions. I will see if this error shows up with these ones too.
rcolgem.R
has a clause in function make.fgy
that assigns the following:
demeNames <- rownames(births)
m <- nrow(births)
nonDemeNames <- names(nonDemeDynamics)
mm <- length(nonDemeNames)
and then checks
if (length(x0)!=m + mm) stop('initial conditons incorrect dimension', x0, m, mm)
births = (ODE expr) therefore m = 1 nonDemeNames = ('S') therefore mm = 1 x0 = (S=S, I=I) therefore length(x0) = 2
This is okay for the sir.nondynamic model but not for the rest, since I'm including compartment R in the ODE expressions as well as compartment E for the seir model. This changes length(x0) to either 3 or 4.
Do I then put the recovered
and exposed
expressions together with the births
expression?
I would keep the part of the expression that is strictly "deaths" in the deaths
compartment.
Before:
births <- rbind(c('parms$beta * S * I / (S+I) - (parms$gamma + parms$mu) * I'))
migrations <- rbind(c('0'))
deaths <- rbind(c('parms$gamma * I - parms$mu * R'))
nonDemeDynamics <- rbind(c('parms$mu * (I+R) - parms$beta * S * I / (S+I)'))
to:
births <- cbind(c('parms$beta * S * I / (S+I) - (parms$gamma + parms$mu) * I ', '- parms$mu * R'))
migrations <- rbind(c('0'))
deaths <- rbind(c('parms$gamma * I '))
nonDemeDynamics <- rbind(c('parms$mu * (I+R) - parms$beta * S * I / (S+I)'))
I would also have to change births to cbind
instead of rbind
so rcolgem's nrow(births)
registers the extra expression.
Or I could change rcolgem.R to other specifications. But I don't know how much I should mess around with it.
We don't want to incorporate the recovered compartment in the rcolgem expressions, so change this:
deaths <- rbind(c('parms$gamma * I - parms$mu * R'))
to this:
deaths <- rbind(c('parms$gamma * I'))
and exclude R
from the initial conditions vector x0
.
Update on the following error:
Error in sample.int(m^2, size = 1, prob = as.vector(.lambdamat)) :
negative probability
Parameter t.end
= 30.*52
make.fgy
returns
(
fgy[[1]], # time axis of ODE solution
fgy[[2]], # births
fgy[[3]], # migrations
fgy[[4]], # deme sizes
fgy[[5]] # ODE solution
)
Plot for `x=fgy[[2]], y=fgy[[1]]`
I changed t.end
to 1500 and it works ok for a single tree. Later when I run.smc
, need to be more stringent up to about 1300 so you won't get negative probability.
Individual validation of each parameter: Target tree parameters (true values):
theta <- c(t.end=50, N=5000, beta=0.1, gamma=1/520, mu=1/3640, alpha=0)
Configuration settings of priors:
> config$priors
$t.end
[1] "rnorm(n=1,mean=50,sd=5)"
$N
[1] "rnorm(n=1,mean=5000,sd=100)"
$beta
[1] "rgamma(n=1,shape=50,rate=1)"
$gamma
[1] "rgamma(n=1,shape=20,rate=1)"
$mu
[1] "rgamma(n=1,shape=20,rate=1)"
$alpha
[1] "rnorm(n=1,mean=1,sd=0.1)"
Kernel distances for varying N
, using seq(100, 10100, 1000) # (from, to, step)
Kernel distances for varying beta
using seq(0.01, 0.36, 0.025)
Kernel distances for varying gamma
using seq(0.001, 0.08, 0.0025)
Kernel distances for varying mu
using seq(0.0001, 0.01, 0.0005)
There's something funny going on with the kernel scores here. Can you please check whether they're getting normalized correctly? I've run into a similar problem before, I'll dig around and make an issue.
On Jun 28, 2017, at 12:46 PM, Tammy Ng notifications@github.com wrote:
Individual validation of each parameter: Target tree parameters (true values):
theta <- c(t.end=50, N=5000, beta=0.1, gamma=1/520, mu=1/3640, alpha=0) Configuration settings of priors:
config$priors $t.end
[ 1] "rnorm(n=1,mean=50,sd=5)"
$N
[ 1] "rnorm(n=1,mean=5000,sd=100)"
$beta
[ 1] "rgamma(n=1,shape=50,rate=1)"
$gamma
[ 1] "rgamma(n=1,shape=20,rate=1)"
$mu
[ 1] "rgamma(n=1,shape=20,rate=1)"
$alpha
[ 1] "rnorm(n=1,mean=1,sd=0.1)" Kernel distances for varying N, using seq(1000, 10000, 1000) # (from, to, step)
Kernel distances for varying beta using seq(0.01, 0.36, 0.025)
Kernel distances for varying gamma using seq(0.001, 0.08, 0.0025)
Kernel distances for varying mu using seq(0.0001, 0.01, 0.0005)
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
Okay, now that I understand these are kernel distances I think we need to redo the analyses for mu
and gamma
to account for their low target values (0.00192
and 0.000274
, respectively). Might also be worth applying a log-transform on the x-axis for these figures.
I also expect some of these parameters to be confounded, like beta
and N
, and mu
and gamma
. Eventually it will be informative to vary these pairs jointly and then look the distribution of kernel distances: see Rosemary's paper on contact networks and SMC-ABC for reference.
I've set gamma to 1/520. and mu to 1/3640. Should I switch the values?
norm.mode
is set to 'NONE'
Those target values are fine, but I would refine the range for simulations to evaluate against the target.
On Jun 28, 2017, at 5:00 PM, Tammy Ng notifications@github.com wrote:
I've set gamma to 1/520. and mu to 1/3640. Should I switch the values? norm.mode is set to 'NONE'
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/PoonLab/Kaphi/issues/42#issuecomment-311789305, or mute the thread https://github.com/notifications/unsubscribe-auth/ABDtUDzSCvsGtuCPaaVu9Yk03J4rOvTNks5sIr73gaJpZM4MDQzY.
Kernel distances for varying N
, for nsim=100
and seq(1000, 10000, 1000)
In Rosemary's paper, it says that parameter N
and m
had little to no identifiability with ABC. Is this for a different kind of scenario?
Kernel distances for beta
for nsim=100
and seq(0.01, 0.28, 0.025)
Kernel distances for gamma
for nsim=100
and seq(0.0005, 0.03, 0.0015)
Rosemary's paper focuses on a very different kind of model (network based epidemic). However there are certainly similarities - it is a susceptible-infected process on a network. Actually her model found some signal for N, only I (number of infected) and N are confounded. At this point I think we should switch out emphasis to bringing in other measures of tree shape, to see if we can improve the performance of ABC-SMC. Let's plan it out at our meeting Monday.
On Jun 29, 2017, at 10:48 AM, Tammy Ng notifications@github.com wrote:
Kernel distances for varying N, for nsim=100 and seq(1000, 10000, 1000) In Rosemary's paper, it says that parameter N and m had little to no identifiability with ABC. Is this for a different kind of scenario?
Kernel distances for beta for nsim=100 and seq(0.01, 0.28, 0.025)
Kernel distances for gamma for nsim=100 and seq(0.0005, 0.03, 0.0015)
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
By the way, plots are much improved, thanks for refining the range on beta and gamma.
Until Monday, can you proceed with running ABC-SMC to assess performance and also to identify usability problems?
Yep, I'll look for a theta that reproduces the negative probability error.
Try using the following approach to examine the distribution of kernel score over two model parameters:
plot(x, y, cex=sqrt(z)) # where x and y are two parameters and z is kernel score/distance
You can also use the rgb
argument to plug in kernel distance as a red, green or blue argument (or two simultaneously).
Am I doing this correctly?
Well, the trick is that you need to vary both beta and N - I suggest doing a "grid search" where you examine all pairwise combinations of values {beta} x {N}.
On Jul 5, 2017, at 9:27 AM, Tammy Ng notifications@github.com wrote:
Am I doing this correctly?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
I have started a grid search script in pkg/examples/example-bd.R
for birth-death model, which can be adapted to other models. Still working on the plot and testing.
This is the heatmap for N and beta with the different kernel distances. I can create a more specific one as well. True value for beta is 0.1, and true value for N is 5000.
Heatmap for pairwise comparisons of gamma and mu. True value for gamma is 0.00192, and true value for mu is 0.00027. It looks like the gamma values are consistently overestimated.
There are a few things to try here:
fgy.resol
parameter) to reduce chance of numerical error causing negative probability.There's a bug in compartmental-models.R
, deaths term should not appear in births expression; for example:
births <- rbind(c('parms$beta * S * I / (S+I) - parms$gamma * I'))
should be
births <- rbind(c('parms$beta * S * I / (S+I)'))
The only place that parms$gamma * I
should appear is in the subsequent deaths
expression.
Correcting the births
matrix specification did not resolve the problem. Currently when the pseudorandom number generator is fixed with a given seed, the sixth particle being updated in SMC runs into a "negative probability" error (presumably associated with coalescent simulation).
The parameter values for some actually didn't make sense. In a previous comment, I've set the target tree with theta
of:
theta <- c(t.end=50, N=5000, beta=0.1, gamma=1/520, mu=1/3640, alpha=0)
Configuration settings of priors beta
, gamma
, and mu
in particular were previously:
> config$priors
$beta
[1] "rgamma(n=1,shape=50,rate=1)"
$gamma
[1] "rgamma(n=1,shape=20,rate=1)"
$mu
[1] "rgamma(n=1,shape=20,rate=1)"
And now that I've changed them to:
> config$priors
$beta
[1] "rgamma(n=1,shape=1,rate=5)"
$gamma
[1] "rgamma(n=1,shape=1,rate=10)"
$mu
[1] "rgamma(n=1,shape=0.2,rate=50)"
I can now initialize.smc
for 100 particles (so far as I've tested).
Thanks!
Ran with parameters in commit 4d37ca4 executed with 500 particles finished in approximately 12.39 hours
Step 110 epsilon: 0.1023741 ESS: 187.2476 accept: 0.2204082 elapsed: 44595.8 s
Trajectories of the mean estimates of parameters in theta
are able to be viewed in this file:
example-compartmental.pdf
In the plots it looks like there's still room to converge for parameters beta, gamma, and mu. But if I lower my accept rate, the runtime would take a while longer.
Check whether collecting samples with invalid values of t.end
is skewing the other parameter estimates by fixing t.end
to the actual value and then comparing to the current set of results (where invalid t.end
can result in negative probabilities in the coalescent simulation, but only throws a warning and not an exception).
@gtng92 observed that fixing to actual t.end
is producing the negative probability error, so let's try using a uniform distribution with a lower bound at actual t.end
and upper bound some small amount above this value, and varying these bounds until we no longer get this error.
From Issue #120, I decided to implement the idea of incorporating a dummy tree upon an error encountered in rcolgem
. From there I was able to run a grid-search over a wide range of varying parameters for all params in theta
(except alpha
, which isn't used in an sir.dynamic
model --> still working on stripping that part out when it's irrelevant to the model)
True Params:
theta <- c(t.end=200, N=10000, beta=0.1, gamma=0.2, mu=0.01, alpha=5)
Grid-search Params:
t.end <- seq(50, 400, 50)
N <- seq(7000, 14000, 1000)
beta <- seq(0.01, 0.36, 0.05)
gamma <- seq(0.01, 0.36, 0.05)
mu <- seq(0.001, 0.071, 0.01)
alpha <- 5
Average of top 10 theta
s with lowest kernel distances:
avg.theta <- c(t.end=215, N=10900, beta=0.12, gamma=0.235, mu=0.03451, alpha=5)
Average of top 20 theta
s:
avg.theta <- c(t.end=217.5, N=10277.78, beta=0.115, gamma=0.23, mu=0.0365, alpha=5)
Ok, careful about this though -- if there was no real difference among trees with respect to kernel score, one could arbitrarily pick 10 or 20 trees and the mean parameter estimates would still be around these values based on the way you've designed the grid. It will be more convincing when we see the actual distribution of distances over the grid. Thanks!
From viewing the distributions, varying t.end
and N
seem to have little to no effect on the kernel distance. beta
and gamma
in particular seemed the most estimable (agrees with what was estimable before).
It's a little hard to see the distribution in this plot, but I can show the distributions at the dev meeting with a 3D scatterplot that is rotatable.
Trajectory of Mean t.end
seems to be improving, although admittedly my priors were relatively narrow. It does look better than the t.end
trajectory in the PDF file linked in the comment from Aug 17, 2017. Parameter beta
still seems to have a strange spike up to 40 iterations.
I think we're in danger of fishing for results here. Let's see what distances are like using other metrics. Thanks!
To check, I widened the prior to a normal distribution with a mean of 300 and a standard deviation of 100. True value was 200.
Deprecated by #123
Just like kernel-ABC validation