ArtPoon / kamphir

Phylogenetic inference using a tree-shape kernel in an Approximate Bayesian Computation framework
BSD 3-Clause "New" or "Revised" License
6 stars 2 forks source link

Method validation #14

Closed ArtPoon closed 9 years ago

ArtPoon commented 9 years ago

Kamphir seems to be working fairly well on recovering SI model parameters from a simulated tree when trees are normalized by mean branch length - however, the parameter estimates are most accurate when the duration of simulation t.end is most similar to the original simulation. In the following screen cap, the true value of beta is 0.01 (1E-2) and target tree depth about 1500.

trace

This implies that there exists a ridge of high kernel score in parameter space where higher beta combined with a shorter simulation time can result in a congruently similar tree shape, i.e., when length-normalized.

I have tried turning of branch length normalization. This requires setting tau to some value that is representative of typical branch lengths (for example, a branch in the target tree may be 100 weeks, so tau should be on order of 100). Kernel score takes a huge hit. Inexplicably, this causes the algorithm to lose its ability to recover N. I'm not sure how this could be associated with normalizing trees.

ArtPoon commented 9 years ago

Does it make a difference if we use a full-dimensional update or component-wise update (-gibbs)? Evaluate with replicate samples on projects/hivepi/data/SI.beta001.N10E5.nwk.

ArtPoon commented 9 years ago

I have verified that turning off branch length normalization skews estimation of N for some reason. Component-wise updating makes the chain slower to burn in. It also has a higher overall proposal acceptance rate in exchange for fewer proposals to N, so it works out about the same as full-dimensional updating. Chains converge strongly to beta=0.01 and about N=11,000; the latter is slower to converge.

ArtPoon commented 9 years ago

I think setting sigma too small for a parameter in the proposal function makes the chain susceptible to wandering aimlessly along this axis of parameter space, causing it to get trapped in lower kernel score values. Here is an example from using rcolgem to directly fit the simulated tree SI.beta001.N10E5.nwk (SI.beta001.N10E5.log.11).

rplot

My reasoning is that if the step size is too small, coupled with uncertainty in estimating kernel score (since we are simulating a limited number of data sets for given parameters), then the MCMC process becomes a random walk that is not informed by kernel score.

Here is a plot of N (population size): rplot

Actually, now I think the problem is that as N gets very large, the proposal sigma remains the same (it was set to 1000.0 for this particular run), which may make it very difficult for the chain to get back down from 100,000 to 10,000 or so (the actual value was 10,000) if the posterior surface is fairly level for high values of N. Maybe this proposal should operate on a log scale for N?

ArtPoon commented 9 years ago

Currently I am validating the method on /MASTER.SIR.beta1E-3.gamma03.phi015.N1000.n100.labels.nwk. This is a tree of 274 tips that was generated using MASTER with the script SIRTree.xml that is currently in the top level directory (should probably move into /projects). These are simulations of an SIR process with transmission (birth) rate beta=0.001, removal rate gamma=0.3 and sampling rate phi=0.15 in a population of 1000 individuals (S0=999). Note the forward-time simulation was executed requesting 100 tips or more.

ArtPoon commented 9 years ago

Hopefully increasing the proposal sigma for gamma and phi from 0.001 to 0.005 will improve mixing. I have also increased the number of replicates from 10 to 20 for more accurate estimation of kernel score.

ArtPoon commented 9 years ago

Note that beta is strongly correlated with S0. Here are plots from BEAST2 and Kamphir to illustrate: rplot

This is probably why the BEAST authors usually report R0 (=beta*S0/gamma) and not beta. Also note that the true values in these cases are beta=0.001 and S0=999. (Fitting SIR model with MASTER to the simulated sequences, MASTER.SIR.n100.beauti.xml and MASTER.SIR.n100.timetree.nwk respectively).

One approach to addressing this issue is to structure the proposal function with some covariance matrix over the affected parameters.

ArtPoon commented 9 years ago

BEAST2 is processing at rate 1h16m58s/Msamples