PoonLab / Kaphi

Kernel-embedded ABC-SMC for phylodynamic inference
GNU Affero General Public License v3.0
4 stars 2 forks source link

MASTER Newick output tree errors #120

Closed gtng92 closed 6 years ago

gtng92 commented 6 years ago

Errors that I'm currently working on catching that relate to the MASTER output of newick trees:

Error 1

Writing Newick output...
Warning: Newick writer skipping empty graph. (repeated 1 times)
Done.
# which later when Kaphi receives it throws the error
Error in x[[i]] : subscript out of bounds

Error 2

Writing Newick output...
Done.
Error in if (tp[3] != "") obj$node.label <- tp[3] : 
  missing value where TRUE/FALSE needed
gtng92 commented 6 years ago

Error 2 A temp.newick file was generated for the set of trees that threw the error. Looking into the file, there are five trees, but 1 or more of them are written as: 1:0.1;

When I attempt to read in the tree, it gives the same error.

test <- read.tree(text='1:0.1;')
Error in if (tp[3] != "") obj$node.label <- tp[3] : 
  missing value where TRUE/FALSE needed

Stochastically, the run will output non-trees without the requested number of tips.

gtng92 commented 6 years ago

Error 1

Generating inheritance trajectory 1 of 5
Generating inheritance trajectory 2 of 5
Generating inheritance trajectory 3 of 5
Generating inheritance trajectory 4 of 5
Generating inheritance trajectory 5 of 5
Writing Newick output...
Warning: Newick writer skipping empty graph. (repeated 1 times)
Done.
Length of new.trees = 4

So there are 5 trajectories required within new.trees, but only 4 are retained. Implementing a check for 5 trees within newick file.

Error 2 Currently implementing a check for an empty file, or any errors with reading in the newick formatted file. If so, call function within perturb.particles again for a new set of theta params.

gtng92 commented 6 years ago

At dev meeting, we decided on a different approach for catching twigs and stumps' exceptions.

Twigs: incorrectly interpreted 1:0.1; needs to be retained as a viable (dummy) tree Stumps: returned as dummy trees

Dummy tree: something that will spike that particle out of the next iteration. It could mean forcing a high kernel distance, but there are also other tree distances to keep in mind.

gtng92 commented 6 years ago

For now, the current 'dummy tree' is hard-coded as (1:0.1,1:0.1):0;

ArtPoon commented 6 years ago

Run validation experiments using each of the two approaches:

  1. spike particle with dummy trees for failed simulations
  2. drop failed simulations as missing data and calculate acceptance probability using remaining trees (if there are none, then drop the particle).

It sounds more like we have to be smarter about choosing our prior distributions, however, and also making use of the ability to constrain prior distributions (where one rate parameter is assumed to be greater or lesser than another).

ArtPoon commented 6 years ago

For the case that kernel distances appear to be the same for trees simulated over a range of parameter values in MASTER, please visually examine the trees generated at the extreme end points of the parameter range to see whether they "look" different.

gtng92 commented 6 years ago

Modified Grid-search Run 1

True Params:

theta <- c(t.end=0.2, N=10000, beta=1, gamma=5, phi=3)

Grid-search Params:

t.end <- seq(0.1, 1.5, 0.2)     # tsample = 0.1
N <- seq(7000, 14000, 1000)
beta <- seq(0.2, 3.2, 0.4)
gamma <- seq(2.8, 5.6, 0.4)
phi <- seq(2.2, 5.0, 0.4)

Distances for modified t.end and beta parameters (example)

gridsearch-epidem

The grid looked similar when comparing distances across modified t.end and N parameters.

When t.end (simulationTime in MASTER) is > 1.0, kernel distances are high, and parameters N (Susceptible populationSize) and beta (infection rate) do not have a strong effect on the distance.

I've also noticed that the distances are very similar with any tree that has t.end < 1.0 .

I looked again into Project kamphir in file kamphir/drivers/MASTER.SIR.py and found default params to be:

# default parameter values
context = {
    'beta': 0.001,  # transmission rate
    'gamma': 0.3,  # mortality rate
    'N': 1000,      # total population size
    't_end': 30,    # length of simulation
    'phi': 0.15,    # sampling rate
    'ntips': 100,    # number of tips in tree
    'nreps': 10,     # number of trees to generate
    'outfile': outfile
}

I will change the true value of t.end to 30 and perform a grid-search from t.end <- seq(10, 45, 5) and re-run the simulation.

ArtPoon commented 6 years ago

With that N and beta, the epidemic is going to move very fast through the population -- for relatively large t.end the epidemic will have long since swamped the population and I think the associated tree will look like an "E".
Thanks for posting this!

On Oct 26, 2017, at 4:45 PM, Tammy Ng notifications@github.com wrote:

True Params:

theta <- c(t.end=0.2, N=10000, beta=1, gamma=5, phi=3) Grid-search Params:

t.end <- seq(0.1, 1.5, 0.2 )

N <- seq(7000, 14000, 1000 )

beta <- seq(0.2, 3.2, 0.4 )

gamma <- seq(2.8, 5.6, 0.4 )

phi <- seq(2.2, 5.0, 0.4) Distances for modified t.end and beta parameters (example)

The grid looked similar when comparing distances across modified t.end and N parameters.

When t.end (simulationTime in MASTER) is > 1.0, kernel distances are high, and parameters N (Susceptible populationSize) and beta (infection rate) do not have a strong effect on the distance.

I've also noticed that the distances are very similar with any tree that has t.end < 1.0 .

I looked again into Project kamphir in file kamphir/drivers/MASTER.SIR.py and found default params to be:

default parameter values

context

{

'beta': 0.001, # transmission rate

'gamma': 0.3, # mortality rate

'N': 1000, # total population size

't_end': 30, # length of simulation

'phi': 0.15, # sampling rate

'ntips': 100, # number of tips in tree

'nreps': 10, # number of trees to generate

'outfile' : outfile }

I will change the true value of t.end to 30 and perform a grid-search from t.end <- seq(10, 45, 5) and re-run the simulation.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

gtng92 commented 6 years ago

Modified grid-search Run 2

True Params:

theta <- c(t.end=30, N=1000, beta=0.001, gamma=0.3, phi=0.15)

Grid-search Params:

t.end <- seq(20, 55, 5)           # tsample = 20
N <- seq(700, 1400, 100)
beta <- seq(0.0005, 0.004, 0.0005)
gamma <- seq(0.1, 0.8, 0.1)
phi <- seq(0.05, 0.4, 0.05)

None of the parameters were estimable within these parameter ranges.

ArtPoon commented 6 years ago

Ok, time to start looking at bringing in other distance functions. We should be able to pick up t.end and N. At the extremes of those parameters, the trees should look totally bogus.

Please constrain tsampl to be equal to t.end (simulation duration) for the case that there is only one sampling time.

It looks like the reason we can't estimate N is because the sampling time is fixed for the grid search, although the simulation end time varies -- this means that the time between the MRCA and the tips is always the same.

ArtPoon commented 6 years ago

Some code for looking at grid search results:

gr <- read.csv('~/Desktop/volcano/gridsearch.run5.csv')
names(gr) <- c('index', 't.end', 'N', 'beta', 'gamma', 'phi', 'distance')

z <- matrix(0, nrow=8, ncol=8)
for (i in 1:8) {
  for (j in 1:8) {
    x <- unique(gr$beta)[i]
    y <- unique(gr$t.end)[j]
    foo <- gr$distance[gr$beta==x & gr$t.end==y]
    z[i,j] <- mean(sort(foo)[1:200])
  }
}
filled.contour(x=unique(gr$beta), y=unique(gr$t.end), z, color.palette=terrain.colors, legend=F)
abline(h=30)
abline(v=1e-3)
ArtPoon commented 6 years ago

The validation experiments described above are being carried forward in another issue (#42)