PoonLab / clmp

Genetic clustering with Markov-modulated Poisson processes
https://shiny.filogeneti.ca/clmp/
GNU General Public License v3.0
1 stars 2 forks source link

Implement smoothing method #21

Closed ArtPoon closed 5 years ago

ArtPoon commented 5 years ago

clmp works better on NJ trees than ML trees, where the latter tends to contain more zero or near-zero branch lengths. This seems to prevent the MMPP model from fitting properly. We could use a method such as non parametric rate smoothing to adjust ML tree inputs. Of course this should be validated by simulation.

ArtPoon commented 5 years ago

This is a series of experiments I ran using the simulated tree fastTrans_9.labeled.nwk, which came from the original MMPP study.

  1. Original time-scaled tree:

    NoEffectAxis: standard deviation 0.1*2.12e-15 in principal axis 0 without effect
    log likelihood for 2 state model is 1258.014331
    rates: 45.242931 134.996906 
    Q: [    *   0.436792 ]
    [ 2.367241   *    ]

    res1

  2. RAxML tree reconstructed from INDELible simulation on tree from (1), midpoint rooted:

    log likelihood for 2 state model is 2371.923916
    rates: 185.981160 10000.000000 
    Q: [    *   418.811657 ]
    [ 10000.000000   *    ]

    res2

  3. Single-precision FastTree2 tree reconstructed from same INDELible simulation, midpoint rooted:

    log likelihood for 2 state model is 2138.805842
    rates: 375.846282 873.459329 
    Q: [    *   7.388263 ]
    [ 21.726021   *    ]

    res3

  4. Neighbor-joining tree (NINJA) reconstructed from same simulation, midpoint rooted:

    log likelihood for 2 state model is 2232.188977
    rates: 449.899698 1362.069165 
    Q: [    *   13.271125 ]
    [ 123.536773   *    ]

    res4

ArtPoon commented 5 years ago

This is set of histograms summarizing branch lengths in the original (red) and RAxML (blue) trees, respectively, where the original tree was rescaled to account for the clock rate in simulations: histograms

Note a substantial proportion of branches are effectively zero length in the RAxML tree:

> table(t2r$edge.length == min(t2r$edge.length))

FALSE  TRUE 
 1696   264 

whereas the neighbor-joining tree has far fewer minimal branch lengths:

> table(t4r$edge.length == min(t4r$edge.length))

FALSE  TRUE 
 1942    18 
ArtPoon commented 5 years ago

Our objective is to reconstruct a time-scaled tree from a molecular phylogeny.

Let's assume that sampling dates are not diverse enough to estimate a molecular clock (if they are, then one could use node.dating or treedater, for example). The expected number of substitutions (multiply branch length by alignment length) is a Poisson-distributed outcome (which we may have to round to the nearest integer).

seq.len <- 1197  # from hxb2_pol
y <- round(t2$edge.length * seq.len)

Ntip <- function(tr) {
  length(tr$tip.label)
}

is.tip <- function(tr) {
  tr$edge[,2] <= Ntip(tr)
}
# distribution of terminal branch lengths
barplot(table(y[is.tip(t2)]), space=0, border=NA, col='orange', 
        xlab='Expected number of substitutions', cex.lab=1.2)

internals

Note that the simulation model that produced the original tree had a mixture of internal branch lengths (two rate classes) so we don't expect this to look like a Poisson distribution. [EDIT: not that we should anyhow since this is a convolution of a Poisson and some underlying distribution of time lengths.]

# distribution of internal branch lengths
barplot(table(y[!is.tip(t2)]), space=0, border=NA, col='dodgerblue', 
        xlab='Expected number of substitutions', cex.lab=1.2)

tips

[EDIT: I mixed up the plots - in fact the upper plot is terminal branches and the lower is internal.]

ArtPoon commented 5 years ago

The problem here is that there are a substantial number of internal branches with effectively zero length. The maximum likelihood estimate for the time length of such a branch would also be zero. However it is quite plausible for zero substitutions to occur over a non-zero duration.

I think a way to address this (without your usual dated-tip approaches) is to have the user specify prior distributions for times on internal and terminal branch lengths, respectively. Since we don't really care about the height of the tree (TMRCA), the exact means of these distributions shouldn't really matter. Then I think we can use a Bayesian approach to smooth the branch time estimates.

ArtPoon commented 5 years ago

I've realized that I can fit a negative binomial distribution to the observed branch lengths, which is what you get from the Poisson distribution and a gamma conjugate prior on the rate parameter.

The log-likelihood of the expanded negative binomial (without integrating out the rate parameter) has a first-derivative solution for the rate parameter \frac{y+\alpha-1}{\beta-1} where \alpha and \beta are estimated by fitting the NBin to the data and y is the observed substitution count. Numerical optimization of the likelihood function gives a linear trend wrt. y that is consistent with this solution. However the parameters do not line up.

ArtPoon commented 5 years ago

Fit of negative binomial distribution to terminal branch lengths: fit

> require(MASS)
> fit.1 <- fitdistr(y[is.tip(t2r)], 'negative binomial')
> fit.1
     size         mu    
  4.3027167   5.0438420 
 (0.3923557) (0.1056821)

Look at shape of likelihood surface for given outcome y=0:

lkl <- function(li, yi, fit) {
  shape <- fit$est[1]  # size, r
  mu <- fit$est[2]
  scale <- mu/shape  # theta
  ppois(yi, li) * pgamma(li, shape=shape, scale=scale)
}
x <- seq(0, 5, 0.01)
plot(x, lkl(x, 0, fit.1), type='l', ylab='Likelihood', xlab='Rate parameter')

lkl

ArtPoon commented 5 years ago

Examine trend of MLE of rate parameter with outcomes:

mins <- sapply(0:15, function(yi) {
  obj.func <- function(x) {
    -lkl(x, yi, fit.1)
  }
  optimize(obj.func, c(0,100))$minimum
})
plot(0:15, mins)

mle

This boring linear trend is consistent with my derivation of MLE by differentiating the log-likelihood function: output

ArtPoon commented 5 years ago

Problem:

ArtPoon commented 5 years ago

shoot, I should be using dgamma, not pgamma. duh.

ArtPoon commented 5 years ago

Slope now matches theta/(1+theta) where theta=mu/size more simply 1/(beta+1) where beta=size/mu

ArtPoon commented 5 years ago

Intercept equals (size-1) / (beta+1) Why am I getting beta-1 in the derivative? :-(

ArtPoon commented 5 years ago

Crap. Forgot negative sign - in exp of Poisson distribution. I fail to math.

ArtPoon commented 5 years ago

Revised formulation: output

ArtPoon commented 5 years ago

Bleh - Erik and Simon already figured this out, although their domain of application is slightly different. Just finish implementing the method and cite their paper.

ArtPoon commented 5 years ago

Implemented method in R/clockify.R. It doesn't improve things for the RAxML tree:

> t2c <- clockify(t2r, 1197/3)
> res.5 <- clmp(t2c)
ConditionNumber: maximal condition number 9.01e+12 reached. maxEW=1.98e-01,minEW=1.74e-14,maxdiagC=2.11e-01,mindiagC=8.34e-15
log likelihood for 2 state model is 2405.088475
rates: 0.000000 777.988713 
Q: [    *   2310.671793 ]
   [ 0.000000   *    ]
ArtPoon commented 5 years ago

Histogram of internal branch lengths and negative binomial fit: hist2

ArtPoon commented 5 years ago

The problem might be that the ML estimates of branch lengths from RAxML are too invariant around substitution counts (blue = RAxML tree, red = NJ tree): pulses

ArtPoon commented 5 years ago

I tried sampling branch lengths from the gamma-poisson likelihood function (we can use rgamma with alpha'=alpha+y and beta'=beta+1) but that didn't help, so I went back to comparing single-precision FastTree (blue) and RAxML (red): RAxMLvsFT2 The main difference seems to be the shift of the smallest branch lengths toward zero for the RAxML tree.

So let's do something really simple and set the minimum branch lengths in the RAxML tree (t2r) to the minimum length in the FastTree tree (t3r):

> pad <- t2r
> min(pad$edge.length)
[1] 1.000001e-06
> min(t3r$edge.length)
[1] 5e-04
> pad$edge.length[pad$edge.length==min(pad$edge.length)] <- min(t3r$edge.length)
> res.pad <- clmp(pad)
log likelihood for 2 state model is 2202.862674
rates: 433.440312 1031.637816 
Q: [    *   9.106233 ]
   [ 23.591857   *    ]

respad