Closed ArtPoon closed 5 years ago
This is a series of experiments I ran using the simulated tree fastTrans_9.labeled.nwk
, which came from the original MMPP study.
Original time-scaled tree:
NoEffectAxis: standard deviation 0.1*2.12e-15 in principal axis 0 without effect
log likelihood for 2 state model is 1258.014331
rates: 45.242931 134.996906
Q: [ * 0.436792 ]
[ 2.367241 * ]
RAxML tree reconstructed from INDELible simulation on tree from (1), midpoint rooted:
log likelihood for 2 state model is 2371.923916
rates: 185.981160 10000.000000
Q: [ * 418.811657 ]
[ 10000.000000 * ]
Single-precision FastTree2 tree reconstructed from same INDELible simulation, midpoint rooted:
log likelihood for 2 state model is 2138.805842
rates: 375.846282 873.459329
Q: [ * 7.388263 ]
[ 21.726021 * ]
Neighbor-joining tree (NINJA) reconstructed from same simulation, midpoint rooted:
log likelihood for 2 state model is 2232.188977
rates: 449.899698 1362.069165
Q: [ * 13.271125 ]
[ 123.536773 * ]
This is set of histograms summarizing branch lengths in the original (red) and RAxML (blue) trees, respectively, where the original tree was rescaled to account for the clock rate in simulations:
Note a substantial proportion of branches are effectively zero length in the RAxML tree:
> table(t2r$edge.length == min(t2r$edge.length))
FALSE TRUE
1696 264
whereas the neighbor-joining tree has far fewer minimal branch lengths:
> table(t4r$edge.length == min(t4r$edge.length))
FALSE TRUE
1942 18
Our objective is to reconstruct a time-scaled tree from a molecular phylogeny.
Let's assume that sampling dates are not diverse enough to estimate a molecular clock (if they are, then one could use node.dating or treedater, for example). The expected number of substitutions (multiply branch length by alignment length) is a Poisson-distributed outcome (which we may have to round to the nearest integer).
seq.len <- 1197 # from hxb2_pol
y <- round(t2$edge.length * seq.len)
Ntip <- function(tr) {
length(tr$tip.label)
}
is.tip <- function(tr) {
tr$edge[,2] <= Ntip(tr)
}
# distribution of terminal branch lengths
barplot(table(y[is.tip(t2)]), space=0, border=NA, col='orange',
xlab='Expected number of substitutions', cex.lab=1.2)
Note that the simulation model that produced the original tree had a mixture of internal branch lengths (two rate classes) so we don't expect this to look like a Poisson distribution. [EDIT: not that we should anyhow since this is a convolution of a Poisson and some underlying distribution of time lengths.]
# distribution of internal branch lengths
barplot(table(y[!is.tip(t2)]), space=0, border=NA, col='dodgerblue',
xlab='Expected number of substitutions', cex.lab=1.2)
[EDIT: I mixed up the plots - in fact the upper plot is terminal branches and the lower is internal.]
The problem here is that there are a substantial number of internal branches with effectively zero length. The maximum likelihood estimate for the time length of such a branch would also be zero. However it is quite plausible for zero substitutions to occur over a non-zero duration.
I think a way to address this (without your usual dated-tip approaches) is to have the user specify prior distributions for times on internal and terminal branch lengths, respectively. Since we don't really care about the height of the tree (TMRCA), the exact means of these distributions shouldn't really matter. Then I think we can use a Bayesian approach to smooth the branch time estimates.
I've realized that I can fit a negative binomial distribution to the observed branch lengths, which is what you get from the Poisson distribution and a gamma conjugate prior on the rate parameter.
The log-likelihood of the expanded negative binomial (without integrating out the rate parameter) has a first-derivative solution for the rate parameter \frac{y+\alpha-1}{\beta-1}
where \alpha
and \beta
are estimated by fitting the NBin to the data and y
is the observed substitution count. Numerical optimization of the likelihood function gives a linear trend wrt. y
that is consistent with this solution. However the parameters do not line up.
Fit of negative binomial distribution to terminal branch lengths:
> require(MASS)
> fit.1 <- fitdistr(y[is.tip(t2r)], 'negative binomial')
> fit.1
size mu
4.3027167 5.0438420
(0.3923557) (0.1056821)
Look at shape of likelihood surface for given outcome y=0
:
lkl <- function(li, yi, fit) {
shape <- fit$est[1] # size, r
mu <- fit$est[2]
scale <- mu/shape # theta
ppois(yi, li) * pgamma(li, shape=shape, scale=scale)
}
x <- seq(0, 5, 0.01)
plot(x, lkl(x, 0, fit.1), type='l', ylab='Likelihood', xlab='Rate parameter')
Examine trend of MLE of rate parameter with outcomes:
mins <- sapply(0:15, function(yi) {
obj.func <- function(x) {
-lkl(x, yi, fit.1)
}
optimize(obj.func, c(0,100))$minimum
})
plot(0:15, mins)
This boring linear trend is consistent with my derivation of MLE by differentiating the log-likelihood function:
Problem:
shape
and mu
from fitdistr
are 4.3027167 and 5.0438420shape
is the same as alpha
, and beta
is alpha/mu
- these numbers give us a beta
less than 1, which makes the MLE negativeshoot, I should be using dgamma
, not pgamma
. duh.
Slope now matches theta/(1+theta)
where theta=mu/size
more simply 1/(beta+1)
where beta=size/mu
Intercept equals (size-1) / (beta+1)
Why am I getting beta-1
in the derivative? :-(
Crap. Forgot negative sign -
in exp
of Poisson distribution. I fail to math.
Revised formulation:
Bleh - Erik and Simon already figured this out, although their domain of application is slightly different. Just finish implementing the method and cite their paper.
Implemented method in R/clockify.R
. It doesn't improve things for the RAxML tree:
> t2c <- clockify(t2r, 1197/3)
> res.5 <- clmp(t2c)
ConditionNumber: maximal condition number 9.01e+12 reached. maxEW=1.98e-01,minEW=1.74e-14,maxdiagC=2.11e-01,mindiagC=8.34e-15
log likelihood for 2 state model is 2405.088475
rates: 0.000000 777.988713
Q: [ * 2310.671793 ]
[ 0.000000 * ]
Histogram of internal branch lengths and negative binomial fit:
The problem might be that the ML estimates of branch lengths from RAxML are too invariant around substitution counts (blue = RAxML tree, red = NJ tree):
I tried sampling branch lengths from the gamma-poisson likelihood function (we can use rgamma
with alpha'=alpha+y and beta'=beta+1) but that didn't help, so I went back to comparing single-precision FastTree (blue) and RAxML (red):
The main difference seems to be the shift of the smallest branch lengths toward zero for the RAxML tree.
So let's do something really simple and set the minimum branch lengths in the RAxML tree (t2r
) to the minimum length in the FastTree tree (t3r
):
> pad <- t2r
> min(pad$edge.length)
[1] 1.000001e-06
> min(t3r$edge.length)
[1] 5e-04
> pad$edge.length[pad$edge.length==min(pad$edge.length)] <- min(t3r$edge.length)
> res.pad <- clmp(pad)
log likelihood for 2 state model is 2202.862674
rates: 433.440312 1031.637816
Q: [ * 9.106233 ]
[ 23.591857 * ]
clmp works better on NJ trees than ML trees, where the latter tends to contain more zero or near-zero branch lengths. This seems to prevent the MMPP model from fitting properly. We could use a method such as non parametric rate smoothing to adjust ML tree inputs. Of course this should be validated by simulation.