KlausVigo / phangorn

Phylogenetic analysis in R
http://klausvigo.github.io/phangorn/
203 stars 38 forks source link

gamma shape completely change after optimisation #156

Closed trilisser closed 11 months ago

trilisser commented 11 months ago

Hi there! I try to simulate sequences using a tree with the script below.

tree <- read.tree("tree.treefile")
seqHKY_unOpt <- simSeq(tree, l = 1000, Q = c(1,2,1,1,2,1), rate = 0.1)

fit <- pml(tree, seqHKY_unOpt, k=4, shape = 1, Q = c(1,2,1,1,2,1))
fit <- optim.pml(fit, model="HKY", optRate = TRUE, optEdge = F, optQ = T, optGamma = T, optBF = T)
seqHKY_opt <- simSeq(fit)
write.phyDat(seqHKY_opt, "seqHKY_shape1_newnamed.fasta", format = "fasta")

But after optimisation shape became ~999 and I think it is about connection of a substitution rate and alpha parameter of gamma. How do they connect?

KlausVigo commented 11 months ago

Dear @trilisser, seqHKY_unOpt <- simSeq(tree, l = 1000, Q = c(1,2,1,1,2,1), rate = 0.1) does simulate sequences with only one rate. Here rate is not a parameter from the gamma distribution. Setting rate here is the same as multiplying the edge length by 0.1. When optimising the shape parameter gives you a very high value (1000 is the upper limit for the optimisation) all four rates converge towards one. So this is expected if you simulated with only one rate. See plot_gamma_plus_inv(1000). Here one rate would be enough.

There is a small example simulate under a (discrete) gamma distribution in the help:

# Example to simulate discrete Gamma rate variation
rates <- discrete.gamma(1,4)
data1 <- simSeq(tree, l = 250, rate=rates[1])
data2 <- simSeq(tree, l = 250, rate=rates[2])
data3 <- simSeq(tree, l = 250, rate=rates[3])
data4 <- simSeq(tree, l = 250, rate=rates[4])
data <- c(data1,data2, data3, data4)

or you can do it like this:

tree <- read.tree("tree.treefile")
seq_1000 <- simSeq(tree, l = 1000) # we just want a sequence of length 1000 
fit1 <- pml(tree, seq_1000, k=4, shape = 1, Q = c(1,2,1,1,2,1), rate=.1) # define the model
seqHKY_unOpt <- simSeq(fit) # simSeq extracts the model parameters
fit <- pml(tree, seqHKY_unOpt, k=4, shape = 1, Q = c(1,2,1,1,2,1))
fit <- optim.pml(fit, model="HKY", optRate = TRUE, optEdge = F, optQ = T, optGamma = T, optBF = T)
fit$rate
fit
write.phyDat(seqHKY_opt, "seqHKY_shape1_newnamed.fasta", format = "fasta")
trilisser commented 11 months ago

Dear @KlausVigo Thank you so much! Can you please comment the 5th line of your second script

fit <- pml(tree, seqHKY_unOpt, k=4, shape = 1, Q = c(1,2,1,1,2,1))

We define the model the second time, do we? If so, why do we do that?

KlausVigo commented 11 months ago

Here I just changed to the simulated data set. The optimised parameters (line 6, based on the pml object created in line 5) should be similar to the parameters supplied to simSeq (shape = 1, Q = c(1,2,1,1,2,1), rate=.1) in line 3.

trilisser commented 5 months ago

Thank you very much @KlausVigo ! It helps a lot!