KlausVigo / phangorn

Phylogenetic analysis in R
http://klausvigo.github.io/phangorn/
205 stars 39 forks source link

SimSeq() with substitution models #62

Closed jphill01 closed 5 years ago

jphill01 commented 6 years ago

Hi Klaus,

This issue relates to to the one I asked a while back (and it was closed by you).

I've decided to use simSeq() to generate sequences under different substation models.

To do this, I create a neighbour-joining tree from a distance matrix. The tree is then used as input to simSeq().

dis <- dist.dna(seqs, model = "K80") # K2P divergences
tr <- nj(dis) # unrooted neighbour-joining tree
res <- as.DNAbin(simSeq(tr, l = ncol(seqs))) # convert to DNAbin format

My question is: do I still need to specify Q in sim.Seq as per the Advanced Features Vignette in phangorn? Table 2 in the vignette is a bit confusing in this regard.

By default, Q is NULL. What is Q automatically set to in this case?

Can you clarify?

Thanks!

Cheers,

Jarrett

KlausVigo commented 6 years ago

Hi @jphill01, if you you don't specify Q (and bf), all rates are assumed equal and you simulate under a Jukes-Cantor model. Regards, Klaus

jphill01 commented 5 years ago

Hi Klaus,

Sorry for the late reply.

Would the 'res' object in my code above simulate DNA sequences according to a K2P model (i.e., model = "K80") despite Q and rate objects still being NULL?

That is, the model I want is specified in dist.dna(), but because I'm only looking to simulate DNA sequences from a tree constructed from an existing sequence alignment (and not doing phylogenetic inference), do the model, Q and rate arguments still need to correspond?

Note, I would also specify the bf argument in simSeq() using bf <- base.freq().

Thanks.

KlausVigo commented 5 years ago

Hi @jphill01, the model is just used for amino acid models as these usually use defined/fixed transition matrices and state frequencies (WAG, LG etc.). The default for nucleotides is the JC69 model, i.e. equal base frequencies and equal rates. So for a K2P need a to supply something like Q = c(1,2,1,1,2,1) to define the transition transversion ratio. If you add additionally bf = base.freq(some_other_alignment), you would simulate under a HKY model, the K2P assumes equal base frequencies (the default). I should probably add a parameter kappa for the transition transversion ratio to ease computation of some DNA or codon models.
Regards, Klaus

ms609 commented 5 years ago

It would be useful if this information, specifically the default model for simSeq (Jukes-Cantor?), could be mentioned in the function documentation. It took me quite a while to find the information on this page. I'd be happy to propose a pull request if this would be helpful.

KlausVigo commented 5 years ago

Hi Martin @ms609, that would be nice. Of course the function simSeq.phylo can not know what parameters one intend to use, that's why I used Jukes-Cantor as default. simSeq.pml uses all the parameters provided, so it does parametric bootstrap (1 iteration).
Regards, Klaus

KlausVigo commented 5 years ago

Thanks Martin @ms609, I think the main problem is that an amino acid and nucleotide model are very different used, but it is not obvious during phylogenetic reconstruction.
For example a GTR model for DNA does not provide (!) but optimizes parameter of base frequencies and the rates in Q during phylogenetic reconstruction (and these will be different for every data set). In contrast the Dayhoff model takes rates and base frequencies for amino acids is derived and kindly provided by Margaret Dayhoff. Therefore simSeq only uses amino acid models in the model argument and the parameter sitting ready to use in some files in the folder phangorn/inst/extdata. Maybe I should add a warning or error message (that may crashes some users scripts) to clarify that simSeq expects only amino acid models.

Instead the defaults section I better describe what arguments (apart from rate) can or should be supplied to the function: DNA: bf and Q or maybe some day also tstv ratio. AA: model (and rarely bf).
CODON: dnds, tstv and bf (according to the genetic code provided). I should add the possibility to allow base frequencies and compute codon frequencies from this.
USER: bf and Q