4ment / torchtree

A probabilistic framework in PyTorch for phylogenetic models
https://4ment.github.io/torchtree
GNU General Public License v3.0

Name and order of the relative frequencies and rates #7

Open maremita opened 1 year ago

maremita commented 1 year ago

Hello, in the CSV sampling file, the parameters of the substitution models are labeled by index (substmodel.frequencies.[0, 1,..], substmodel.rates.[0, 1,..]) rather than by name (for example, [A, G, C, T] and [AG, AC, AT, GC, GT, CT] for nucleotides). I couldn't find their names or order documented in the source code, and they are not easy to infer from the code of q(), which builds the substitution matrix of the GTR model.

It would be helpful to document their names and order somewhere, to facilitate comparisons and post-analysis tasks.

Thank you Mathieu @4ment. Amine.

maremita commented 1 year ago

I ran some simulations with different frequency and rate values and compared the results to PhyML. I think the order in torchtree is frequencies = [A, C, G, T] and rates = [AC, AG, AT, CG, CT, GT], which is the standard order (sorted alphabetically, as found in popular tools). However, other tools and textbooks use different orders. For example, the recent variational phylogenetics tool VBPI uses frequencies = [A, G, C, T] and rates = [AG, AC, AT, GC, GT, CT]. EvoVGM, a generative variational model, adopted the same order as VBPI.
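
In case it helps with post-analysis, here is a minimal sketch of how the indexed columns could be relabeled and reordered. It assumes the torchtree order inferred above, which is an empirical guess and not something confirmed by the torchtree documentation, and it assumes column names of the form `substmodel.frequencies.0` and `substmodel.rates.0`. The helper names `rename_column` and `reorder` are hypothetical and not part of torchtree.

```python
# Orders inferred empirically in this thread; NOT confirmed by torchtree docs.
TORCHTREE_FREQS = ["A", "C", "G", "T"]
TORCHTREE_RATES = ["AC", "AG", "AT", "CG", "CT", "GT"]

# Orders reported above for VBPI and EvoVGM.
VBPI_FREQS = ["A", "G", "C", "T"]
VBPI_RATES = ["AG", "AC", "AT", "GC", "GT", "CT"]


def rename_column(column, prefix="substmodel"):
    """Map e.g. 'substmodel.rates.1' to 'substmodel.rates.AG'.

    Columns that do not match the assumed naming scheme are returned unchanged.
    """
    for kind, names in (("frequencies", TORCHTREE_FREQS), ("rates", TORCHTREE_RATES)):
        stem = f"{prefix}.{kind}."
        if column.startswith(stem):
            suffix = column[len(stem):]
            if suffix.isdigit() and int(suffix) < len(names):
                return stem + names[int(suffix)]
    return column


def reorder(values, source_names, target_names):
    """Reorder values given in source_names order into target_names order.

    Pair labels are treated as unordered, so 'GC' matches 'CG'. This relies
    on the exchangeability rates of a time-reversible model being symmetric.
    """
    lookup = {frozenset(name): value for name, value in zip(source_names, values)}
    return [lookup[frozenset(name)] for name in target_names]


if __name__ == "__main__":
    print(rename_column("substmodel.rates.1"))
    print(reorder([1, 2, 3, 4, 5, 6], TORCHTREE_RATES, VBPI_RATES))
```

With these assumptions, `rename_column` can be applied to the CSV header row, and `reorder` can be applied to each sample before comparing against output from a tool that uses the VBPI order.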