abacus-gene / paml

PAML is a program package for model fitting and phylogenetic tree reconstruction using DNA and protein sequence data. Please report only **technical issues** on this repository (e.g., compiling, programs abort or do not run at all, etc.). Problems with input data and general questions should be posted at https://groups.google.com/g/pamlsoftware?pli
GNU General Public License v3.0

ndata of MCMCtree analysis using Single Copy orthologs in multiple genomes #50

Closed HyeonseonPark closed 4 months ago

HyeonseonPark commented 4 months ago

Dear Dr. Ziheng Yang,

Hello, I am Hyeonseon Park, currently conducting research on the genome sequencing and comparative genomics of plants in the Poaceae family. To infer the divergence times of these species, I am planning to perform an MCMCtree analysis. I have identified around 300 single-copy orthologs across 11 species.

I am wondering whether I should set ndata=300 in the control (.ctl) file for this analysis. I have noticed in other research papers that multiple single-copy orthologs are concatenated and analyzed with ndata=1. Could you please explain the difference between these approaches and advise on which method is more appropriate?

Thank you for developing such insightful software for evolutionary biology research.

Best regards,

Hyeonseon Park

sabifo4 commented 4 months ago

Hi there!

As explained in the PAML Wiki, this GitHub repository is intended for technical issues only. For non-technical questions and debates, there is a PAML discussion group that people are encouraged to use -- you may even find that similar questions have already been posted there (e.g., use the search bar with keywords to find an answer to your question). Although this issue should have been posted on the PAML discussion group, please find below some guidelines to answer your question.

Option ndata specifies the number of alignment blocks in your input sequence file. You will need to decide whether to partition your data (i.e., have more than one alignment block) or concatenate them (i.e., have one big alignment block with one gene concatenated after another), based on the biological question you want to answer and the type of data you have. Various papers have tried to assess the effect of partitioning (e.g., Angelis et al. 2017 and references therein), so you may consult them to decide whether to partition your data. Note that, for deep divergences, it is common to concatenate the sequences (e.g., Mahendrarajah et al., 2023). You may also want to read more about the basics of Computational Molecular Evolution (Yang, 2014) to better understand what types of analyses can be carried out, as well as the PAML documentation and PAML Wiki if you are to use PAML programs -- it is important to understand the format of the input files, the options that you need to specify in the control files, etc.
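To make the difference between the two layouts concrete, here is a minimal Python sketch that writes the same toy genes either as separate alignment blocks (one block per gene, so ndata equals the number of genes) or as a single concatenated block (ndata = 1). The function names and toy sequences are hypothetical and only for illustration. The sketch assumes a sequential PHYLIP-style layout with at least two spaces between the taxon name and the sequence, so please check the PAML documentation for the exact format your version expects.

```python
def phylip_block(aln):
    """Format one alignment (dict: taxon -> sequence) as a sequential
    PHYLIP-style block: a 'ntaxa nsites' header, then one line per taxon."""
    lengths = {len(seq) for seq in aln.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in a block must have the same length")
    nsites = lengths.pop()
    lines = [f"{len(aln)} {nsites}"]
    # Two spaces separate the name from the sequence (assumed PAML convention).
    lines += [f"{name}  {seq}" for name, seq in aln.items()]
    return "\n".join(lines) + "\n"


def concatenate(genes):
    """Join per-gene alignments end to end into one alignment."""
    taxa = list(genes[0])
    for gene in genes:
        if set(gene) != set(taxa):
            raise ValueError("every gene must contain the same taxa")
    return {t: "".join(gene[t] for gene in genes) for t in taxa}


def partitioned_file(genes):
    """One alignment block per gene, separated by a blank line (ndata = len(genes))."""
    return "\n".join(phylip_block(gene) for gene in genes)


def concatenated_file(genes):
    """A single alignment block holding all genes (ndata = 1)."""
    return phylip_block(concatenate(genes))


# Toy data: two hypothetical genes for three species.
genes = [
    {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACGG"},
    {"sp1": "TTGACA", "sp2": "TTGACC", "sp3": "TTAACA"},
]

partitioned = partitioned_file(genes)    # use with ndata = 2
concatenated = concatenated_file(genes)  # use with ndata = 1

print(partitioned)
print(concatenated)
```

With 300 single-copy orthologs, the partitioned layout would contain 300 blocks (ndata = 300), whereas the concatenated layout would contain one block whose length is the sum of the gene lengths (ndata = 1).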

I shall now close this issue as it is not related to any technical problems. If you have further non-technical questions, please post them in the PAML discussion group :)

All the best, Sandy