abacus-gene / paml

PAML is a program package for model fitting and phylogenetic tree reconstruction using DNA and protein sequence data. Please report only **technical issues** on this repository (e.g., compiling, programs abort or do not run at all, etc.). Problems with input data and general questions should be posted at https://groups.google.com/g/pamlsoftware?pli
GNU General Public License v3.0
103 stars 19 forks source link

Inconsistent LRT results for site models after altering parameters #34

Closed Clementine0702 closed 1 year ago

Clementine0702 commented 1 year ago

Hello PAML Community,

I am currently working with an alignment of 54 primate sequences, using codeml Nsite models 0, 1, 2, 3, 4, 7, and 8. Since I recently upgraded from PAML version 4.10.5 to 4.10.7, I noticed that in the control file downloaded with the package (v. 4.10.7), the three parameters "cleandata," "fix_blength," and "method" are ignored with asterisks. We want to use “cleandata=1” (because there are a couple of places where there is missing data for a few sequences) and “fix_blength=0” (because our tree topologies are reliable but our branch lengths are not). We don’t have any preference with regard to “method”. We were unsure what the default setting where when the three parameters are preceded by * and thus ignored so we ran some tests. We were also aiming to check the convergence with the previous runs (v. 10.4.5), so we modified the three parameters and found that changing them made significant differences in the results obtained from the LRT test.

Table 1: LRT results of tests with different settings. 


Runs #
Version of model Settings for cleandata, fix_blength, and method LRT of M1 vs M2 LRT of M7 vs M8
1 4.10.7 Linux cleandata = 1 fix_blength = 0  method = 1 insignificant(LR=-0.06244) insignificant(LR= -0.73016)
2 4.10.7 Linux cleandata = 1 fix_blength = 0  *method = 1 significant(LR =10.185022) significant(LR=20.272906)
3 4.10.7 Linux cleandata = 1 fix_blength = 0  method = 0 insignificant(LR = 5.331764) significant(LR=12.917131999998674)

*The other parameters are all kept the same for the three runs, except for cleandata, fix_blength, and method. As an example, the control file for run 1 is attached below, and the aligned sequences and tree file we used are attached in the zip file, along with the control files for all three runs.

Figure 1: The control file of run 1

Screenshot 2023-07-26 at 5 13 56 PM

Questions:

Since the parameter method = 1 specifies a newer algorithm than method=0, we thought it made sense to set it to 1. However, changing it from 0 to 1 (runs 1 and 3, above) changes one of the LR values significantly, which concerns me. What sorts of issues with the data, or real phenomena, could lead to method = 0/1 making such a big difference? What other tests could I run to gain more confidence in the results? Simply ignoring the three parameters (as in the control file that comes with the package) also changes the LRT from insignificant (if we use the parameters we thought we choose to use, run 1) to significant. Given our test runs above, this seems to be due to “method=1” vs “method=0”, so this question is probably related to the previous one: in such situations, where a simple change in parameters leads to drastic changes in the results, how should one proceed? Note: we have also run different methods that estimate dN/dS, such as the FEL and MEME methods from hyphy, and got significant results and some sites with evidence for dN/dS>1. We are aware that those models/methods have many differences, but perhaps this helps in solving this mystery.

We are very grateful for any insights or suggestions that anyone has. Best regards paml_files.zip

ziheng-yang commented 1 year ago

The optimisation algorithm in codeml is not safe, so you need to arrange multiple runs to confirm convergence of the algorithm. if you have thousands of genes, you can run the same analysis multiple times and then use the set of results corresponding to the highest lnL.

i suspect that some of the differences you describe may be due to problems with the optimisation algorithm. in other words running the same analysis multiple times using the same version may produce different results.

the values of the control variables "cleandata," "fix_blength," and "method" are in a particular control file may be in random states. there are no default or recommended values, as these depend on your datasets and analyses. here are some notes about what they do. you can look at the documentation pamlDOC.pdf for more details. also you may find some past posts that discuss those issues.

cleandata = 0 versus 1: this changes the data and may well affect the results.

method = 0 or 1, fix_branch=0 or 1. these only affect the optimisation algorithm, or the initial values, and should in theory lead to the same results. if the datasets are not very large, method = 0 may be preferable.

best wishes, ziheng