model 8 - newer versions don't always converge - choosing extreme dN/dS instead

jayoung commented 1 year ago

hi there,

I recently upgraded from a quite old version (4.9a) to version 4.10.6.

I noticed that for some of our 'favorite' genes, the newer PAML version does NOT give robust evidence for sitewise positive selection (M8 versus M8a or M7, also M2 versus M1) when old PAML did. One such gene is MxA where we've got good experimental evidence for functional differences at the sites PAML picks out with BEB (nice!).

I think I have a small clue about what's going on. Bottom line is that the older version is better at converging on something like this for the positively selected class in M8 and M2 - I'll call this result A:

(p1 =   0.02162) w =   4.94585

but the newer version sometimes (not always) chooses something like this instead - I'll call this result B:

(p1 =   0.00082) w = 999.00000

From a biology standpoint, result A makes more sense - result B is too extreme to make practical sense (p1 often works out to less than one codon). Also from a numerical standpoint, result A seems better supported:

Result B (newer version) has a much worse ML (no better than M7 or M8a) than result A (significant LRT against with M8-8a, M8-7 and M2-1).
- When I see result A, it's usually robust to using different codon models (2 or 3) as well as different starting omega (0.4 or 3). When I see result B, it's less robust - sometimes changing codon model actually rescues it to result A (but because the finding changes with different starting parameters, we would have thrown out that result).

I'm seeing result B more often with version 4.10.6 than 4.9a. I've also tested versions 4.9g, 4.9h, 4.9j - I think they also come up with result B more often.

Did something change between 4.9a and 4.9g about the constraints on estimating p1 and omega of the positively selected class?

I think for our lab we'll stick to v4.9a for now, but I know that's getting old.

I'll attach a couple of alignments that demonstrate the behavior. Hope I'm making some sense here - happy to chat if I'm not.

thanks!

Janet

Dr. Janet Young

Malik lab http://research.fhcrc.org/malik/en.html

Division of Basic Sciences Fred Hutchinson Cancer Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512 email: jayoung ...at... fredhutch.org

jayoung commented 1 year ago

zz_exampleFiles.zip

ziheng-yang commented 1 year ago

sorry for being slow to get to this. you didn't include the control files in the zip file. could you send these to me. check before you send to make sure that the program runs and you have included all files. thanks, ziheng

ziheng-yang commented 1 year ago

There has been no change to the optimization routine, so i think it is just a matter of the overall difficulty of the optimisation routine, and the choice of some parameters/settings, rather than a systematic difference between versions. you can check and change some of the following control variables. SmallDiff = 1e-8 # use a valle between 1e-6 and 1e-9 method = 1 or 0

i tried version 4.10.6 on a windows laptop and on a linux server for the ACE2_primates_aln1_NT.fa.phy data. here are the results. i used NSsites= 0 1 2 7 8, but coopy results for M2 and M8 only.

Model 2: PositiveSelection (3 categories) lnL(ntime: 45 np: 50): -5828.826425 +0.000000 MLEs of dN/dS (w) for site classes (K=3) p: 0.63010 0.34142 0.02848 w: 0.00000 1.00000 4.99463

Model 8: beta&w>1 (11 categories) lnL(ntime: 45 np: 50): -5828.919316 +0.000000 p0 = 0.96728 p = 0.00584 q = 0.01095 (p1 = 0.03272) w = 4.69779

Model 8: beta&w>1 (11 categories) lnL(ntime: 45 np: 50): -5828.919319 +0.000000 Parameters in M8 (beta&w>1): p0 = 0.96727 p = 0.01341 q = 0.02542 (p1 = 0.03273) w = 4.69718

best, ziheng

jayoung commented 1 year ago

hi, Ziheng,

Thanks, for looking at this. No need to apologize for slowness: no I need to apologize for my own slow follow up (vacation last week).

I am now attaching a much more complete zip file with complete input and output files. I reran using Small_Diff = 1e-8 as you suggested, and I'm still seeing the same thing. The Excel spreadsheet summarizes all the results - hopefully you are able to open that (if not let me know).

On your system (in version 4.10.6), do you find the M8 result is robust to changing the codon model and/or starting omega? On mine you will see it is not.

I mostly run codeml on our linux cluster (either inside or outside a container) but I get the same issue on my Mac. It's the mac results I'm sending here, but I saw similar results on the linux machine.

I built the codeml executables as follows (in both cases there were warnings but no obvious errors):

v4.9a downloaded this: http://abacus.gene.ucl.ac.uk/software/SoftOld/paml4.9a.tgz unpacked cd src make

v4.10.6 downloaded this: https://github.com/abacus-gene/paml/archive/refs/tags/v4.10.6.tar.gz unpacked cd src make

I've only included results for ACE2 for now to keep it simple. For the Mx1 alignment I sent before, (relative to ACE2) codeml much more often converges on the "useful" solution that has much lower lnL (e.g. p1 = 0.02 w = 6.1, rather than p1=0.00001 w=999), and that solution is robust to changing starting omega and codon model. But I did notice for Mx1 that the computer system I run it on (and whether I compile with gcc or cc) can occasionally flip results from 'useful' type to the non-useful type.

thanks again,

Janet

testPAMLversions_2023_Feb27_filesToShare.zip

ziheng-yang commented 1 year ago

hi janet, i took a look at the differences between those two versions. i looked at model M2 but assume that the problem is the same for other models. i noticed that the initial values (the initial branch lengths for the tree) are different between the two versions you mentioned, and v4.10.6 produced branch lengths that are too small, so that the ML iteration starts from a poor place far away from the peak. you can see the initial values used in the screen output after the text : "initial w for M2:NSpselection reset."

In codeml.c, there is a line of code int getdistance = 0, i, j, k, idata, nc, nUVR, cleandata0; if you change the number from 0 to 1, codeml will generate better branch lengths as initial values. i think this is the major difference between the two versions.

alternatively, if you go to the paml github site, and click on releases or go to the following link: https://github.com/abacus-gene/paml/releases there is a version for mac. this should work better. it seems that i updated the code but didn't change the version number. let me know if this makes a difference. best wishes, ziheng

jayoung commented 1 year ago

hi Ziheng,

thanks for spending time on this. Yes! The getdistance starting value seems to be the key change. I now get the results I would expect for ACE2 (M2 and M8, all four combinations of my starting omega and codon model choices, running on the Mac for now).

I'll check my other test alignments too, as well as checking on my linux server (inside and outside a singularity container)

I just noticed that the tar.gz and zip source files listed for the 4.10.6 release are a lot older (Sept 24, 2021) than the pre-compiled executables (Dec 17 2022). That is definitely part of the problem - the pre-compiled binaries didn't work for me, so I compiled from source I'd downloaded from the release 4.10.6 area without noticing that the tarball was out of date.

Maybe I should just be compiling from the current master src on git, but I do like the idea of using a stable release, if I can think of it that way?

Janet

jayoung commented 1 year ago

Finished my checks on more alignments and on more different computers, and yes, now things look right.

Just to summarize: compiled executables made from the tar.gz and zip archives for the 4.10.6 release are behaving in the weird way I described at the top of this issue (unless I fix the int getdistance line), and the tar.gz and zip archives are older than they should be: I don't think they really contain a true v4.10.6.

Executables I compiled from the current git archive (commit af30c37) are behaving correctly.

By the way, in order to successfully compile the actual current version within my docker container, I had to remove -Wmemset-elt-size from the CFLAGS in the Makefile (because the docker container uses an older version of Ubuntu that doesn't recognize that flag. I use an older Ubuntu for reasons to do with other tools I'm using alongside PAML). If I understand right, removing that flag shouldn't make any difference, as -Wall should dominate (and even if it does make a difference, it simply gives me more warnings during compilation, rather than changing the executable at all). Does that make sense? If I shouldn't be messing with the CFLAGS please let me know.

thanks again!

Janet

abacus-gene / paml

model 8 - newer versions don't always converge - choosing extreme dN/dS instead #27