abacus-gene / paml

PAML is a program package for model fitting and phylogenetic tree reconstruction using DNA and protein sequence data. Please report only **technical issues** on this repository (e.g., compiling, programs abort or do not run at all, etc.). Problems with input data and general questions should be posted at https://groups.google.com/g/pamlsoftware?pli
GNU General Public License v3.0

Doing ASR with PAML, but all ancestral sequences I get have the same length? #20

Closed penglbio closed 1 year ago

penglbio commented 2 years ago

Dear Dr. Yang,

I don't know why all the ancestral sequences I get from paml4.6 have the same length, whether I use my own data or the example data shipped with paml4.6. My output follows. (I changed some sites of the example data stewart.aa to check whether dashes would appear in the ancestral sequences; the sites I changed are marked with [].)

List of extant and reconstructed sequences

10    130

Langur KIFERCELAR TLKKLGLDGY KGVSLANWVC LAKWESGYNT EATNYNPGDE STDYGIFQIN SRYWCNNGK[-] PGAVDACHIS CSALLQNNIA DAVACAKRVV SDPQGIRAWV AWRNHCQNKD VSQYVKGCGV
Baboon KIFERCELAR TLKRLGLDGY RGISLANWVC LAKWESDYNT QATNYNPGDQ STDYGIFQIN SHYWCNDGK[-] PGAVNACHIS CNALLQDNIT DAVACAKRVV SDPQGIRAWV AWRNHCQNRD VSQYVQGCGV
Human KVFERCELAR TLKRLGMDGY RGISLANWMC LAKWESGYNT RATNYNAGDR STDYGIFQIN SRYWCNDGK- PGAVNACHLS CSALLQDNIA DAVACAKRVV RDPQGIRAWV AWRNRCQNRD VRQYVQGCGV
Rat KTYERCEFAR TLKRNGMSGY YGVSLADWVC LAQHESNYNT QARNYDPGDQ STDYGIFQIN SRYWCNDGK- PRAKNACGIP CSALLQDDIT QAIQCAKRVV RDPQGIRAWV AWQRHCKNRD LSGYIRNCGV
Cow KVFERCELAR TLKKLGLDGY KGVSLANWLC LTKWESSYNT KATNYNPSSE STDYGIFQIN SKWWCNDGK- PNAVDGCHVS CSELMENDIA KAVACAKKIV SE-QGITAWV AWKSHCRDHD VSSYVEGCTL
Horse KVFSKCELAH KLKAQEMDGF GGYSLANWVC MAEYESNFNT RAFNGKNANG SSDYGLFQLN NKWWCKDNK- RSSSNACNIM CSKLLDENID DDISCAKRVV RDPKGMSAWK AWVKHCKDKD LSEYLASCNL
node #7 KVFERCELAR TLKRLGMDGY RGISLANWVC LAKWESNYNT QATNYNPGDQ STDYGIFQIN SRYWCNDGKL PGAVNACHIS CSALLQDNIA DAVACAKRVV RDPQGIRAWV AWRNHCQNRD VSQYVQGCGV
node #8 KVFERCELAR TLKRLGMDGY RGISLANWVC LAKWESGYNT QATNYNPGDQ STDYGIFQIN SRYWCNDGKL PGAVNACHIS CSALLQDNIA DAVACAKRVV RDPQGIRAWV AWRNHCQNRD VSQYVQGCGV
node #9 KIFERCELAR TLKRLGLDGY RGISLANWVC LAKWESGYNT QATNYNPGDQ STDYGIFQIN SRYWCNDGKL PGAVNACHIS CSALLQDNIA DAVACAKRVV SDPQGIRAWV AWRNHCQNRD VSQYVQGCGV
node #10 KVFERCELAR TLKRLGMDGY RGISLANWVC LAKWESNYNT QATNYNPGDE STDYGIFQIN SKWWCNDGKL PGAVNACHIS CSELLEDNIA DAVACAKRVV RDPQGITAWV AWRNHCQDRD VSQYVQGCGL

############################################################

The codeml.ctl file is as follows:

  seqfile = stewart.aa  * sequence data filename
 treefile = stewart.trees  * tree structure file name
  outfile = mlc  * main result file name

    noisy = 9  * 0,1,2,3,9: how much rubbish on the screen
  verbose = 1  * 0: concise; 1: detailed, 2: too much
  runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic
               * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

  seqtype = 2  * 1:codons; 2:AAs; 3:codons-->AAs
CodonFreq = 2  * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table

ziheng-yang commented 1 year ago

The ML programs in PAML, including baseml and codeml, do not deal with alignment gaps properly and treat them as ambiguities.
As a result, they try to reconstruct amino acids for internal nodes even at gapped sites, which is why every ancestral sequence has the full alignment length. You can delete columns that have alignment gaps in most sequences and keep columns in which most sequences have data. You can then try to guess whether the internal nodes have alignment gaps: given the tree, code each column with alignment gaps as 0-1 characters and use a parsimony argument to guess whether the ancestral nodes have gaps.
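The parsimony argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something PAML does for you: the function name `fitch_gap_states`, the tree topology in `TREE`, and the tie-breaking rule (prefer "residue present" when both states are equally parsimonious) are all hypothetical choices. Each column is coded as 1 (gap) or 0 (residue) at the tips, and Fitch parsimony assigns a 0/1 state to every internal node.

```python
# Hypothetical rooted binary tree as nested 2-tuples; leaves are sequence names.
# This topology is an illustrative assumption, NOT the tree in stewart.trees.
TREE = ((("Langur", "Baboon"), "Human"), ("Rat", ("Cow", "Horse")))


def fitch_gap_states(tree, leaf_state):
    """Assign 0 (residue) or 1 (gap) to each internal node by Fitch parsimony.

    tree: nested 2-tuples with unique leaf names (strings) at the tips.
    leaf_state: dict mapping leaf name -> 0 or 1 for ONE alignment column.
    Returns a dict mapping each internal node (a tuple) to its state.
    """
    state_sets = {}

    def down(node):
        # Post-order pass: intersection of the children's sets if non-empty,
        # otherwise their union.
        if isinstance(node, str):
            s = {leaf_state[node]}
        else:
            left, right = down(node[0]), down(node[1])
            s = (left & right) or (left | right)
        state_sets[node] = s
        return s

    assigned = {}

    def up(node, parent_state):
        # Pre-order pass: keep the parent's state when it is allowed here.
        s = state_sets[node]
        if parent_state in s:
            state = parent_state
        else:
            state = min(s)  # arbitrary tie-break toward 0 (an assumption)
        if not isinstance(node, str):
            assigned[node] = state
            up(node[0], state)
            up(node[1], state)

    down(tree)
    up(tree, None)
    return assigned


if __name__ == "__main__":
    # One column in which only Cow and Horse carry a gap.
    column = {"Langur": 0, "Baboon": 0, "Human": 0, "Rat": 0, "Cow": 1, "Horse": 1}
    for node, state in fitch_gap_states(TREE, column).items():
        print(node, "gap" if state else "residue")
```

Running this per column and masking the codeml reconstruction wherever the inferred state is 1 gives ancestral sequences with gaps. When both states are equally parsimonious the answer depends on the tie-break, so such sites deserve a manual look.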

At any rate, treatment of alignment gaps is improper in essentially all likelihood programs, including RAxML, MrBayes, etc. You can easily find descriptions of the problem. The PAML documentation, pamlDOC.pdf, has this text: "Note that alignment gaps are treated as missing data in baseml and codeml (if cleandata = 0). If cleandata = 1, all sites with ambiguity characters and alignment gaps are removed." You can also read the following section in my book, Yang 2014: 4.2.6 Missing data, sequence errors, and alignment gaps.
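For pre-filtering an alignment by hand, a small Python sketch along these lines may help. It is an assumption-laden illustration rather than PAML's own code: the function name `drop_gapped_columns`, the `max_gap_fraction` parameter, and the choice of `-` and `?` as the gap/missing characters are all hypothetical. With `max_gap_fraction=0.0` it removes every column that contains any such character, similar in spirit to what the documentation describes for cleandata = 1; with `0.5` it removes only columns where more than half of the sequences have a gap.

```python
def drop_gapped_columns(alignment, max_gap_fraction=0.0, gap_chars="-?"):
    """Return a copy of the alignment without heavily gapped columns.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    max_gap_fraction: a column is kept only if the fraction of sequences
        with a gap character in it is <= this value.
    gap_chars: characters counted as gaps/missing data (an assumption here).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must all have the same aligned length")

    kept_columns = []
    for column in zip(*seqs):
        gap_count = sum(ch in gap_chars for ch in column)
        if gap_count / len(column) <= max_gap_fraction:
            kept_columns.append(column)

    # Rebuild each sequence from the columns that survived the filter.
    return {
        name: "".join(column[i] for column in kept_columns)
        for i, name in enumerate(names)
    }


if __name__ == "__main__":
    toy = {"a": "AC-T", "b": "A--T", "c": "ACGT"}
    print(drop_gapped_columns(toy))                        # strict: any gap removes the column
    print(drop_gapped_columns(toy, max_gap_fraction=0.5))  # lenient: majority rule
```

The strict setting shortens every sequence by the same amount, so the reconstructed ancestors will still all share one length; only the parsimony step on gap presence can give ancestors of differing lengths.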

ziheng