Closed penglbio closed 1 year ago
the ML prorgams in paml, including baseml and codeml, do not deal with alignment gaps properly and treat them as ambiguities.
as a result, they try to reconstruct amino acids for internal nodes. you can delete some columns if they have alignment gaps in most sequences, and keep columns in which most sequences have data. you can then try to guess whether the internal nodes have alignment gaps. given the tree, you can code the column with alignment gaps as 0-1 characters and use a parsimony argument to guess whether the ancestral nodes have gaps.
At any rate, treatment of alignment gaps is improper in essentially all likelihood programs, including raxml, mrbayes, etc. you can easily find descriptions of the problem. the paml doc pamlDOC.pdf has this text: "Note that alignment gaps are treated as missing data in baseml and codeml (if cleandata = 1). If cleandata = 1, all sites with ambiguity characters and alignment gaps are removed." you can also read the following section in my book, Yang 2014: 4.2.6 Missing data, sequence errors, and alignment gaps
ziheng
Dear Doc. Yang,
I don't know why all ancestral sequences I got by paml4.6 have same length. No matter I use my data or example data in the paml4.6. Following is my output(I changed some sites of the example data stewart.aa to check if there was dash in ancestral sequences, I marked the sites I changed with []):
List of extant and reconstructed sequences
Langur KIFERCELAR TLKKLGLDGY KGVSLANWVC LAKWESGYNT EATNYNPGDE STDYGIFQIN SRYWCNNGK[-] PGAVDACHIS CSALLQNNIA DAVACAKRVV SDPQGIRAWV AWRNHCQNKD VSQYVKGCGV Baboon KIFERCELAR TLKRLGLDGY RGISLANWVC LAKWESDYNT QATNYNPGDQ STDYGIFQIN SHYWCNDGK[-] PGAVNACHIS CNALLQDNIT DAVACAKRVV SDPQGIRAWV AWRNHCQNRD VSQYVQGCGV Human KVFERCELAR TLKRLGMDGY RGISLANWMC LAKWESGYNT RATNYNAGDR STDYGIFQIN SRYWCNDGK- PGAVNACHLS CSALLQDNIA DAVACAKRVV RDPQGIRAWV AWRNRCQNRD VRQYVQGCGV Rat KTYERCEFAR TLKRNGMSGY YGVSLADWVC LAQHESNYNT QARNYDPGDQ STDYGIFQIN SRYWCNDGK- PRAKNACGIP CSALLQDDIT QAIQCAKRVV RDPQGIRAWV AWQRHCKNRD LSGYIRNCGV Cow KVFERCELAR TLKKLGLDGY KGVSLANWLC LTKWESSYNT KATNYNPSSE STDYGIFQIN SKWWCNDGK- PNAVDGCHVS CSELMENDIA KAVACAKKIV SE-QGITAWV AWKSHCRDHD VSSYVEGCTL Horse KVFSKCELAH KLKAQEMDGF GGYSLANWVC MAEYESNFNT RAFNGKNANG SSDYGLFQLN NKWWCKDNK- RSSSNACNIM CSKLLDENID DDISCAKRVV RDPKGMSAWK AWVKHCKDKD LSEYLASCNL node #7 KVFERCELAR TLKRLGMDGY RGISLANWVC LAKWESNYNT QATNYNPGDQ STDYGIFQIN SRYWCNDGKL PGAVNACHIS CSALLQDNIA DAVACAKRVV RDPQGIRAWV AWRNHCQNRD VSQYVQGCGV node #8 KVFERCELAR TLKRLGMDGY RGISLANWVC LAKWESGYNT QATNYNPGDQ STDYGIFQIN SRYWCNDGKL PGAVNACHIS CSALLQDNIA DAVACAKRVV RDPQGIRAWV AWRNHCQNRD VSQYVQGCGV node #9 KIFERCELAR TLKRLGLDGY RGISLANWVC LAKWESGYNT QATNYNPGDQ STDYGIFQIN SRYWCNDGKL PGAVNACHIS CSALLQDNIA DAVACAKRVV SDPQGIRAWV AWRNHCQNRD VSQYVQGCGV node #10 KVFERCELAR TLKRLGMDGY RGISLANWVC LAKWESNYNT QATNYNPGDE STDYGIFQIN SKWWCNDGKL PGAVNACHIS CSELLEDNIA DAVACAKRVV RDPQGITAWV AWRNHCQDRD VSQYVQGCGL
############################################################ codeml.ctl file like following: seqfile = stewart.aa sequence data filename treefile = stewart.trees tree structure file name outfile = mlc * main result file name
ndata = 10 clock = 0 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis aaDist = 0 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a aaRatefile = ../dat/jones.dat * only used for aa seqs with model=empirical(_F)
dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own
model = 2
models for AAs or codon-translated AAs:
NSsites = 0 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
13:3normal>0
icode = 0 * 0:universal code; 1:mammalian mt; 2-10:see below Mgene = 0
fix_kappa = 0 1: kappa fixed, 0: kappa to be estimated kappa = 2 initial or fixed kappa fix_omega = 0 1: omega or omega_1 fixed, 0: estimate omega = .4 initial or fixed omega, for codons or codon-based AAs
fix_alpha = 1 0: estimate gamma shape parameter; 1: fix it at alpha alpha = 0. initial or fixed alpha, 0:infinity (constant rate) Malpha = 0 different alphas for genes ncatG = 8 # of categories in dG of NSsites models
RateAncestor = 1 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
Small_Diff = .5e-6 cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
fix_blength = -1 0: ignore, -1: random, 1: initial, 2: fixed method = 0 Optimization method 0: simultaneous; 1: one branch a time
Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt.,
4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt.,
7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt.,
10: blepharisma nu.
These codes correspond to transl_table 1 to 11 of GENEBANK.