ddarriba / modeltest

Best-fit model selection
GNU General Public License v3.0

Cannot parse the msa: Wrong sites read #11

Status: Open. Opened by RvV1979 6 years ago.

RvV1979 commented 6 years ago

Hi Diego,

I am testing modeltest-ng on some of my amino acid (aa) alignments in FASTA format, and so far I always get a "Cannot parse the msa: Wrong sites read" error. For example,

$ modeltest-ng -d aa -i ../Alignments/OG0010000.fa -T raxml -p 6

produces

modeltest: Cannot parse the msa: ../Alignments/OG0010000.fa
           [10200]: Wrong sites read [Seq 1] 1005 vs 1006
Error: Invalid arguments
Try `modeltest --help` for more information

What may be going wrong here?

Thanks

OG0010000.fa:

>Egrandis_Eucgr.J01883.1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------------MGVSYRPHPRRPSNLLPCLIALSLFSISALLLYKVDDFASQTK
TVAGHNLDPTPWHLFPPKTFNEKTRYARASKIIQCSYLTCPYA-----TGSIRGQDLSRS
RSA-----RACPAFFAWIRRDLEPWVRTGISPAHLMEAKRFASFRVVIFEGKLYVDFYYA
CVQSRAMFTIWGLLQLLRRYPGMVPDVDLMFDCMDKPSINRTEHASMPLPLFRYCTTPGH
FDIPFPDWSFWGWPETNLKPWDEEFRDIKQGSQVLRWSKKSPYAYWKGNPDVESPVRTEL
LKCNHSRMWNAQVMRQDWAEEARAGYEQSKLSNQCNHRYKIYAEGYAWSVSLKYIIACGS
PALIISPEYEDFFSRGLFPMRNYWPISSTNLCPSIKYAVNWGNANPSEAEAIGKRGQDFM
EDLSMDRIYDYMYHLIMEYSKLQNFKPIPSSSAREVCVDSLLCFADP-KQRQFLERSTAL
ASEEAPCTFKAARGITVTSWIKQKEKAIQDVRKKEMLSAERQ*---
>Athaliana_AT1G07220.1
M-----------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------GLRLR-------------------
-----------------LRLPHKSSPRSPSYLLLCVLALSFFSFTALLFYKVDDFIAQTK
TLAGHNLEPTPWHIFPRKSFSAATKHSQAYRILQCSYFSCPYK-----AVVQPKSLHSES
GSGRQTHQPQCPDFFRWIHRDLEPWAKTGVTKEHVKRAKANAAFRVVILSGKLYVDLYYA
CVQSRMMFTIWGILQLLTKYPGMVPDVDMMFDCMDKPIINQTEYQSFPVPLFRYCTNEAH
LDIPFPDWSFWGWSETNLRPWEEEFGDIKQGSRRRSWYNKQPRAYWKGNPDVVSPIRLEL
MKCNHSRLWGAQIMRQDWAEEAKGGFEQSKLSNQCNHRYKIYAEGYAWSVSLKYILSCGS
MTLIISPEYEDFFSRGLLPKENYWPISPTDLCRSIKYAVDWGNSNPSEAETIGKRGQGYM
ESLSMNRVYDYMFHLITEYSKLQKFKPEKPASANEVCAGSLLCIAEQ-KERELLERSRVV
PSLDQPCKFPVEDRNRLEWLIQQKNKTIENVRYMEMTRTQRGSK--
>Pan_PanWU01x14_asm01_ann01_134260.1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------------MRLGTRNMRRSPSYLLSCVAALAFLSLTVLVLHKVDDFATQTK
TVVGHNLQPTPWHLFPPKTFSEETRQARVYKILHCSYLACSGY-----TTKSYATERRRS
ASA-DAASQKCPEFFKWIRRDFEPWSRAGISKSHLREAQEFAAFRVVIVGGRLFVDLYYA
CVQSRAMFTIWGLLQLLRRYPGMLPDVDMMFDCMDKPSINQTEHGSMPLPLFRYCTTKAH
FDIPFPDWSFWGWPETNLNPWDEEFRDIKHGSQKMNWTRRWPRAFWKGNPDVGSPVRTEL
LNCNHSRKWGAHILRQDWSKEAKEGYEKSKLSNQCNYRYKIYAEGYAWSVSLKYILSCGS
LALIISPQYEDFFSRGLIPKKNYWPISFSDLCPSIKYAVDWGNAHPSEATAIGKAGQDFM
ASLSMNRVYDYMFHLIKEYSKLQHFKPVRPSSALEVCSESLLCLADE-KQRQLLEKSIAY
PSPSPPCFL*------------------------------------
>Gmax_Glyma.11G174900.1
M--------------CHSTPQ---------------------------------------
---------------------------------PQCT-----------------------
------------------------------------------------------------
------------------------------------------------------------
KTTLF-------------------------------------------------------
----------LLLLL---------------------------------------------
----------------------------------------------------HHSQFFSS
---------------------------------------SS-------------------
----------------NMGPSSTHTPRSPTYLIPCVIALALFSLTGLLLYKVDDVASRTG
TVVGHNLEPTPWHVFPHKPFDEESRQQRAYKILQCSYLTCRYA-----AEALGGARRRT-
----GGGREECPKFFRAIHRDLAPWSESRISKAHVAAAQRYAAFRVVIVEGKVFVDWYYA
CVQSRAMFTLWGLLQLMRRYPGMVPDVDMMFDCMDKPSVNKTEHQAMPLPLFRYCTTKEH
FDIPFPDWSFWGWSEINIRPWQEEFPDIKRGSRSVTWKNKLPWAYWKGNPDVASPIRTEL
INCNDSRKWGAEIMRQDWGEAARNGFKQSKLSDQCNHRYKIYAEGYAWSVSLKYILSCGS
VALIISPQYEDFFSRGLIPNHNFWLVDPLNLCPSIKYAVEWGNQHPVEAEAIGKRGQDLM
ESLNMNRIYEYMFHLISDYSKLQDFKPTPPPTALEVCVESVLCFADE-KQRMFLNKSFTF
PSHKPPCNLKPA*---------------------------------
>Tor_TorRG33x02_asm01_ann01_187660.1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------------MGLGTRNMPRSPSYLLPCVAALAFLSLTVLVLHKVDDFATQTK
TVVGHNLQPTPWHLFPPKTFSEETRQARVYKILHCSYLACSGY-----TTKSYATERRRS
ASA-DAASQKCPEFFKWIRRDLEPWSRTGISKSHLREAREFAAFRVVIVGGRLFVDLYYA
CVQSRAMFTIWGLLQLLRRYPGMVPDVDIMFDCMDKPSINRTEHGSMPLPLFRYCTTKAH
FDIPFPDWSFWGWPETNLNPWDEEFRDIKHGSQKMNWTRRWPRAFWKGNPDVGSPVRTEL
LNCNHSRKWGAQILRQDWSKEAKEGYEKSKLSNQCNYRYKIYAEGYAWSVSLKYILSCGS
LALIISPQYEDFFSRGLIPKKNYWPISFSDLCPSIKYAVDWGNAHPSEATAIGKGGQDFM
ASLSMNRVYDYMFHLIKEYSKLQHFKPVRPSSALEVCSESLLCLADE-KQRQLLEKSIAH
PSPTPPCFL*------------------------------------
>Ptrichocarpa_Potri.001G252000.1
M-----------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-AAPL-------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------SR-------------------
---------------------NKAPARLSSSPLLWIIALA--SLTVFFLYKVDNLALQTK
TVAGHNLPPTPWHLFPPKNFDDQSRHARAYQILHCSYLTCPYS-----NTTVSKGHGFNS
----PSSSPKCPRLFMFIHHDLEPWAQSRITVDHIMGAKNYASFRVVIYKGRLYLDPYYA
CVQSRMMFTIWGFLQLLKRYPGMVPDVDIMFDCMDKPSINKTEHDSFPLPLFRYCTTKDH
FDIPFPDWSFWGWPEVNIRPWDEEFRDIKRGAQARSWPKKWPRAYWKGNPDVGSPTRTSL
LECNHTKKWGAQIMRQDWEEEAKGGYVSSKLSHQCDYRYKIYAEGFAWSVSLKYIISCGS
LALIISPQYEDFFSRGLIPEKNYWPVSSDGLCQSIKFAVDWGNTNPTEAQKIGKAGQDLM
ESLSMDRVYDYMFHLISEYSKLQDFKPVPPSSALEVCVDSLTCFADE-KQKRFFERATAF
PSPSPPCTLQPANSDFIKSWMQQKQRTITNVREMELKA*-------
>Fvesca_Fvesca.gene07900
MEDPETDRNHHESVPSSSSPSTSQTQEQTEQQEQQRRSTSTFSYRVNVLISDVAAFDMKD
DFWSGFIVLVTFWFFASMTMILGFFGSADLQLGPNCSRLIRTNPFFVQTIKLLNVHLTDL
VFKQAQEIEGTKPGSILYGFYIPPPLDDEIAWTETHHTLEWIYFLNKGSRIDIYYHVKPS
SSSPLTLVIAQGRESLVEWIEDPSYGSGKIHQVISESSSYYIAVGNLNPEDVEADLEFNI
KSTLYNTSQAYYRCSIHNQYCGLELSLLGTTSAVLTSPGPKEGIQEDDWYVKVSYGPRWL
IYFLGSGVMTVLLLMAFRFCSIFQNAGEDRTDFQGREASPVRTPLLLPKDDDALSWGSSY
ECISNDEEDFEEFLGVSSLEGKSINGGDDNNTLRLCVICFDGPRDCFFLPCGHCATCFTC
GTRISEEAGTCPICRMKMKKGPKAVYLESNRKSEISGIQSRIMSYRIPTRVAHLCGPPNA
ATNSITFFFIFIIFIIVMGTLPRTTSRTPSYILRSVIALSCLSL-LFIIYKVNDFASQTK
TVAGHNLDPTPWHPFRPKTFNHDALPSRAYKLLQCSYLACRYTNNKGNATTLEDEDHRRS
ATS-VAKSPQCPEVFRWIHQDLEPWATTKITPVHIEQAKRYAAFRVVIYKGKLYVDLYYA
CVQSRAMFTIWGLLQMLKRYPGRVPDVDFMFDCMDKPTIERSEHASMPLPLFRYCTNDAH
YDIPFPDWSFWGWPETNIEPWDEQFQAIKRGGQETTWRKKELLAYWKGNPDVGSPVRKEL
LNCNDSKTYRAQILKQDWDEEAKAGFQQSSLSKQCNHRYKIYAEGYAWSVSLKYILSCGS
LALVITPQYEDFFSRGLFPKVNYWPISANAICPSIKSAVDWGHGHVPEAQAIGRRGQEYM
ENLNMDRVYDYMFHLITEYSKLQNFKPVPPNTAMEVCANSVLCLADDAKQREYLERTRAK
PFPGPPCDLQPPDGNLINKWKKEKAAVFQTVEDMNLVNDIPEPVP*
>Mtruncatula_Medtr3g464000.1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------------MAHSSKHTPRTPTYLLPCVIALSFFSLTALLLYKVDDVASRTG
TVVGHNLEPTPWHVFPTKSFDEETRQSRAYKIIQCSYLTCGSADS---GGKTGPSFAAK-
----DAKQQNCPDFFRAIRKDLEPWKKTKISKGHLVEAQKYAAFRVVIVGGKLFVDWYYA
CVQSRAMFTVWSLLQLLRRYPGLVPDVDLMFDCMDKPSINKTEHASMPLPLFRYCTTKGH
FDIPFPDWSFWGWPEINIRPWQEEFPDIKQGAQVVSWKNKNPLAYWKGNPDVASPLRTEL
LTCNDSMKWGAEIMRQDWDAAARSGFQESKLSKQCNHRYKIYAEGYAWSVSLKYILSCGS
VALIIRPQYEDFFTRGLVPLQNFLPVDPLDLCPSIKRNVDWGNKHPKEAAALGKRGQDYM
ESLNIDRIYDYMFHLISEYSKLLDFKPALPSTALVVCEESVLCFADE-KQRSFLSRSTVS
PSQTPPCTLKPA*---------------------------------
>Gmax_Glyma.18G062600.1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------------MGPSSKHTPRSPTYLIPCVIALALFSLTGLLLYKVDDVASRTG
TVVGHNLEPTPWHVFPHKPFDEESRQQRTYKILQCSYLTCRYA-----AGAVGGSRRSF-
----AGGREECPEFFRAIHRDLAPWLESRISKAHVAAAQRYAAFRVVIVEGKVFVDWYYA
CVQSRAMFTLWGLLQLMRRYPGKVPDVDMMFDCMDKPSVNRTEHQAMPLPLFRYCTTKEH
FDIPFPDWSFWGWSEINIRPWQEEFPDIKQGSRNVSWKNKFPWAYWKGNPDVASPIRTEL
INCNDSRKWGAEIMRQDWGEAARSGFKQSKLSNQCNHRYKIYAEGYAWSVSLKYILSCGS
VALIISPQYEDFFSRGLIPNHNFWLVDSLNLCPSIKYAVEWGNQHPVEAEAIGKRGQDFM
GSLNMDRIYEYMFHLISEYSKLQDFKPTPPTTALEVCVESVLCFADE-KQRMFLNKSTAF
PSHKPPCNLKPA*---------------------------------
ddarriba commented 6 years ago

Hi RvV1979,

The error happens because ModelTest ignores the '*' (star) character, which occurs once in every sequence except the second one (i.e., [Seq 1], given that the first one is [Seq 0]). Therefore, it counts one character fewer (1005) for every other sequence, while for Seq 1 it counts all 1006.

Is there any particular reason for using that character?

Best, Diego.
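
The length mismatch described above can be reproduced outside ModelTest with a few lines of Python. This is only a minimal sketch, not ModelTest's or libpll's actual parser: the function names `read_fasta`, `lengths_after_strip`, and `mismatched` are hypothetical, and the only behaviour assumed is the one stated in the comment (the '*' character is dropped before sites are counted).

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped lines."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


def lengths_after_strip(records: Dict[str, str], ignored: str = "*") -> Dict[str, int]:
    """Count the sites left in each sequence once ignored characters are dropped."""
    return {
        header: sum(ch not in ignored for ch in seq)
        for header, seq in records.items()
    }


def mismatched(records: Dict[str, str], ignored: str = "*") -> List[Tuple[int, str, int]]:
    """Return (index, header, length) for sequences whose stripped length
    differs from that of the first sequence (index 0)."""
    lengths = list(lengths_after_strip(records, ignored).items())
    if not lengths:
        return []
    expected = lengths[0][1]
    return [(i, h, n) for i, (h, n) in enumerate(lengths) if n != expected]


# Toy alignment: every row is 5 columns wide, but only 'b' lacks a '*'.
toy = """>a
MK-L*
>b
MKQL-
>c
M-QL*
""".splitlines()

print(mismatched(read_fasta(toy)))  # [(1, 'b', 5)]
```

On the toy input, the sequence at index 1 is the one reported, which mirrors the "[Seq 1] 1005 vs 1006" message: the alignment itself is rectangular, and the mismatch only appears after the stars are removed.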

RvV1979 commented 6 years ago

Hi Diego,

Thanks for the clarification. The star stands for a stop codon (the translation of UAA, UAG, or UGA). Stars are quite common in protein alignments, so it would be great if modeltest could accommodate them appropriately.

Best, Robin

ddarriba commented 6 years ago

You're right, thanks!

Actually, I figured out that libpll supports the '*' character in amino acid sequences, but it is stripped out in the FASTA parser. Nonetheless, in our analyses it is treated as a gap/missing character, so it could also be replaced by '-' or '?'.

I fixed that.
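
For anyone on a build that predates the fix, the replacement suggested above can be done as a preprocessing step. The following is a hedged sketch rather than anything shipped with ModelTest: `mask_stop_codons` is a hypothetical helper, and it assumes that treating '*' as a gap ('-') is acceptable for the analysis, as the comment above indicates.

```python
def mask_stop_codons(fasta_text: str, replacement: str = "-") -> str:
    """Replace '*' in sequence lines with `replacement` ('-' or '?').

    Header lines (starting with '>') are left untouched, so a '*' that
    happens to appear in a sequence name is preserved.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.replace("*", replacement))
    return "\n".join(out) + "\n" if out else ""


example = ">seq_with_stop\nMKQL*--\n>seq_without_stop\nMKQLA--\n"
print(mask_stop_codons(example), end="")
```

Because each '*' is swapped for exactly one character, every sequence keeps its original width and the alignment stays rectangular. To apply it to a real file, read the file's contents, pass them through the function, and write the result to a new path before running modeltest-ng on that copy.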