Gaius-Augustus / learnMSA

Learning and Aligning Large Protein Families (MSA-HMM)
MIT License

What's the format of the output MSA, fasta, a3m or a2m? #14

Open permia opened 10 hours ago

permia commented 10 hours ago

I aligned ~3k RdRP domain sequences from RNA viruses. The output looks like the following:

>1
...................................................................................V--------...............................................................................-..................-PFSVWNRRFPAAQ.-QRIHEL......................................---...TYRETCGGYPE..EKDIRIKTFLKIEDTIQGIDSDMKVK................................................................-AA-RVISGMTPH-..VNVAMGPECLGI--.........---........................................................................................AKTLVKAFDGSDKICYTAGWSAE..AISEVLFGKR.....................RQT..DWS......G...DEELDQ..LHLDSIYLEQRN...........................................................................................................-...ARD...........KIF..Q.....MY........RFGLETDCS..VWDGSITIPLLEFEQWVFQSW...GHQ........................................................................................................----S......................................................................KRFVYS..IDSCECLAVLPTVSNIE..RMA................................................-QD....YQVWQRRQA..........................................-RIPCLIH..............SYMLS...-........SLKNLESLNTSLS--H---wv.................................T--------..........-.-TLPLYQMTHDYCD-CQLSQNWPTLDLGPSQFSTQAYLQKYPSVQHISGQQQMDPSLGQNP.-SE-SWPNLASHI-.QTCGVCLSMLLESL...---KACSTIPTTSLSFTKSYIGSCVMSIRKLMLQLSLKStDHTSQRDMSARRIRMHFSLTFT.......E.PHS.KSSCQDSNRPCHITNCHVSLPTICSTTASVLTCCKD.HLT..........................GHGDPDPLEAT.IHFVLLDFQNTTLPYNNVAQ...........AMPQKSKSNASK-KARSGQRKPTGK..RRIAQVKKAIIAAA--G---paAKPSYNAPTANAEPKYKPPKAASASGAAKEWVDSEDNAMAAGLAYYNMITDPFSTPARGGWVGDIEIGTQEGQDIMKISAPPTILNATYNANYKFCCVGIEAKASDPVVMMTTLGNTGAMTLSSAGNPPSWSGLVSNANLLITRCVGLKINNFNAFQQRNGRAYVLPRWGAYQNGSVVYPANVTDITYNDDTMVYDAANMPEDFLLTVKNELIAGEDLYAPTASVSLNQNTSNMAFVLFVFDGVD...................................................................................................................
>2
pnfrawvrkfpekvrr...................................................................RLEEAMAAMtghkpmadt......................................................................I..................GRVLFRPAFMKDEK.-ANTGYP......................................GGS...KISDPRLIQPG..SPELNAVWGPMFFAIAGMWKACYATH................................................................-SC-LVWAAGLTG-..DEIGDSMYQACCYT.........---........................................................................................NGFQFVENDCSRFDASVQRELLM..LRQDIYSLYF.....................DLD..VDS......P...MGMSYR..KLLKLMCDKQGI...........................................................................................................T...PHG...........IRY..K.....TV........GTVASGDGD..TSMWNFFLNSMADLFAYCTN-...PLFapdgas..................................................................................................LVTPL......................................................................VLYGTS..VLNHHDSQCQFAAVREA..WAQ................................................REA....LESKFEAGE..........................................-PEAWREH..............VATLE...H........WEAAQDWAAKTVL---PSMrlprdvlmmttsiasgsvpdsstatarrgpdsaglGLSENVTVPrehkddpsdaH.RQEWLLTAAASDDG-YVRGRA--AQSYVPTTRASLDGTHRFLVPATAASGPAFAARDVDRE.-ELDRAWAERKDA-.APVSGRLPS-SPLT...TDIPTAHLIERDAGRAHTYVTPGRWNDHAAMDSAVQPNV.RAP--GGTAGHHGAAFAHAPPS.......-.ARP.QRLDGLPSRRTLNETLEVGPSQLALYFVFLAHSEL-.EISsstrsflstwcesrnfsfadylrdlsALTPVQWLEWC.FYKSYTNGDDFSCIQGP-CN...........-PGLAHRHGVYQ-SLGFRPEFKTYQ..-EVAHTEFCSSVLM---PCY..-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------...................................................................................................................

etc.

The MSA seems to be in a2m format, so I tried to convert it to fasta format with the reformat.pl script from hhsuite. However, the resulting MSA seemed to be aligned terribly. What's the format of the output MSA: fasta, a3m or a2m?

felbecker commented 10 hours ago

Hi, I wasn't aware of the a2m and a3m formats (which are technically still fasta). You assumed correctly that learnMSA outputs a2m (convert to ordinary fasta by making everything upper case and replacing "." with "-").
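The conversion described above (upper-case everything, replace "." with "-") can be sketched in a few lines of Python. This is only a minimal illustration, not part of learnMSA; the function name `a2m_to_fasta` and the toy records are made up for the example.

```python
def a2m_to_fasta(lines):
    """Convert A2M-style records to plain aligned FASTA.

    Header lines (starting with '>') are kept unchanged; sequence lines
    are upper-cased and every '.' is replaced with '-'.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.upper().replace(".", "-"))
    return out


# Toy input (hypothetical, not taken from the real alignment).
example = [">1", "..V--pf.AQ", ">2", "pnRLE..-AM"]
converted = a2m_to_fasta(example)
print("\n".join(converted))
```

Because the conversion only changes case and the gap symbol, every row keeps its length, so the columns of the alignment are unaffected.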

I don't think anything went wrong in your conversion.

Otherwise: What exactly do you mean by "aligned terribly"? Could you send me the input, the learnMSA output, and the learnMSA version and settings (if non-default) you used?

permia commented 9 hours ago

What do you mean exactly by "aligned terribly"?

The three motifs A, B and C of the RdRP domain weren't aligned, compared to the muscle super alignment. So I thought that maybe I had misunderstood the format of the output.

I installed learnMSA as described in the README and ran the following command:

learnMSA -i ./RdRP1_dedup.faa -o ./learnMSA_all.afa --sequence_weights

The input and output files have been sent to you by e-mail (beckerfelix94).

felbecker commented 9 hours ago

Update: I checked reformat.pl. Perhaps it's the character limit per line. learnMSA does not have a character limit, but reformat.pl seems to assume one (100 by default). Using the argument -l "alignment length" in reformat.pl could help.

I'll also check your files.

I will add more options to configure the learnMSA output soon and make the documentation clearer.
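A quick way to test the line-limit hypothesis independently of any converter is to read the converted file, join wrapped sequence lines per record, and check that all rows end up with the same length. The sketch below assumes plain FASTA-style input; the helper names `read_fasta` and `row_lengths` and the toy data are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-style lines into (header, sequence) pairs.

    Sequence lines that were wrapped across several lines are joined
    back into a single string per record.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def row_lengths(records):
    """Return the set of distinct row lengths; a valid MSA has exactly one."""
    return {len(seq) for _, seq in records}


# Toy alignment wrapped at 4 characters per line (hypothetical data).
wrapped = [">1", "AC-G", "T-", ">2", "ACGG", "TA"]
records = read_fasta(wrapped)
lengths = row_lengths(records)
print(lengths)
```

If the set contains more than one length, the rows were cut or wrapped inconsistently somewhere in the conversion.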

permia commented 9 hours ago

Thanks for the reply. I think some sequences share low similarity with the other RdRP sequences (or have longer inserts), which made the alignment seem bad. Actually, most of the sequences (about 2500 of 3265) are well aligned.

PS: I expected three motifs to be aligned: DXXXX[D/E], [S/T]G and [G/S/A]D[D/N].
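To check whether these motifs end up in the same columns, one option is to translate the patterns into regular expressions, search the ungapped residues of each row, and map the hits back to alignment columns. This is a rough sketch: the regex translations and the assignment of the labels A, B and C to the patterns in the order listed are assumptions, and short patterns such as [S/T]G will also match at unrelated positions in real sequences.

```python
import re

# Assumed regex translations of the three motif patterns (X = any residue).
MOTIFS = {
    "A": r"D.{4}[DE]",
    "B": r"[ST]G",
    "C": r"[GSA]D[DN]",
}


def motif_columns(aligned_row, pattern):
    """Return the alignment columns where matches of `pattern` start.

    Gap characters ('-' and '.') are skipped, residues are upper-cased,
    and each match position is mapped back to its alignment column.
    """
    cols = [i for i, ch in enumerate(aligned_row) if ch not in "-."]
    residues = "".join(aligned_row[i] for i in cols).upper()
    return [cols[m.start()] for m in re.finditer(pattern, residues)]


# Two toy rows (hypothetical): motif A is shifted, B and C line up.
row1 = "--DAAAAE-SG--GDD"
row2 = "DAA-AAE--SG-.GDN"
for name, pattern in MOTIFS.items():
    print(name, motif_columns(row1, pattern), motif_columns(row2, pattern))
```

Rows whose start columns for a motif differ from the majority are candidates for the poorly aligned sequences.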

felbecker commented 2 hours ago

I checked your files and I agree: learnMSA found your motifs, but some sequences seem to be misaligned because of low amino acid similarity.

I was curious whether aligning with language model support (--use_language_model) would improve the alignment (it should in your case). I sent you the output. Can you confirm that this alignment looks better than the one without pLM support?

To reproduce: I aligned with version 2.0.8 (published today) and learnMSA -i ./RdRP1_dedup.faa -o ./learnMSA2_all.e2m --sequence_weights --use_language_model.

Best, Felix