EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
305 stars 69 forks source link

Bug in hmmalign for multiple near-identical domains #322

Closed mhampton closed 5 months ago

mhampton commented 5 months ago

Hi, I think there is a problem in hmmalign when the sequence file contains two or more copies of the domain if the copies are very similar. As an example, using the Pfam model PF01061.27.hmm (an ABC transporter domain), there is an excellent match to the sequence AtABCG25 given below, but if I double that sequence and make small changes, the output is garbage. Example sequences:


>NP_565030.1 ATP-binding cassette family G25 [Arabidopsis thaliana]
MSAFDGVENQMNGPDSSPRLSQDPREPRSLLSSSCFPITLKFVDVCYRVKIHGMSNDSCNIKKLLGLKQK
PSDETRSTEERTILSGVTGMISPGEFMAVLGPSGSGKSTLLNAVAGRLHGSNLTGKILINDGKITKQTLK
RTGFVAQDDLLYPHLTVRETLVFVALLRLPRSLTRDVKLRAAESVISELGLTKCENTVVGNTFIRGISGG
ERKRVSIAHELLINPSLLVLDEPTSGLDATAALRLVQTLAGLAHGKGKTVVTSIHQPSSRVFQMFDTVLL
LSEGKCLFVGKGRDAMAYFESVGFSPAFPMNPADFLLDLANGVCQTDGVTEREKPNVRQTLVTAYDTLLA
PQVKTCIEVSHFPQDNARFVKTRVNGGGITTCIATWFSQLCILLHRLLKERRHESFDLLRIFQVVAASIL
CGLMWWHSDYRDVHDRLGLLFFISIFWGVLPSFNAVFTFPQERAIFTRERASGMYTLSSYFMAHVLGSLS
MELVLPASFLTFTYWMVYLRPGIVPFLLTLSVLLLYVLASQGLGLALGAAIMDAKKASTIVTVTMLAFVL
TGGYYVNKVPSGMVWMKYVSTTFYCYRLLVAIQYGSGEEILRMLGCDSKGKQGASAATSAGCRFVEEEVI
GDVGMWTSVGVLFLMFFGYRVLAYLALRRIKH

>testing
MSAFDGVENQMNGPDSSPRLSQDPREPRSLLSSSCFPITLKFVDVCYRVKIHGMSNDSCNIKKLLGLKQK
PSDETRSTEERTILSGVTGMISPGEFMAVLGPSGSGKSTLLNAVAGRLHGSNLTGKILINDGKITKQTLK
RTGFVAQDDLLYPHLTVRETLVFVALLRLPRSLTRDVKLRAAESVISELGLTKCENTVVGNTFIRGISGG
ERKRVSIAHELLINPSLLVLDEPTSGLDATAALRLVQTLAGLAHGKGKTVVTSIHQPSSRVFQMFDTVLL
LSEGKCLFVGKGRDAMAYFESVGFSPAFPMNPADFLLDLANGVCQTDGVTEREKPNVRQTLVTAYDTLLA
PQVKTCIEVSHFPQDNARFVKTRVNGGGITTCIATWFSQLCILLHRLLKERRHESFDLLRIFQVVAASIL
CGLMWWHSDYRDVHDRLGLLFFISIFWGVLPSFNAVFTFPQERAIFTRERASGMYTLSSYFMAHVLGSLS
MELVLPASFLTFTYWMVYLRPGIVPFLLTLSVLLLYVLASQGLGLALGAAIMDAKKASTIVTVTMLAFVL
TGGYYVNKVPSGMVWMKYVSTTFYCYRLLVAIQYGSGEEILRMLGCDSKGKQGASAATSAGCRFVEEEVI
SAFDGVENQMNGPDSSPRLSQDPREPRSLLSSSCFPITLKFVDVCYRVKIHGMSNDSCNIKKLLGLKQK
PSDETRSTEERTILSGVTGMISPGEFMAVLGPSGSGKSTLLNAVAGRLHGSNLTGKILINDGKITKQTLK
TGFVAQDDLLYPHLTVRETLVFVALLRLPRSLTRDVKLRAAESVISELGLTKCENTVVGNTFIRGISGG
ERKRVSIAHELLINPSLLVLDEPTSGLDATAALRLVQTLAGLAHGKGKTVVTSIHQPSSRVFQMFDTVLL
LSEGKCLFVGKGRDAMAYFESVGFSPAFPMNPADFLLDLANGVCQTDGVTEREKPNVRQTLVTAYDTLLA
PQVKTCIEVSHFPQDNARFVKTRVNGGGITTCIATWFSQLCILLHRLLKERRHESFDLLRIFQVVAASIL
CGLMWWHSDYRDVHDRLGLLFFISIFWGVLPSFNAVFTFPQERAIFTRERASGMYTLSSYFMAHVLGSLS
MELVLPASFLTFTYWMVYLRPGIVPFLLTLSVLLLYVLASQGLGLALGAAIMDAKKASTIVTVTMLAFVL
TGGYYVNKVPSGMVWMKYVSTTFYCYRLLVAIQYGSGEEILRMLGCDSKGKQGASAATSAGCRFVEEEVI

output on the second is:

mh@Dumas bio % hmmalign --trim PF01061.27.hmm test_hmmalign.fa 
# STOCKHOLM 1.0

testing         -Q------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#=GR testing PP .4......................................................................................................................................................................................................
#=GC PP_cons    .4......................................................................................................................................................................................................
#=GC RF         xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

testing         ----------
#=GR testing PP ..........
#=GC PP_cons    ..........
#=GC RF         xxxxxxxxxx
//
cryptogenomicon commented 5 months ago

Not sure what you're expecting it to do here. hmmalign is for aligning only (sub)sequences that are homologous to the model. If you have one or more domains in a target sequence, they need to be extracted first as individual subsequences to be aligned. Here you're asking it to align a longer sequence containing two domains + other nonhomologous sequences in a global alignment to a single domain profile. That's not what hmmalign does, and it doesn't really make sense (what are you expecting the output MSA to look like?)

Here, your 662aa NP_565030.1 sequence has a hit to the PF01061.27 model at 389..594. So your "testing" sequence, with the 662aa x2 duplication, now has two identical hits in it, one at 389..594 and one at 1017..1222. hmmalign, with its probability model configured to find exactly one domain, expecting to align individual domains, sees that as a kind of alignment ambiguity: neither one of those domains has >50% probability of being correctly aligned to the model, if the model probabilities expect only one domain per sequence. So it doesn't call either one as a reliable single-domain alignment; it calls essentially the whole sequence as unaligned because it can't find a confident single domain. And with the --trim option, it trims off the flanking nonhomologous stuff, which in this case is essentially the entire sequence.

This is behaving the way it's designed, and it's not a bug. The solution is to give homologous single-domain (sub)sequences as input to hmmalign for alignment. This is documented.

mhampton commented 5 months ago

My apologies, I thought hmmalign could be used in a more general way.

cryptogenomicon commented 5 months ago

If by "a more general way" you mean you want to search sequences for one or more homologous domains, defined by score or statistical significance thresholds, and output an alignment of all of these subsequences... then use the -A option for hmmsearch: hmmsearch -A <out_msa> <hmmfile> <seqfile>.