Closed mhampton closed 5 months ago
Not sure what you're expecting it to do here. hmmalign
is for aligning only (sub)sequences that are homologous to the model. If you have one or more domains in a target sequence, they need to be extracted first as individual subsequences to be aligned. Here you're asking it to align a longer sequence containing two domains + other nonhomologous sequences in a global alignment to a single domain profile. That's not what hmmalign
does, and it doesn't really make sense (what are you expecting the output MSA to look like?)
Here, your 662aa NP_565030.1 sequence has a hit to the PF01061.27 model at 389..594. So your "testing" sequence, with the 662aa x2 duplication, now has two identical hits in it, one at 389..594 and one at 1017..1222. hmmalign
, with its probability model configured to find exactly one domain, expecting to align individual domains, sees that as a kind of alignment ambiguity: neither one of those domains has >50% probability of being correctly aligned to the model, if the model probabilities expect only one domain per sequence. So it doesn't call either one as a reliable single-domain alignment; it calls essentially the whole sequence as unaligned because it can't find a confident single domain. And with the --trim
option, it trims off the flanking nonhomologous stuff, which in this case is essentially the entire sequence.
This is behaving the way it's designed, and it's not a bug. The solution is to give homologous single-domain (sub)sequences as input to hmmalign for alignment. This is documented.
My apologies, I thought hmmalign could be used in a more general way.
If by "a more general way" you mean you want to search sequences for one or more homologous domains, defined by score or statistical significance thresholds, and output an alignment of all of these subsequences... then use the -A
option for hmmsearch
: hmmsearch -A <out_msa> <hmmfile> <seqfile>
.
Hi, I think there is a problem in hmmalign when the sequence file contains two or more copies of the domain if the copies are very similar. As an example, using the Pfam model PF01061.27.hmm (an ABC transporter domain), there is an excellent match to the sequence AtABCG25 given below, but if I double that sequence and make small changes, the output is garbage. Example sequences: