EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
307 stars 69 forks source link

hmmalign avoid flanking insertions #254

Closed btski closed 1 year ago

btski commented 3 years ago

Currently using HMMER 3.1b2 (February 2015).

I've been using hmmalign to help find the position of genes within a genome. Fairly frequently, matching terminal ends (near beginning/end) will be skewed, resulting in excess trimming. For example, the alignment output might have an unmatched nucleotide followed by several hundred gaps, then the aligned residues.

test.fasta.txt test.hmm.txt

hmmalign test.hmm.txt test.fasta.txt

Output:

c---{~2000 gaps}---GATTGTCCTGTAAGGAAGACT . . . etc
6...{~2000 dots}...567899*************** . . . etc

Unsure if this behavior is present in more recent versions of HMMER. Will test if/when I can get access other version/s.

cryptogenomicon commented 3 years ago

That seems reasonable even though it may look weird. The c is being called as a flanking insertion (nonhomologous) relative to the consensus. It's saying that there's either a ~2000 or ~1999 residue deletion of the start of your consensus model, depending on whether it includes the c as aligned vs. not; and it slightly prefers not to consider the c as part of the homologous region. (I say slightly, based on the confidence values shown.) It's quite reasonable that the alignment it's showing is the better one. It would take time to check it mathematically; I have no reason to expect it's wrong though.

hmmalign, by default, reports your entire input sequence (including nonhomologous flanks) and does not trim. It's forced to throw the c out to the side, if it thinks it's actually nonhomologous. If you want to remove nonaligned flanks, use hmmalign --trim.

If you have a reason to think that this alignment is definitely wrong (and not just a judgement call about where that c goes, and not just weird-looking to you), let me know and I'll have a closer look.

btski commented 3 years ago

Thanks for the response! I don't disagree that the alignment is optimal. However, for my application it is an inaccurate interpretation, and I'm not sure how to resolve the issue without making vague assumptions that might not always be reasonable. Do you have any recommendations on how I might avoid flanking insertions?

cryptogenomicon commented 3 years ago

I don't think we have a way of forcing that in the current code for hmmalign. You'd want to change the alignment mode from single-hit glocal (global in profile, local in sequence) to global (in both). I think it'd require going into the code to tweak this.

The reason for the behavior being the way it is is that much of the time, sequences haven't been trimmed at all before handing them to hmmalign (full-length protein sequences, for example, for single-domain proteins), so you have to allow flanking nonhomology or even more terrible things happen, even more of the time.

btski commented 3 years ago

My use case involves finding relative gene positions using regular expressions (suffix trees might be better, but regex is sufficient for now). I can usually find start/stop position (independent of start/stop codon) and track reading frame changes with incremental matching. I use hmmalign to find: 1) start/stop position if regex is insufficient and 2) indel position whenever reading frame changes. I work with highly diverse virus species and hmmalign is absolutely essential for alignment.

For process 1, I crop roughly ~10% past the expected start/end point prior to hmmalign and rely on trimming to establish the "measured" start/stop. Current glocal implementation works well. Process 2 I create a small, truncated HMM profile and align to a matching subset of the query sequence. In this case, flanking insertions should not be allowed, since alignment is known to be fully homologous (global/global preferable). I currently correct by combining unmatched, lower case residues with gaps to produce a forced match.

It becomes challenging when the query is near complete (might be a little longer or shorter than the start/stop position). Trimming these near complete queries becomes ambiguous; all the information I have to work with is the number of gaps to unmatched residues (when flanking insertions occur, which is pretty frequent). Additional context would be helpful for deciding whether trimming is necessary. I could "guesstimate" the alignment score. Using my original example: 567899*** implies that the un-aligned "c" would have a score of roughly 4. This could help me determine scope on convertible un-aligned residues (limited process 2 treatment).