`jackhmmer` returns "broken" homologs?

ayaanhossain commented 1 year ago

Hello @cryptogenomicon, big fan of hmmer, especially jackhmmer. Recently, while searching for some multi-domain protein homologs against standard databases, we have seen the following outcome from jackhmmer with default depth of 5 (no params altered from default).

seed            --AAAAAAAAA---GGGG---
homologA/split1 --AGAAAAAAA----------
homologA/split2 --------------GAGG---
homologB/split1 --AGAAAAAAA----------
homologB/split2 ----------A-AAGAGG---

At first, we chalked this down to domain-wise matches within a larger homolog, potentially with repeats and whatnot.

seed       --AAAAAAAAA---GGGG---
homologA   --AGAAAAAAA---------- ... x residues later ... --------------GAGG---
homologB   --AGAAAAAAA---------- ... y residues later ... ----------A-AAGAGG---

However, on closer inspection, we noticed that the split coordinates were overlapping. For example, if homologA/splitA ranged from (15, 23), the homologA/splitB would span (27, 30). Ideally this should have been a single match like the following.

seed           --AAAAAAAAA---GGGG---
homologA/split --AGAAAAAAA---GAGG---

What was even more baffling was the result in homologB, where not only the coordinates overlapped, but also the match itself was split.

Our guess is this is due to the default parameter we are using, and we'd appreciate if you let us know how we could mitigate this? The standard reference databases we're searching against are free of duplicate entries, so we're unsure where this is coming from. Could it be a product of the iterative searching that jackhmmer does?

We've seen very identical matches missing in just a single round of jackhmmer vs the regular iteration count of 5 too, no idea why though given the high similarity.

cryptogenomicon commented 1 year ago

With multihit local alignment, there's no crisp difference between calling two homologous regions (two local alignments to the profile, separated by entirely nonhomologous sequence) vs calling one homologous region with an insertion (or more rarely, deletion, in some weirder cases). The longer the insertion, the more the inference will err toward splitting the annotation into two local alignments.

Coordinates of two inferred local alignments to the same target sequence can indeed occasionally overlap in HMMER3. This reflects the uncertainty of whether a particular residue is best assigned to one versus the other. You'll see that uncertainty in low posterior probability (alignment confidence) annotation on any such residue.

These things are an inherent property of the multihit local alignment probability model itself - HMMER's profile HMM design - not particular to jackhmmer.

There should be ways to improve this. There's "chaining" algorithms, used in genome/genome alignment to stitch a bunch of local alignments together into a larger consistent linear alignment. However, as far as I've been able to see so far, I don't see a way of adapting a profile HMM architecture to what those algorithms do; they implicitly depend on more contextual information than a hidden Markov model has available to it.

ayaanhossain commented 1 year ago

Thanks so much Dr. Eddy, this helps us understand a lot! The multiple local alignment hits and the splitting now makes sense to me. You referred to chaining algorithms that can potentially stitch split alignments and reassemble them again. I'm guessing not, but is there perhaps a tool or script known to you that we can apply to the hits output by jackhmmer to regain the larger alignments? Right now, our thought is to loop over the results in a FASTA format (converted from the Stockholm output from jackhmmer), and if the coordinates overlap with respect to the MSA length, try to merge them together into the form I described earlier. That is, convert these

seed            --AAAAAAAAA---GGGG---
homologA/split1 --AGAAAAAAA----------
homologA/split2 --------------GAGG---

into this.

seed           --AAAAAAAAA---GGGG---
homologA/split --AGAAAAAAA---GAGG---

But, it would be awesome if you knew of a better way to do this or refer us in some direction. Otherwise, this has been really helpful, and we can close this issue too. Thanks so much Sean!

cryptogenomicon commented 1 year ago

Sorry, I don't know of such a program. I don't think it's an easy problem to distinguish two domains from one, so I don't think a simple script that pushes all cases to a single domain annotation is going to solve the problem. Your approach will create the opposite error: it will fuse cases where there really are two different domains into a single incorrect domain. HMMER is trying to resolve it probabilistically, at least; it does have a probability model to try to distinguish the two cases, rather than always resolving it in one direction, even if I don't think the probability model is always right.

ayaanhossain commented 1 year ago

Yeah, bit of a conundrum on our end.

EddyRivasLab / hmmer

`jackhmmer` returns "broken" homologs? #308