First a prelude

nhmmer's analysis pipeline largely follows the flow of phmmer/hmmsearch:
ungapped Viterbi filter
gapped Viterbi filter
Forward filter + Backward
"domain" definition (which serves as hit separation in nhmmer)
and so on
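For readers less familiar with the pipeline, here is a minimal, self-contained sketch of that early-exit cascade. It is purely illustrative: the filter functions and thresholds below are invented placeholders, not HMMER's actual filter routines.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustration only: a toy early-exit cascade in the spirit of the stages
 * listed above. The filter functions and thresholds are placeholders. */
typedef struct { const char *name; double score; } Target;

static bool ungapped_viterbi_filter(const Target *t) { return t->score > 1.0; }
static bool gapped_viterbi_filter  (const Target *t) { return t->score > 2.0; }
static bool forward_filter         (const Target *t) { return t->score > 3.0; }

static void process_target(const Target *t)
{
  if (!ungapped_viterbi_filter(t)) return;   /* cheapest filter first */
  if (!gapped_viterbi_filter(t))   return;
  if (!forward_filter(t))          return;
  /* Backward + posterior decoding would run here, followed by "domain"
   * definition, which in nhmmer serves to separate individual hits.   */
  printf("%s: passed all filters; define hits\n", t->name);
}

int main(void)
{
  Target targets[] = { { "seqA", 0.5 }, { "seqB", 4.2 } };
  for (int i = 0; i < 2; i++) process_target(&targets[i]);
  return 0;
}
```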
One place where nhmmer significantly diverges from the protein analogs is in its handling of composition bias: rather than simply computing the bias due to composition, nhmmer adjusts the background distribution, adjusts the om model accordingly, and re-runs Forward/Backward with the new parameters. This corrects for composition bias and, at the same time, cuts down on homologous overextension caused by that bias.
This background adjustment step happens in rescore_isolated_domain(), and depends on calls to reparameterize_model(). The first call to reparameterize_model() updates the background model bg from the original (call it A) to a new background (call it B) based on the composition of the aligned envelope, and stores the original A in a temporary variable provided by the calling function. A later call to reparameterize_model() reverts bg to A using that temporary variable.
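To make the A/B bookkeeping concrete, here is a small self-contained sketch of that save/swap/restore pattern. Bg and reparameterize() are toy stand-ins for P7_BG and reparameterize_model(); the signatures and values are invented for illustration, not the real ones.

```c
#include <stddef.h>

/* Toy model of the save/swap/restore pattern described above. "Bg" stands in
 * for HMMER's P7_BG; reparameterize() mimics the role (not the real
 * signature) of reparameterize_model(). */
typedef struct { float f[4]; } Bg;               /* DNA background frequencies */

static void reparameterize(Bg *bg, const Bg *new_bg, Bg *saved)
{
  if (saved) *saved = *bg;                       /* remember the original (A)  */
  *bg = *new_bg;                                 /* install B, or restore A    */
  /* ... the real code also adjusts the om model to match the new bg, and
   *     Forward/Backward are then re-run with the new parameters ...          */
}

static void rescore_envelope_sketch(Bg *bg, const Bg *envelope_composition)
{
  Bg saved;
  reparameterize(bg, envelope_composition, &saved);  /* A -> B                 */
  /* ... Forward/Backward/decoding on the envelope with background B ...       */
  reparameterize(bg, &saved, NULL);                  /* B -> A on the way out  */
}

int main(void)
{
  Bg bg  = { { 0.25f, 0.25f, 0.25f, 0.25f } };   /* original background, A      */
  Bg cmp = { { 0.45f, 0.05f, 0.05f, 0.45f } };   /* biased envelope composition */
  rescore_envelope_sketch(&bg, &cmp);            /* on return, bg is back at A  */
  return 0;
}
```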
Now the problem
Unfortunately, when a target sequence is highly repetitive, it can cause the p7_Decoding() call within rescore_isolated_domain() to return eslERANGE (numeric overflow). rescore_isolated_domain() then returns an error code to the calling p7_pipeline function, which handles this case gracefully ... but the current code fails to revert the bg model to its original state A (see the sketch after the list below). The upshot is that all later alignments are performed with a model parameterized on a wrong, and highly biased, background B. The effect can stack: several such occurrences can lead to a very skewed bg.
The result is two-fold:
all alignments after a p7_Decoding() eslERANGE status will be wrong, period.
with a sufficiently skewed bg (which can happen when several eslERANGE results stack), essentially all sequences in the target database pass the ungapped, gapped, and even Forward filters, so everything gets very slow.
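Reusing the toy Bg/reparameterize() model from the sketch above (re-declared here so the block stands alone), the failure mode and a minimal repair look roughly like this. ESL_OK/ESL_ERANGE stand in for Easel's eslOK/eslERANGE, decode() stands in for the p7_Decoding() call, and the "fixed" version shows only the shape of a fix, not the exact patch.

```c
#include <stddef.h>

enum { ESL_OK = 0, ESL_ERANGE = 1 };            /* stand-ins for Easel status codes */
typedef struct { float f[4]; } Bg;

static void reparameterize(Bg *bg, const Bg *new_bg, Bg *saved)
{
  if (saved) *saved = *bg;
  *bg = *new_bg;
}

static int decode(void) { return ESL_ERANGE; }  /* pretend the repeat overflowed */

/* Buggy shape: the early return leaves bg parameterized with B,
 * so every later alignment is scored against the wrong background. */
static int rescore_buggy(Bg *bg, const Bg *envelope_composition)
{
  Bg saved;
  reparameterize(bg, envelope_composition, &saved);  /* A -> B                  */
  if (decode() == ESL_ERANGE) return ESL_ERANGE;     /* BUG: bg still holds B   */
  reparameterize(bg, &saved, NULL);                  /* only reached on success */
  return ESL_OK;
}

/* Fixed shape: revert bg on every exit path before returning the status. */
static int rescore_fixed(Bg *bg, const Bg *envelope_composition)
{
  Bg saved;
  int status;
  reparameterize(bg, envelope_composition, &saved);  /* A -> B                  */
  status = decode();                                 /* may be ESL_ERANGE       */
  reparameterize(bg, &saved, NULL);                  /* B -> A, even on error   */
  return status;
}

int main(void)
{
  Bg a  = { { 0.25f, 0.25f, 0.25f, 0.25f } };
  Bg b  = { { 0.45f, 0.05f, 0.05f, 0.45f } };
  Bg bg = a;
  rescore_buggy(&bg, &b);   /* bg is now stuck at B           */
  bg = a;
  rescore_fixed(&bg, &b);   /* bg is restored to A either way */
  return 0;
}
```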
I'm attaching files to reproduce the problem (note: in both cases I added a .txt extension so GitHub would let me upload them):
DR0184904.hmm.txt: a single-family HMM (a transposable element family)
25seqs.fa.txt: a file with 25 sequences. One of them (DR0179595) is just one great big ATG tandem repeat (>4000 nt long, only 331 substitutions from a perfect ATG repeat, along with 23 indels). This sequence induces a p7_Decoding() eslERANGE return and causes downstream changes to model parameterization.
To see a side effect of the problem, you can run nhmmer and observe that a few more nucleotides pass the Forward filter than would if you ran each target sequence separately:
% nhmmer --crick --cpu=0 DR0184904.hmm 25seqs.fa
...
Residues passing Fwd filter: 5266 (0.142); expected (3e-05)
(the total should be 4091 residues)
Bypassing composition-bias reparameterization makes the problem go away
This error is obviously a serious problem, since it can lead to incorrect results (and to extremely long run times). It caused a fair amount of grief in a recent Dfam release cycle, almost certainly because a large number of long and artificially perfect tandem repeats made it into an all-against-all alignment framework.
I've tested several repetitive models against the human genome and haven't seen the eslERANGE error pop up in those tests, so I doubt that the impact is widespread - the error seems to crop up only in severe cases of repetitiveness. Even so, it should be resolved soon - it may well be affecting someone, somewhere.
The solution is straightforward, and I'm able to submit a PR. Before I do that: would you prefer that I wait until you finalize PR#194, or that I just go ahead with a new PR while that one is under consideration? The changes will have no overlap with the changes in that PR.