EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
317 stars 70 forks source link

nhmmer model parameterization errors caused by highly repetitive sequence #198

Closed traviswheeler closed 4 years ago

traviswheeler commented 4 years ago

First a prelude

nhmmer's analysis pipeline largely follows the flow of phmmer/hmmsearch:

One place where nhmmer significantly diverges from the protein analogs is in handling of composition bias: rather than simply compute the bias due to composition, nhmmer adjusts the background distribution, adjusts the om model accordingly, and re-runs Forward/Backward with the new params. This adjusts for composition bias, and concurrently cuts down on homologous overextension due to that bias.

This step background adjustment step happens in rescore_isolated_domain(), and depends on calls to reparameterize_model(). The first call to reparameterize_model() updates the background model bg from the original (call it A) to a new background (call it B) based on the composition of the aligned envelope, and stores the original A in a temporary variable provided by the calling function. A later call to reparameterize_model() reverts bg to A based on that temporary variable.

Now the problem

Unfortunately, when a target sequence is highly repetitive, it can cause the p7_Decoding() call within rescore_isolated_domain() to return eslERANGE (numeric overflow). rescore_isolated_domain() then returns an error code to the calling p7_pipeline function, which handles this case gracefully ... but the current code fails to revert the bg model to its original state A. The result is that all later alignments are performed with a model that is parameterized based on a (wrong, and highly-biased) background B. This can stack - if there are several occurrences, they can lead to a very skewed bg.

The result is two-fold:

I'm attaching files to reproduce here: (note: in both cases, I added the .txt extension so github would let me upload them):

To see a side effect of the problem, you can run nhmmer, and see that a few more nucleotides pass the Forward filter than would if you ran each target sequence separately.

% nhmmer --crick --cpu=0 DR0184904.hmm 25seqs.fa
...
Residues passing Fwd filter:            5266  (0.142); expected (3e-05)
(the total should be 4091 residues)

Bypassing composition-bias reparameterization makes the problem go away

% nhmmer --nonull2 --crick --cpu=0 DR0184904.hmm 25seqs.cons
...
Residues passing Fwd filter:            4091  (0.111); expected (3e-05)

This error is obviously a serious problem, since it can lead to incorrect results (also really really long run time). It caused a fair amount of grief in a recent Dfam release cycle, almost certainly because a large number of long and artificially perfect tandem repeats made it into an all-against-all alignment framework.

I've tested several repetitive models against the human genome, and haven't seen the eslERANGE error pop up in those tests, so I doubt that the impact is widespread - the error seems to crop up only in severe cases of repetitiveness. Even so, it should be resolved soon - surely this might be affecting someone, somewhere.

The solution is straightforward, and I'm able to submit a PR. Before I do that: would you prefer that I wait until you finalize PR#194, or that I just go ahead with a new PR while that one is under consideration? The changes will have no overlap with the changes in that PR.

cryptogenomicon commented 4 years ago

This looks like a serious problem. Go ahead and submit a PR.