Fix for large loss in hits using nhmmer w/ FM #172

traviswheeler commented 3 years ago

This addresses https://github.com/EddyRivasLab/hmmer/issues/172.

Refresher: nhmmer can be used to search for query sequences in a target database by either (a) directly searching the target sequences with the standard HMMER filter pipeline, or (b) first creating a binary index of the target (using makehmmerdb), then using that index to find high-scoring ungapped seeds that are then passed through the remaining pipeline.

The problem was that (b) was producing substantially fewer hits than (a), and many of the missed hits were high-quality matches e.g. E-value < 1e-40.

The issue was raised by Nick Carter, after a routine search for homologs in a dataset of Saccharomycotina scaffolds. The error isn't unique to this dataset, but I was able to reproduce by searching with a set of query sequences provided by Nick against assembled scaffolds from: https://mycocosm.jgi.doe.gov/saccharomycotina/saccharomycotina.info.html (The JGI scaffold file is ~1.3Gb; I will upload the query file in the pull request that follows this commit).

With a saccharomycotina dataset downloaded 01/15/2021 (89 scaffolds), (a) produced 255 matches, while (b) produced only 195. One of the hits from (a) that was missed by (b) had an E-value of 4.2e-71. Yuck.

I initially thought the missing hits were the result of a bug in the FM index implementation used as the basis of the binary index. After much inspection, I couldn't find such an FM-index bug, and (finally) realized that each of the missed hits really does fail to pass the precise filtering criteria used in the seed-finding stage (or the seed-extension stage that immediately follows).

A refresher on that pipeline:

A seed involves some region of the target sequence aligned against some part of the query, and must reach a target score S within some specified length L. By default, that was S=15 bits within a length L=15 ungapped run.
The space of possible seeds is explored by enumerating all length-L strings, growing strings to produce the enumeration. During this enumeration, substrings are pruned (not extended to reach length L) if they fail to meet particular conditions:
- at some length L'<L, it's clear that there is no possible extension of the growing seed that can possibly reach the score S by length L.
- at L', their score density is below some threshold (0.8 bits/position),
- and a few others ad hoc filtering criteria.

(1) During development of the FM-accelerated pipeline for nhmmer, testing was performed with HMM queries, not single-sequence queries. The per-position scores produced by nhmmer for single-sequence queries are pretty low, so it is quite likely (depending on the specific letters) that a length-15 ungapped alignment with 2 mismatches will not reach 15 bits. In other words, the seeding mechanism almost requires seeds in which 14 of 15 positions are identical. That's too stringent. Changing the cutoff to "14 bits" recovers a large number of the missed hits, at <10% increase in run time. (note, they usually still need to have 13/15) I'm hesitant to go lower than that, and speed really starts to drop off.

(2) Even so, aligned regions with mismatch(es) in the middle sometimes reduce score density enough to cause the match to be filtered by the score density criteria. These regions look like this: taccagctaggtaat taccag t ggtaat TACCAGTTGGGTAAT The density filter is an integral part of the speed of the indexing approach, so can't be removed ... but by slightly loosening it (to a default of 0.75), we recover more lost matches.

(3) Finally, regions like the one above also highlighted the failings of an additional (ill-advised) density-based pruning condition that I've now removed entirely. After a certain length of candidate seed (e.g. length=11 by default), the growing seed was required to have score density at least as high as the final target density (previously S/L = 15/15 = 1.0). I thought this was a safe extra filter because the index considers seeds in both directions, and I believed that a miss in the forward pass would mean there must be sufficient density in the reverse pass to ensure survival. This failed to allow for the fact that an alignment with a single mismatch in the middle will have a density dip in both directions. In that case, both directions will get pruned before picking up the score to recover the match (even if the density from (2) above is met). I've removed this ad hoc filter; it has little appreciable impact on speed, but allows recovery of several previously-missed matches.

(4) Finally, I found that several matches were missed because nhmmer takes an identified seed, and extends it to an ungapped (SSV) alignment of length at most 70 - if the target SSV score isn't reached by then, the comparison is quit. By extending this value to 100, several additional hits were recovered.

So in total:

I've removed one ill-advised seed pruning condition
I've reduced the default score threshold (--seed_sc_thresh) to 14.0
I've reduced the default req'd score density (--seed_sc_density) to 0.75
I've increased the default ssv diagonal length (--seed_ssv_length) to 100

The result is that the FM-accelerated nhmmer run time has increased a bot over 10%, and 247 hits are found in the above dataset. Most of those still missed hits are good scoring ones (e.g. E-value 1e-14), but not great. In one case the FM-accelerated pipeline binds together a hit that was split into two hits by the standard pipeline. In a single other case, a very good-scoring hit (1e-40) is missed by the FM-based pipeline - that match's alignment has the perfect pathological form (no good runs of nearly all identity) to flummox the seed stage as it is currently designed. That's true for all the lesser hits as well. They should clearly be caught, but doing so will require a change to the seed stage so that it either seeks multiple shorter matches or uses spaced seed methods. We are currently working on these, so I think this will eventually be addressed ... but for now, it's the best nhmmer can do.

traviswheeler commented 3 years ago

I'm attaching the query file mentioned above - it's a fasta file, but github doesn't like the extension .fa: query.txt

traviswheeler commented 3 years ago

Final thought: the changes here are mostly things that can be controlled by command-line flags, except for the one removed pruning condition from fm_ssv.c. Even so, I recommend that anyone using nhmmer's binary database (as produced by makehmmerdb) should shift to using these default parameters unless they have a specific willingness to trade lost matches for speed.

cryptogenomicon commented 3 years ago

Sounds great. @npcarter, could you have a look at this and test it out?

npcarter commented 3 years ago

I'm getting segfaults when I try to test the hmmerdb version of the test database. GDB reports the problem as "Thread 3 "nhmmer" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x2aaaad008700 (LWP 131843)] 0x000000000043b014 in fm_convertRange2DSQ (fm=0x7356a0, meta=0x736260, first=0, length=-353124137, complementarity=1, sq=0x2aaab42b34c0, fix_ambiguities=0) at fm_general.c:340 340 sq->dsq[i-first+1] = (fm->T[i/4] >> ( 0x6 - ((i&0x3)*2) ) & 0x3);"

Original command was "src/nhmmer --cpu=2 --tblout ~/travis_fix_test_db.tblout ~/travis_query.txt ~/89x_JGI_Saccharomycotina-jan21.db > ~/travis_fix_test_db.out"

This is on a CentOS machine with 128GB of RAM, and also on a Ubuntu machine with 32GB of RAM. I originally thought the problem might be that the laptop was running out of memory, but top doesn't show the CentOS machine using significant amounts of RAM before it crashes.

traviswheeler commented 3 years ago

Sadly, I can reproduce.

On a couple of centos machines here (3.10.0-693.5.2.el7.x86_64 and 3.10.0-1160.31.1.el7.x86_64, both gcc 9.2.0), I see a seg fault with --cpu=2 or 4 after ./configure --enable-debugging ... but only on one of the machines (3.10.0-693.5.2.el7.x86_64) with just ./configure. I also don't get a seg fault under other thread counts I've tested, or on other machines (Mac and Ubuntu ... details irrelevant). I'm sure that's a matter of illegal memory access pattern that gets (un)lucky under some conditions.

This error also crops up in the develop branch - I've reproduced it there, with different combinations of settings. I don't believe the segfault is at all tied to this PR (its changes don't touch any memory-access code). So: I think it's best to open up a new issue for the seg fault. I'll do that shortly, then submit a distinct PR for that when I figure out a fix. When that one gets resolved, then I'll ask to return to this one.

traviswheeler commented 3 years ago

@npcarter: the most recent commit (32a86b4) resolves the seg fault you identified. I believe you should be able to proceed with testing as before, which should confirm that both iss #257 and iss #172 are resolved.

npcarter commented 3 years ago

The new version looks good for functionality. Valgrind is reporting a few memory errors in both makehmmerdb and nhmmer when run on a db file, so once those get cleaned up, we should be set to merge.

traviswheeler commented 3 years ago

@npcarter : those two commits resolve the valgrind errors I've seen in several tests, so it should be ready for you to test again.

With a large database and long query (e.g. 5000 bases), my valgrind give several warnings like "Warning: set address range perms: large range". As I understand it, these are valgrind's way of warning you that the program has made a really large allocation - in this case, those appear to just be large allocations for the DP matrix.

npcarter commented 3 years ago

Ok, will do. Yes, the messages you're describing are Valgrind diagnostics, not error messages, so don't indicate a problem.

-Nick

On Sun, Oct 17, 2021 at 7:35 PM Travis Wheeler @.***> wrote:

@npcarter https://github.com/npcarter : those two commits resolve the valgrind errors I've seen in several tests, so it should be ready for you to test again.

With a large database and long query (e.g. 5000 bases), my valgrind give several warnings like "Warning: set address range perms: large range". As I understand it, these are valgrind's way of warning you that the program has made a really large allocation - in this case, those appear to just be large allocations for the DP matrix.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/EddyRivasLab/hmmer/pull/253#issuecomment-945217462, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDJBZEXBI2EWW6YQ2C4G2LUHNMT7ANCNFSM5CRDMQYA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

npcarter commented 3 years ago

This looks good to go. Valgrind is happy with makehmmerdb and nhmmer. Searching with a database version of a fasta file misses a few hits that are found when searching the original fasta file, but well within the description of makehmmerdb as trading some sensitivity for speed. Merging.

EddyRivasLab / hmmer

Fix for large loss in hits using nhmmer w/ FM #172 #253