EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org

Stop phmmer search after N significant hits have been found. #264

Closed. davidjakubec closed this issue 2 years ago

davidjakubec commented 2 years ago

Dear HMMER team,

I'm wondering if it would be possible or practical to add an option to phmmer that would stop the search after a specified number N of significant hits has been found. This would be very useful when one is working with large target databases (e.g., UniProtKB) but subsamples the resulting MSAs downstream in the analysis. Unfortunately, I don't know C or the HMMER codebase well enough to judge this from looking at src/phmmer.c, but perhaps you would know quickly whether it could work.

Best, David

cryptogenomicon commented 2 years ago

I don't think we'll want to do it that way, because the subsample you'd get (from early stopping) would be biased by the arbitrary order of the target sequence database. We do plan to have HMMER4 do downsampling in some other ways, though.

davidjakubec commented 2 years ago

Dear professor,

my idea was to pre-shuffle the target sequence database once and perform the searches with my queries against these shuffled sequences, stopping early once the threshold is reached. I know this is not the same as drawing a fresh subsample from the output MSA for each individual query (e.g., two similar query sequences will likely share the first several hits and thus yield the same MSA if the threshold is low), but the resulting MSAs would still be sufficiently random for my application. Unfortunately, I currently spend most of my time aligning sequences that I end up discarding.
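For reference, here's a rough sketch of the one-time pre-shuffling step I have in mind, written in plain Python rather than with the Easel tools; it assumes the FASTA database fits in memory, and the file names are just placeholders:

```python
# Rough sketch of the one-time pre-shuffling step (plain Python, not Easel).
# Assumes the FASTA database fits in memory; file names are placeholders.
import random

def shuffle_fasta(in_path, out_path, seed=42):
    """Write a copy of in_path with its sequence records in random order."""
    records, current = [], []
    with open(in_path) as fh:
        for line in fh:
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
        if current:
            records.append("".join(current))
    # Guard against a missing final newline so shuffled records don't merge.
    records = [r if r.endswith("\n") else r + "\n" for r in records]
    random.Random(seed).shuffle(records)
    with open(out_path, "w") as out:
        out.writelines(records)

# Example (hypothetical file names):
# shuffle_fasta("uniprotkb.fasta", "uniprotkb_shuffled.fasta")
```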

David

npcarter commented 2 years ago

The problem with stopping a search early is that HMMER can't determine which hits are significant until it knows how big the database is, which it can (currently) only do by searching the entire database and counting the sequences. This is a limitation of the FASTA file format, so there isn't much we can do about it. When searching a database, HMMER computes a bit score for each sequence and converts it into a preliminary e-value, using the number of sequences searched so far as the database size. After that pass, it goes through and re-computes the e-value of each sequence/HMM comparison using the full database size. The effect is that sequences early in the database have a very easy time meeting the e-value threshold during the first pass, but often fail the reporting threshold once the full database size is used to compute the e-value.
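To make that concrete, here's a toy sketch (not HMMER's code; the P-value, threshold, and database sizes below are made up) showing how the same comparison can pass a threshold against a partial database but fail it against the full one. The reported per-sequence e-value is the comparison's P-value multiplied by the number of target sequences Z:

```python
# Toy illustration: per-sequence E-value = P-value * Z, where Z is the number
# of target sequences.  All numbers below are hypothetical.

def evalue(pvalue, z):
    """E-value for a single query/target comparison against Z sequences."""
    return pvalue * z

pvalue = 1e-8      # hypothetical P-value for one comparison
threshold = 0.01   # hypothetical reporting threshold

print(evalue(pvalue, 10_000) <= threshold)       # True:  1e-4 after 10k sequences
print(evalue(pvalue, 250_000_000) <= threshold)  # False: 2.5 against a UniProtKB-scale Z
```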

If we took your suggestion and stopped searching after finding the first N hits, we'd have to rely on the preliminary e-values, and there'd be a very high chance that many of those N hits would drop below the reporting threshold in the final e-value computation, so we'd be returning bad hits to the user.

Parsing large numbers of hits can be challenging. One suggestion would be to use the --tblout argument to generate a tabular version of the results, process that to determine which hits you want to include in the alignment, and then extract the details of those hits from the full output.
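For example, here's a minimal sketch of that filtering step (not part of HMMER; it assumes the standard --tblout layout, where comment lines start with '#', the first whitespace-delimited column is the target name, and the fifth is the full-sequence e-value; the cutoff and file name are placeholders):

```python
# Minimal sketch: collect the first n targets in a phmmer --tblout file whose
# full-sequence E-value passes a cutoff.  Column positions assume the standard
# tblout layout.

def first_n_significant(tblout_path, n, evalue_cutoff=0.01):
    hits = []
    with open(tblout_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            if float(fields[4]) <= evalue_cutoff:  # full-sequence E-value
                hits.append(fields[0])             # target name
                if len(hits) == n:
                    break
    return hits

# The returned names could then be handed to an Easel tool such as esl-sfetch,
# or to your own script, to pull those hits out of the full output.
# hits = first_n_significant("phmmer.tbl", n=500)
```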

Hope this helps,

-Nick


davidjakubec commented 2 years ago

Thank you for the detailed answer. I thought that the size of the target database (which I know) could be specified with the -Z option, but it seems I've misunderstood that option. What I imagined is evidently more complicated than I'd thought. Please feel free to close the issue if you feel nothing more can be done.

Parsing the MSAs is really not much of an issue; I use the Easel tools to do the actual file content manipulation, and my Python scripts to do the rest.