Closed Augustin-Zidek closed 4 months ago
I haven't tried to reproduce the situation (and am not sure where to find rnacentral_24_0_clustered_rep_seq.fasta, so I don't know how big it is), but I can start with a couple observations
How long are the other sequences you're searching with? Are there actually 2KB queries that take only a few minutes to run? Do these two sequences both take 8+ hours, or is it just sequence 2?
Thanks for taking a look.
I haven't tried to reproduce the situation (and am not sure where to find rnacentral_24_0_clustered_rep_seq.fasta, so I don't know how big it is), but I can start with a couple observations
Ah, sorry. This db is just 13 GB (13271 Mbases) and produced from the vanilla RNACentral FASTA using MMseqs2:
mmseqs2 easy-linclust \
rnacentral.fasta \
/tmp/out.fasta \
/tmp/mmseqs_tmp \
-c 0.9 \
--min-seq-id 0.9 \
--cov-mode 0
But we saw this slow behavior being consistent across other dbs (i.e. NT RNA) as well with the same problematic sequences.
How long are the other sequences you're searching with?
This problem typically occurs with longer RNA sequences.
Are there actually 2KB queries that take only a few minutes to run?
AFAIK yes, but I will need to double check. But my experience is that the extremely long runtime correlates with the number of hits, not so much with the query length. Of course, the longer the query sequence, the longer the runtime, but the 8+ hour runtime happens only when the number of hits is massive.
Do these two sequences both take 8+ hours, or is it just sequence 2?
Yes, both, and below is another one that is even more extreme and takes 24+ hours.
ACAGACCUGAGUGUGGCAGGACUACCCGCCGAACUUAAGCAUAUUACUCAGCGGAGGAAAAGAAAACAACCGUGAUUCUCUCAGUGAGCGGCGAGCGAAGAGGGAACGAACUCGUUGCCGAAUCGGGUCUUACAAGGCCUUGAGUUGUGGCACAAAUAUCAUUGUGUGGUUGGUGCAGGAGUGGCGAAAUUCAAGCGAAAAGCAAACCCGUUGCUGAAUACAACCCUUCAUGUGAGUAUUGAGCCAAAGAAGGUGUUAGCCCAUUGAGCCAUAAACCUGAGCGCCUCAUGAACUGUAUUAAGCGAGUAGCACUGUUUGGGAAUGCAGUGUCAAGUUGGCAGGUAUUUGUCUGCCAAAGUUAAAUACAGAGUAGGAAGACCGAUAGCGAACAAGUAGCGUGAGCGAAAGUUUGAAAAGCACUUUGGAAAGAGAGUGAAAUAGAACCUGAAGUCGUGACAAAGACAAUGGAAGUACCUCCAUUUCAACGGCCAGAACCAGCCCUACUGCAUUAUCCUGUCGGGGAUCAAAAUCCGGCGGGCAACGAAGUGCAAGAAUCUUGGGUCUUGGUGAAAUGAGCGGAGUAUUUGAGCUGCGCACGGCGCAAGCCGCACGCACUGGUUUCUUCUCCGUGAUGCUCCAUGUUGGCGAUGGAAAUGGGGUGCCCCCACCCGUCUUGAAACACGGACCAAGGAGUCAAACAGACGCGCAAGGGAGGAGAUAUGGUUGCCAGUACUUCUCUCGUACUGAAAGGGAGAUGCAAAAUGCAUCUGUGGUGUGCACACGUCAAAUCCGUUUGAUUUGCCCAACACCGACCGGCCCUCCGUGGGUUGCGUUGGAGUGUGCCUGUUUGGACCCGAAAGGUGGUGAACUAUGCCUGAACAAGGUGAAGCCCGGGGAAACCCAGGUGGAGGCCUGUAGCGAUGCUGACGUGCAAAUCGCUCGUAUGAUUUGGGUAUCGGAGCGAAAGACUCAUCGAACCACCUAGUAGCUGGUUCACAUCGAAGUUUCCCUCAGGAUAGCUGGUGCUAGUAGAAGUAUUGUGCGGUAAAGCAAAUGAUUAGAGGCAUUGGUGUUCUUAGAAUAUCGACCUAUUCUCAAACUUUAAAUGUGCAAACAAACCGUGCCUUAGCCAACUGCAAGGCCGGAAAGAAAAGAAUUAGAGGCACCAAGUGGGCCUUCUUCGGUAAGCGGAGCAGGCGAUGCGGAAUGAACCGCUUAGGGAUAUGUGGCGUCAAAACUUAUGGGCUCAUUCUCGAUACGUGAAAAGGGUGUUGGUGGAUAUGGACAGUUGGACGGUGGCCAUGGAAGUCGGCAUCCGCUAAGGAGUGUGUAACAACUCACCAACCGAAUCCACCGGCCCCGAAAAUGGAUGACGCUUAAGCCCAAUUAGUGAUUGCCCAUUAUUCCUUUGGUAAGGGCGGAAACAAAUCUUGGAGAAAGUGUAGCCCUUCCGAUGUAGAUGUGGCCUGGAGGUUAGGACGAAGCUUAUGGCGUGAGCCUAAGAUGGACCGGCCUCUAGUGCAGAUCUUGGUUGGCGUAGCAAAGAUCUAACGGAGAUAUACUCAACAUGCAACGUUGGAUACUGGAGCGGGGAAGGAUUUCGUGCCAACGGCACUCGUACACGAGUUGUUCGGAAACUGAGCACAACGUUAUAUCGUUUUGUUAGGAAAGUGAAGGUGUGUCGGCGGCAGCUGCUCUCGGGUAGCUGUUCGACUGGCCAUAACUGAAAAGGGGCAACAGAGAACCUGGGAUUAUAUUCCAAAAAGAAAUUGCAUGUGGGAA
Thanks for the clarifications.
the 8+ hour runtime happens only when the number of hits is massive.
I'm confident that this is the answer to your question. The three sequences you provided are all rRNA subunits. Your target database is RNA Central, which is swamped with rRNAs ... and your clustering is only going to remove some of them because of the high --min-seq-id and -c cutoffs. The result, as you say, will be a "massive" number of matches. HMMER doesn't have a mechanism for cutting calculations short once it's found "enough" matches (whatever that means), so nhmmer is just going to keep on churning through all those alignments. Since they're both long and numerous, the run time is going to be large.
The upshot: this (almost certainly) isn't a bug. In fact, it's something that RNA Central deals with, too: https://rnacentral.org/help/sequence-search#rrna (I'm pretty sure they described the process in more detail in a paper, but I don't have the time to hunt it down).
The key idea is to work around the problem by removing all rRNA from the dataset, then performing two searches: one against the remaining sequences in the set, and another against a much smaller set of rRNA sequences (or maybe just models, e.g. with http://eddylab.org/software/ssu-align/). We're moving out of my area of expertise, but I bet @nawrockie may have better guidance.
(Note: sure, there are possible solutions within the search software itself ... but these don't exist in nhmmer)
Thanks for the database pre-filtering tip, Travis, we will give that a try.
We are seeing Nhmmer running incredibly slowly (8+ hours with an in-RAM database on a powerful machine) on certain sequences against RNA Central, ant NT RNA.
Both examples are from recently added PDB entries.
Example 1
Example 2
The exact command is:
For comparison, other queries take 10s of minutes. Are we doing anything wrong, or are these sequences producing too many hits, or could this indicate a performance issue in Nhmmer?
Thanks!