DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
714 stars 271 forks source link

mmscanner NextMinimizer method minimum calculation #793

Open eric9n opened 8 months ago

eric9n commented 8 months ago

https://github.com/DerrickWood/kraken2/blob/acc22481bbeb32b870c12b403ce595fe7a6db770/src/mmscanner.cc#L181C35-L181C35

When the first k characters of seq are all fuzzy characters, if "strpos >= (sizet) k)" exits the loop, will it cause there to be no multiple candidate_lmers in the queue for comparison, and directly return the first candidate_lmer instead of the smallest one?

` // Sliding window minimum calculation while (! queue.empty() && queue.back().candidate > candidatelmer) queue.pop_back(); MinimizerData data = { candidate_lmer, queuepos }; if (queue_.empty() && queuepos >= k - l) { // Empty queue means front will change // Minimizer will change iff we've processed enough l-mers changedminimizer = true; } queue.pushback(data); // expire an l-mer not in the current window if (queue.front().pos < queuepos - k + l) { queue.erase(queue.begin()); // Change in front means minimizer changed changed_minimizer = true; }

// Change from no minimizer (beginning of sequence/near ambig. char)
if (queue_pos_ == k_ - l_)
  changed_minimizer = true;
queue_pos_++;

// Return only if we've read in at least one k-mer's worth of chars
if (str_pos_ >= (size_t) k_) {
  break;
}`
eric9n commented 8 months ago
// Return only if we've read in at least one k-mer's worth of chars
if (str_pos_ >= (size_t) k_) {
  break;
}

If the above if statement is true, it may occur that after k characters, each time the candidatelmer obtained will only be compared with the first value in queue. If the value in queue_ is smaller, then actually changed_minimizer is false, which means that the minimizer value has not changed. At this point, the NextMinimizer function returns the old value, which is not very meaningful for library construction or classification calculations. Instead, it repeats calculations and wastes computing resources.

eric9n commented 8 months ago

For example, the output result in the test file mmtest.cc

#include "kraken2_headers.h"
#include "mmscanner.h"

using namespace std;
using namespace kraken2;

int main() {
  string seq = "ACGATCGACGACG";
  MinimizerScanner scanner(10, 5, 0);
  scanner.LoadSequence(seq);
  uint64_t *mmp;
  while ((mmp = scanner.NextMinimizer()) != nullptr)
    printf("%016x\n", *mmp);
  return 0;
}
$ ./mmtest
00000000000002d8
0000000000000218
0000000000000218
0000000000000218

The last three lines are duplicate data, which has no impact on the mapping result of taxid and only increases the computational load.

eric9n commented 4 months ago

When such a scenario occurs, the same minimum sequence can result in multiple hits, leading to an increase in false positives.