Confusion on the algorithm used in database building?

DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system

MIT License

701 stars 268 forks source link

Confusion on the algorithm used in database building? #626

Open tuituit opened 2 years ago

tuituit commented 2 years ago

Dear Kraken2 developers, I'm using your excellent kraken2 software in my metagenomics project and have read through your paper on kraken2. And I am really confused about its database building principle. Below is my doubt:

Why does kraken2 use a minimizer instead of full-length k-mer for hashing ?
What is the role of spaced seed mask and toggle mask?
Considering collision rate, is a 15-bit compact hash code too short for large sequencing data which may contain millions of kmers doesn't actually exist in the database? Will a 32-bit hash code reduce the collision rate?

Looking forward to a reply, many many thanks!

jenniferlu717 commented 2 years ago

I don't have answers to all of your questions but Kraken2 was built as an improvement on Kraken1. Kraken 1 stored/used full kmers, but for hashing, we use only the minimizers to limit the size.

Spaced seeds/toggled masking are used to prevent bias towards low complexity sequences. More information is found in: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r46

tuituit commented 2 years ago

@jenniferlu717 Thanks for your reply! As per Kraken's paper, it uses minimizer and spaced seeds/toggled masking to generate uniform distributed minimizers and accelerate search. But for Kraken2, it stores hash code instead of kmers, much different from Kraken. While hash code is mathematically uniform, are these tricks(like minimizer(why not using the leading 32 bases(64 bits) for hashing) and spaced seeds/toggled masking) useful anymore? In a word, Kraken2 compresses kmer to a 15-bit hash code, therefore, reducing storage space, which is the main difference from the traditional hashmap. Am I right？