DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

--report-minimizer-data; distinct minimizer exceeds inspect minimizer #445

Open NienkeMekkes opened 3 years ago

NienkeMekkes commented 3 years ago

Dear authors,

The new --report-minimizer-data option is a very promising feature! I do have a question about it. When I run kraken2-inspect on my database, one of the columns is described as: "amount of database minimizers that map to a taxon rooted in this clade". When I run kraken2 with --report-minimizer-data, the estimate in the distinct-minimizer column can be higher than this inspect value. I expected the inspect value to be the maximum number of distinct minimizers that could be found for that clade. Why is this not the case?

Thanks

For example: in my database, ~300,000 minimizers are rooted at the species (S) Bacteroides fragilis. In my kraken2 output, I found ~1,370,000 distinct minimizers for Bacteroides fragilis.

kdbchau commented 3 years ago

What is the command you are using?

NienkeMekkes commented 3 years ago

For running Kraken2, I typically use: kraken2 reads/ --db krakendb --paired --output sampleID_kraken_output.txt --report sampleID_kraken_report.txt --report-minimizer-data. The row for Bacteroides fragilis looks like:

20.62 458469 442261 21635904 1372051 S 817 Bacteroides fragilis

For kraken2-inspect, I use: kraken2-inspect --db krakendb. The corresponding row for Bacteroides fragilis is:

0.03 302714 290736 S 817 Bacteroides fragilis
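To make the comparison concrete, here is a minimal Python sketch that parses the two rows quoted above and compares the distinct-minimizer estimate with the inspect clade count. The column layouts are an assumption based on the Kraken 2 manual (with --report-minimizer-data, the fourth and fifth report columns are the minimizers seen in the read data and the estimated distinct minimizers), and the helper name `parse_row` and the column labels are made up for illustration. Check them against the output of your own Kraken 2 version.

```python
# Column layouts are assumed from the Kraken 2 manual; verify for your version.
REPORT_COLS = (
    "percent", "clade_reads", "taxon_reads",
    "minimizers_seen", "distinct_minimizers",
    "rank", "taxid", "name",
)
INSPECT_COLS = (
    "percent", "clade_minimizers", "taxon_minimizers",
    "rank", "taxid", "name",
)


def parse_row(line, columns):
    """Split a whitespace-separated row; the last column keeps its spaces."""
    parts = line.split(None, len(columns) - 1)
    return dict(zip(columns, parts))


# Rows copied from this thread.
report_row = parse_row(
    "20.62 458469 442261 21635904 1372051 S 817 Bacteroides fragilis",
    REPORT_COLS,
)
inspect_row = parse_row(
    "0.03 302714 290736 S 817 Bacteroides fragilis",
    INSPECT_COLS,
)

distinct_in_sample = int(report_row["distinct_minimizers"])
clade_in_database = int(inspect_row["clade_minimizers"])
ratio = distinct_in_sample / clade_in_database

print(f"taxid {report_row['taxid']} ({report_row['name']}):")
print(f"  distinct minimizers reported for the sample: {distinct_in_sample}")
print(f"  minimizers in the database clade:            {clade_in_database}")
print(f"  ratio: {ratio:.2f}")
```

For these two rows the reported distinct-minimizer estimate is roughly 4.5 times the inspect clade count, which is the discrepancy being asked about.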

mihkelvaher commented 3 years ago

Seems like a duplicate of #392

phspo commented 1 year ago

I can confirm this. Reading the source code is a bit confusing, since the option is referred to as "report kmer data" rather than minimizer data. Maybe the number is actually the number of assigned k-mers? Or does it perhaps also count distinct minimizers even if they don't belong to the taxon a read was assigned to?

As a suggestion, it could also be helpful to output minimizers/unique minimizers at the node level, in addition to the subtree rooted at a specific node (this can obviously be calculated from the subtree counts, or bottom-up for the entire tree).
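A rough sketch of the subtree-to-node calculation mentioned above, in Python: the count attached directly to a node is its clade (subtree) count minus the clade counts of its direct children. The function name `node_level_counts`, the tree, and the numbers are all made-up toy values, not taken from any real database. This is exact only for additive counts; for distinct-minimizer values, which are estimates, the subtraction should be treated as an approximation at best.

```python
def node_level_counts(clade_counts, children):
    """Return the count assigned directly to each node.

    clade_counts: node -> count for the whole subtree rooted at that node
    children:     node -> list of direct child nodes (leaves may be omitted)
    """
    return {
        node: total - sum(clade_counts[child] for child in children.get(node, ()))
        for node, total in clade_counts.items()
    }


# Toy values for illustration only.
clade_counts = {"genus": 100, "species_a": 60, "species_b": 30}
children = {"genus": ["species_a", "species_b"]}

own_counts = node_level_counts(clade_counts, children)
for node, count in own_counts.items():
    print(node, count)
```

In this toy tree, 10 of the 100 counts in the genus subtree belong to the genus node itself, and the two species keep their full counts because they have no children.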

phspo commented 1 year ago

It looks like this would count any minimizer found in the read, even if it doesn't match the taxon that gets assigned as the final classification?

https://github.com/DerrickWood/kraken2/blob/29d49c44a0aab83acc8af8f14cf72cdc36228dca/src/classify.cc#L547