fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

Blatant miscount of kmers in krakenuniq report file #145

Open pfeiferd opened 1 year ago

pfeiferd commented 1 year ago

Dear krakenuniq-team,

1) Take the two reads below (at the end of this issue report) and put them in a FASTQ file (let's call it "/mnt/covid/fastqs/error.fastq"). The file contains two reads that can be assigned to SARS-CoV-2.

2) Run krakenuniq as follows (or correspondingly):

krakenuniq --exact --report-file kuout.csv --threads 8 -db /mnt/m2/kuniqdb/kuniq_standard_plus_eupath_minus_kdb /mnt/covid/fastqs/error.fastq

The Kraken 1 part of krakenuniq then prints the following (correct) classification output to the console:

C A01245:144:HMV7FDSX3:1:2217:5963:30452/1 2697049 151 2697049:101 0:20
C A01246:144:HMV7FDSX3:1:2217:5963:30452/2 2697049 148 0:14 2697049:104

So in total, there are 205 kmers that belong to 2697049.

3) BUT the derived report file from krakenuniq counts only 104 kmers. It seems to miss the kmers from the first read entirely. This is the corresponding content of the report file ("kuout.csv" from above):

%    reads  taxReads  kmers  dup   cov        taxID    rank          taxName
100  2      0         104    1.97  3.029e-09  1        no rank       root
100  2      0         104    1.97  4.136e-07  10239    superkingdom  Viruses
100  2      0         104    1.97  2.882e-06  2559587  clade         Riboviria
100  2      0         104    1.97  3.825e-06  2732396  kingdom       Orthornavirae
100  2      0         104    1.97  1.131e-05  2732408  phylum        Pisuviricota
100  2      0         104    1.97  1.569e-05  2732506  class         Pisoniviricetes
100  2      0         104    1.97  3.471e-05  76804    order         Nidovirales
100  2      0         104    1.97  6.111e-05  2499399  suborder      Cornidovirineae
100  2      0         104    1.97  6.111e-05  11118    family        Coronaviridae
100  2      0         104    1.97  6.451e-05  2501931  subfamily     Orthocoronavirinae
100  2      0         104    1.97  0.000212   694002   genus         Betacoronavirus
100  2      0         104    1.97  0.001191   2509511  subgenus      Sarbecovirus
100  2      0         104    1.97  0.001788   694009   species       Severe acute respiratory syndrome-related coronavirus
100  2      2         104    1.97  0.003586   2697049  no rank       Severe acute respiratory syndrome coronavirus 2
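The 205 figure quoted for the classification output can be reproduced by summing the taxid:count pairs of the two lines. The following is only a minimal sketch: the helper name `kmer_hits` is hypothetical, and it assumes the first four whitespace-separated fields are status, read ID, taxon ID and read length, with every remaining field being a taxid:count pair, as in the lines quoted in this report.

```python
from collections import Counter


def kmer_hits(line):
    """Sum the taxid:count pairs of one classification line (hypothetical helper)."""
    fields = line.split()
    hits = Counter()
    # Assumption: fields[0:4] are status, read ID, taxon ID and read length;
    # everything after that is a "taxid:count" pair.
    for pair in fields[4:]:
        taxid, _, count = pair.rpartition(":")
        hits[taxid] += int(count)
    return hits


lines = [
    "C A01245:144:HMV7FDSX3:1:2217:5963:30452/1 2697049 151 2697049:101 0:20",
    "C A01246:144:HMV7FDSX3:1:2217:5963:30452/2 2697049 148 0:14 2697049:104",
]

total = Counter()
for line in lines:
    total.update(kmer_hits(line))  # Counter.update adds the counts

print(total["2697049"])  # 205
print(total["0"])        # 34
```

This counts every k-mer occurrence, so a k-mer seen in both reads is counted twice.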

Given the HIGH RELEVANCE of this issue for result quality, please respond as soon as possible and fix the potential bug...

Thanks and best regards, Daniel


@A01245:144:HMV7FDSX3:1:2217:5963:30452/1
CAGCAACACAGTTGCTGATTCTCTTCCTGTTCCAAGCATAAACAGATGCAAATCTGGTGGCGTTAAAAACTTCACCAAAAGGGCACAAGTTTGTAATATTAGGAAATCTAACAATAGATTCTGTTGGTTGGTCTATAAAGTTAGAAGTGTG
+
FFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFF:FFFFFF:F::FF:FFF:F:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF::FFFF:FF,F,FFFFFFF:FFFFF:FFF,:FF,:,,,FFF,FFF,F::FF,FFFFFF,::,,,F
@A01246:144:HMV7FDSX3:1:2217:5963:30452/2
ACTTCTAACTTTATAGTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTTTATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFF,FFFFFFF,FFFFFFFFFFFFFFFFFFFFF,FF:FFFFFFFFFFFFF,F,FFFFFFFF:FFFFFFFFF,FF

alekseyzimin commented 1 year ago

Hello, thank you for your report. We looked at the issue, and I believe the software works properly. The report has two columns related to k-mer counts. The "kmers" column is the number of distinct k-mers, and the following column, "dup", is the duplication ratio. The total number of classified k-mers is the product of these two columns: 104 × 1.97 = 204.88 ≈ 205.
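To make the difference between distinct and total k-mers concrete, here is a toy sketch. It is not KrakenUniq's implementation: the function names are hypothetical, k = 4 is chosen only to keep the example small, and it assumes that a k-mer and its reverse complement are treated as the same (canonical) k-mer.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq, k):
    """Yield each k-mer as the lexicographically smaller of itself and its reverse complement."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def kmer_stats(reads, k):
    """Return (distinct k-mers, total k-mers, duplication ratio) over all reads."""
    seen = set()
    total = 0
    for read in reads:
        for kmer in canonical_kmers(read, k):
            seen.add(kmer)
            total += 1
    distinct = len(seen)
    return distinct, total, total / distinct


# Toy input: a read and its reverse complement, so both contain the same k-mers.
reads = ["ACGTAC", reverse_complement("ACGTAC")]
distinct, total, dup = kmer_stats(reads, k=4)
print(distinct, total, dup)  # 3 6 2.0
```

In this toy case every k-mer occurs twice, so the distinct count stays at 3 while the total is 6, which is the same relationship as "kmers" × "dup" in the report.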

alekseyzimin commented 1 year ago

As an experiment, I tried duplicating the last read. Here is my output. The number of distinct k-mers did not change, but the duplication ratio went up:

%    reads  taxReads  kmers  dup   cov        taxID    rank          taxName
100  3      0         104    2.97  2.963e-07  1        no rank       root
100  3      0         104    2.97  2.963e-07  10239    superkingdom  Viruses
100  3      0         104    2.97  2.163e-06  2559587  clade         Riboviria
100  3      0         104    2.97  2.661e-06  2732396  kingdom       Orthornavirae
100  3      0         104    2.97  9.088e-06  2732408  phylum        Pisuviricota
100  3      0         104    2.97  1.286e-05  2732506  class         Pisoniviricetes
100  3      0         104    2.97  3.039e-05  76804    order         Nidovirales
100  3      0         104    2.97  5.454e-05  2499399  suborder      Cornidovirineae
100  3      0         104    2.97  5.454e-05  11118    family        Coronaviridae
100  3      0         104    2.97  5.724e-05  2501931  subfamily     Orthocoronavirinae
100  3      0         104    2.97  0.0002109  694002   genus         Betacoronavirus
100  3      0         104    2.97  0.001188   2509511  subgenus      Sarbecovirus
100  3      0         104    2.97  0.001784   694009   species       Severe acute respiratory syndrome-related coronavirus
100  3      3         104    2.97  0.003583   2697049  no rank       Severe acute respiratory syndrome coronavirus 2
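The two duplication ratios in this thread can be checked with a few lines of arithmetic. This is a sketch under one assumption, which matches the numbers here but is not taken from the KrakenUniq source code: that "dup" is the total number of classified k-mers divided by the number of distinct k-mers, rounded to two decimals.

```python
# Distinct k-mers reported for taxon 2697049 in both runs.
distinct = 104

# Original file: 101 k-mers from the first read plus 104 from the second.
total_two_reads = 101 + 104
# Experiment: the last read is duplicated, adding its 104 k-mers once more.
total_three_reads = total_two_reads + 104

# Assumption: dup = total / distinct, rounded to two decimals.
dup_two = round(total_two_reads / distinct, 2)
dup_three = round(total_three_reads / distinct, 2)
print(dup_two, dup_three)  # 1.97 2.97
```

Both values agree with the "dup" column of the corresponding report, while the "kmers" column stays at 104.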

pfeiferd commented 1 year ago

Thank you - great info. Sorry for my false report and the misunderstanding. Thanks as well for the quick answer.