Open pfeiferd opened 1 year ago
Hello, thank you for your report. We looked into the issue, and I believe the software works as intended. Two columns of the report together describe the k-mer counts. The "kmers" column gives the number of distinct k-mers. The following column, "dup", gives the duplication ratio. The total number of classified k-mers is the product of these two columns: 104 × 1.97 = 204.88 ≈ 205.
As an experiment, I tried duplicating the last read. Here is my output. The number of distinct k-mers did not change, but the duplication ratio went up:

%    reads  taxReads  kmers  dup   cov        taxID    rank          taxName
100  3      0         104    2.97  2.963e-07  1        no rank       root
100  3      0         104    2.97  2.963e-07  10239    superkingdom  Viruses
100  3      0         104    2.97  2.163e-06  2559587  clade         Riboviria
100  3      0         104    2.97  2.661e-06  2732396  kingdom       Orthornavirae
100  3      0         104    2.97  9.088e-06  2732408  phylum        Pisuviricota
100  3      0         104    2.97  1.286e-05  2732506  class         Pisoniviricetes
100  3      0         104    2.97  3.039e-05  76804    order         Nidovirales
100  3      0         104    2.97  5.454e-05  2499399  suborder      Cornidovirineae
100  3      0         104    2.97  5.454e-05  11118    family        Coronaviridae
100  3      0         104    2.97  5.724e-05  2501931  subfamily     Orthocoronavirinae
100  3      0         104    2.97  0.0002109  694002   genus         Betacoronavirus
100  3      0         104    2.97  0.001188   2509511  subgenus      Sarbecovirus
100  3      0         104    2.97  0.001784   694009   species       Severe acute respiratory syndrome-related coronavirus
100  3      3         104    2.97  0.003583   2697049  no rank       Severe acute respiratory syndrome coronavirus 2
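The arithmetic in this reply can be checked with a few lines of Python. This is only a sketch: the helper name `total_kmers` is made up for illustration and is not part of KrakenUniq. The inputs are the "kmers" and "dup" values quoted in this thread (104 with 1.97 for the two-read file, 104 with 2.97 after duplicating the last read). Because "dup" is printed with two decimals, the product is rounded to the nearest integer.

```python
def total_kmers(distinct_kmers: int, dup: float) -> int:
    """Estimate the total classified k-mers as distinct k-mers x duplication ratio.

    Hypothetical helper for illustration only; `dup` is a rounded value
    taken from the report, so the product is rounded to an integer.
    """
    return round(distinct_kmers * dup)


two_reads = total_kmers(104, 1.97)    # original two-read file
three_reads = total_kmers(104, 2.97)  # last read duplicated

print(two_reads)                # 205
print(three_reads)              # 309
print(three_reads - two_reads)  # 104
```

The difference of 104 matches the 104 k-mers assigned to taxon 2697049 in the duplicated read, which is consistent with the distinct count staying at 104 while only the duplication ratio grows.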
Thank you - great info. Sorry for my false report and the misunderstanding. Thanks as well for the quick answer.
Dear krakenuniq-team,
1) Take the two reads below (at the end of this issue report) and put them in a FASTQ file (let's call the file "/mnt/covid/fastqs/error.fastq"). The file contains two reads that can be assigned to SARS-CoV-2.
2) Run krakenuniq as follows (or correspondingly):
krakenuniq --exact --report-file kuout.csv --threads 8 -db /mnt/m2/kuniqdb/kuniq_standard_plus_eupath_minus_kdb /mnt/covid/fastqs/error.fastq
The Kraken 1 part of krakenuniq then prints the following (correct) classification output to the console:
C A01245:144:HMV7FDSX3:1:2217:5963:30452/1 2697049 151 2697049:101 0:20
C A01246:144:HMV7FDSX3:1:2217:5963:30452/2 2697049 148 0:14 2697049:104
So in total, there are 205 kmers that belong to 2697049.
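The total of 205 can be reproduced by summing the `taxID:count` pairs in the last field of each classification line. The following is a minimal sketch, not KrakenUniq code: the function name `kmers_per_taxon` is hypothetical, and it assumes tab-separated fields with the k-mer mapping in the final field as space-separated `taxID:count` pairs. It does not handle any other tokens that may appear in that field.

```python
from collections import Counter


def kmers_per_taxon(lines):
    """Sum k-mer hits per taxon ID from Kraken-style classification lines.

    Assumption: fields are tab-separated and the last field holds
    space-separated `taxID:count` pairs (for example "2697049:101 0:20").
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for pair in fields[-1].split():
            taxid, count = pair.split(":")
            totals[taxid] += int(count)
    return totals


# The two classification lines from this report, written with tabs.
lines = [
    "C\tA01245:144:HMV7FDSX3:1:2217:5963:30452/1\t2697049\t151\t2697049:101 0:20",
    "C\tA01246:144:HMV7FDSX3:1:2217:5963:30452/2\t2697049\t148\t0:14 2697049:104",
]

totals = kmers_per_taxon(lines)
print(totals["2697049"])  # 205 (101 + 104)
print(totals["0"])        # 34  (20 + 14, k-mers without a taxon hit)
```

Note that this counts every k-mer occurrence, so a k-mer seen in both reads is counted twice.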
3) BUT the derived report file from krakenuniq counts only 104 kmers. It seems to miss the kmers from the first read entirely. This is the corresponding content of the report file ("kuout.csv" from above):
%    reads  taxReads  kmers  dup   cov        taxID    rank          taxName
100  2      0         104    1.97  3.029e-09  1        no rank       root
100  2      0         104    1.97  4.136e-07  10239    superkingdom  Viruses
100  2      0         104    1.97  2.882e-06  2559587  clade         Riboviria
100  2      0         104    1.97  3.825e-06  2732396  kingdom       Orthornavirae
100  2      0         104    1.97  1.131e-05  2732408  phylum        Pisuviricota
100  2      0         104    1.97  1.569e-05  2732506  class         Pisoniviricetes
100  2      0         104    1.97  3.471e-05  76804    order         Nidovirales
100  2      0         104    1.97  6.111e-05  2499399  suborder      Cornidovirineae
100  2      0         104    1.97  6.111e-05  11118    family        Coronaviridae
100  2      0         104    1.97  6.451e-05  2501931  subfamily     Orthocoronavirinae
100  2      0         104    1.97  0.000212   694002   genus         Betacoronavirus
100  2      0         104    1.97  0.001191   2509511  subgenus      Sarbecovirus
100  2      0         104    1.97  0.001788   694009   species       Severe acute respiratory syndrome-related coronavirus
100  2      2         104    1.97  0.003586   2697049  no rank       Severe acute respiratory syndrome coronavirus 2
Given the HIGH RELEVANCE of the issue for result quality, please respond to this issue as soon as possible and fix the potential bug...
Thanks and best regards, Daniel
@A01245:144:HMV7FDSX3:1:2217:5963:30452/1
CAGCAACACAGTTGCTGATTCTCTTCCTGTTCCAAGCATAAACAGATGCAAATCTGGTGGCGTTAAAAACTTCACCAAAAGGGCACAAGTTTGTAATATTAGGAAATCTAACAATAGATTCTGTTGGTTGGTCTATAAAGTTAGAAGTGTG
+
FFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFF:FFFFFF:F::FF:FFF:F:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF::FFFF:FF,F,FFFFFFF:FFFFF:FFF,:FF,:,,,FFF,FFF,F::FF,FFFFFF,::,,,F
@A01246:144:HMV7FDSX3:1:2217:5963:30452/2
ACTTCTAACTTTATAGTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTTTATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFF,FFFFFFF,FFFFFFFFFFFFFFFFFFFFF,FF:FFFFFFFFFFFFF,F,FFFFFFFF:FFFFFFFFF,FF