DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
714 stars 271 forks source link

kraken2 2.1.3 empty taxonomic level field in k2report #759

Open karlkashofer opened 1 year ago

karlkashofer commented 1 year ago

Using the standard 16gb library and some bacterial reads we seem to get empty taxonomic level fields in the kraken2 report file. For example, the root and bacteria lines have no taxonomic level.

bracken does not like that... Are we doing something wrong ?

devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ docker run -it -v $PWD:/analysis kraken2 bash -c "cd /analysis; kraken2 --db ./databases/standard_16gb --threads 4 --report D7055-2.k2report --report-minimizer-data --minimum-hit-groups 3 --gzip-compressed D7055-2.fasta.gz > D7055-2.kraken2"
Loading database information... done.
100000 sequences (12.56 Mbp) processed in 0.424s (14137.4 Kseq/m, 1774.97 Mbp/m).
  78744 sequences classified (78.74%)
  21256 sequences unclassified (21.26%)
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ head -n 20 D7055-2.k2report 
 21.26  21256   21256   0       0       U       0       unclassified
 78.74  78744   1580    523338  339180          1       root
 77.16  77164   0       509838  331368  1       131567    cellular organisms
 77.16  77160   177     509352  330978          2           Bacteria
 76.96  76964   2182    497689  323733  P       1224          Pseudomonadota
 73.50  73497   504     458068  294546  C       1236            Gammaproteobacteria
 69.57  69566   5       427073  267411  O       72274             Pseudomonadales
 69.56  69561   962     426828  267031  F       135621              Pseudomonadaceae
 68.60  68597   54681   409859  254097  G       286                   Pseudomonas
 13.79  13794   820     41188   25693   G1      136841                  Pseudomonas aeruginosa group
 12.96  12960   12889   36636   22788   S       287                       Pseudomonas aeruginosa
  0.01  10      2       12      10      S1      208964                      Pseudomonas aeruginosa PAO1
  0.01  7       7       9       7       S2      1147787                       Pseudomonas aeruginosa PAO1H2O
  0.00  1       1       1       1       S2      1367494                       Pseudomonas aeruginosa PAO1-VE13
  0.01  7       7       12      10      S1      1009714                     Pseudomonas aeruginosa PAK
  0.01  5       5       10      8       S1      381754                      Pseudomonas aeruginosa PA7
  0.01  5       5       10      10      S1      1400868                     Pseudomonas aeruginosa VRFPA04
  0.00  4       4       7       7       S1      1457392                     Pseudomonas aeruginosa PA96
  0.00  4       4       6       6       S1      1448140                     Pseudomonas aeruginosa YL84
  0.00  4       4       4       4       S1      1415629                     Pseudomonas aeruginosa MTB-1
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ docker run -it -v $PWD:/analysis kraken2 bash -c "cd /analysis; kraken2 --version"
Kraken version 2.1.3
Copyright 2013-2023, Derrick Wood (dwood@cs.jhu.edu)
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline/databases/standard_16gb$ head ktaxonomy.tsv 
1       |       1       |       R       |       0       |       root
10239   |       1       |       D       |       1       |       Viruses
131567  |       1       |       R1      |       1       |       cellular organisms
2787854 |       1       |       R1      |       1       |       other entries
10472   |       10239   |       F       |       2       |       Plasmaviridae
10474   |       10239   |       F       |       2       |       Fuselloviridae
karlkashofer commented 1 year ago

It seems to be restricted to toplevel domains:

karl@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ cat D7055-2.k2report | grep -P '\t\t'
 78.74  78744   1580    523338  339180          1       root
 77.16  77160   177     509352  330978          2           Bacteria
  0.00  4       0       49      48              2759        Eukaryota
karl@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ 

Manually adding the correct R and D flags fixes the problem, bracken is happy.

I'd still like to know why this happens.

lmolokin commented 1 year ago

I've also run into the issue of report files missing rank codes for root and domains. Just recently when using the nt database and also previously when using a custom database after combining PlusPFP and my own sequence via --add-to-library.

For the nt database, its inspect.txt file shows all rank codes are present. Only the reports show them as missing.

Command I used for the runs:

kraken2 \
--confidence 0.1 \
--minimum-hit-groups 3 \
--gzip-compressed \
--db ${db} \
--threads ${SLURM_CPUS_ON_NODE} \
--output ${SLURM_JOB_NAME}.out \
--report ${SLURM_JOB_NAME}.report \
--classified-out ${SLURM_JOB_NAME}_class.fastq \
--unclassified-out ${SLURM_JOB_NAME}_unclass.fastq \
${reads}
yeemey commented 11 months ago

I just ran into the same issue of missing rank codes with Kraken version 2.1.2, and both k2_pluspf_20220908 and k2_pluspf_20230605 db.

In my kraken reports, "D" is missing, but subdomain ranks ("D1") show up. "R" is also missing, and the subrank partially shows up ("1" instead of "R1"). This caused issues for me when using KrakenTools/extract_kraken_reads.py with the --include-children flag. I resolved this by manually adding the missing rank codes.

My command:

kraken2 \
--db k2_pluspf_20220908 \
--gzip-compressed \
--threads 32 \
--minimum-hit-groups 3 \
--report-minimizer-data \
--output k2db_20220908/hits.txt \
--report k2db_20220908/report.txt \
--paired \
S18_R1.fastq.gz \
S18_R2.fastq.gz
timtimbruno commented 9 months ago

Ugg, if you have hundreds of files to fix the following script might be useful. It replaces the missing classification of root, Bacteria, and Eukaryota of all kreports in the CWD with -,D,D respectively. Modify for your given database accordingly.

for val in *.kreport

do

fix=$(echo "$val"|sed -r 's/^(.+).kreport/\1_corrected.kreport/g') sed -r 's/^([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)(root)$/\1\2\3-\t\5\6/g' "$val" | \ sed -r 's/^([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\s+)(Bacteria)$/\1\2\3D\t\5\6/g' | \ sed -r 's/^([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)([^\t]*\s+)(Eukaryota)$/\1\2\3D\t\5\6/g' > "$fix"

done

jenniferlu717 commented 9 months ago

@karlkashofer I have noticed this issue recently and am currently investigating.

I'll try to update soon. We are aware of the issue and are trying to address it.