DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

full taxa naming for kraken output #738

Open mmcoff opened 1 year ago

mmcoff commented 1 year ago

I am using the following code to run kraken2 with the standard 67GB database:
kraken2 --db /data/coffmanm/tools/krakenBig --threads 10 --confidence 0.05 --output krakenOut/${base}.output.txt --report krakenOut/${base}.report.txt --use-mpa-style --report-zero-counts --gzip-compressed --use-names --paired ${base}_R1_001.trimmed.fastq.gz ${base}_R2_001.trimmed.fastq.gz

In the report output, some of the taxa names are incomplete (e.g., d__Bacteria|c__Deltaproteobacteria|g__Dissulfurimicrobium would ideally contain |p__Proteobacteria). Is there a way to edit the code so that the full taxa name is displayed in the report?

palatinate commented 1 year ago

If you check the NCBI taxonomy entry for Dissulfurimicrobium, its lineage has no entry for Proteobacteria. It does have a clade-level entry, "delta/epsilon subdivisions", which kraken2 does not report.

What you could do (without changing the code) is download the new taxonomy dump from NCBI: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/ It contains a file named fullnamelineage.dmp, which you can then merge with your kraken2 results on the taxonomic IDs (e.g., 9606) present in both files. However, this won't give you the rank of each taxonomic level.
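A rough sketch of that merge in Python, in case it helps. It assumes the report was written in kraken2's default six-column format (percentage, clade reads, direct reads, rank code, taxid, name), i.e. without --use-mpa-style, since as far as I know the mpa-style report carries no taxid column. It also assumes fullnamelineage.dmp uses the usual NCBI layout of tax_id, tax_name and lineage separated by tab-pipe-tab. The function names and the sample rows are made up for illustration; the sample lineage is shortened and the counts are invented.

```python
import io


def parse_dmp_line(line):
    """Split one NCBI .dmp row into stripped fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return [field.strip() for field in line.split("\t|\t")]


def load_full_lineages(handle):
    """Return {tax_id: lineage from root to taxon} from fullnamelineage.dmp."""
    lineages = {}
    for line in handle:
        fields = parse_dmp_line(line)
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        tax_id, name, lineage = fields[:3]
        # The lineage column lists ancestors only, so append the taxon itself.
        parts = [p.strip() for p in lineage.split(";") if p.strip()]
        lineages[tax_id] = "; ".join(parts + [name])
    return lineages


def annotate_report(handle, lineages):
    """Yield (tax_id, rank_code, clade_reads, lineage) for each report row."""
    for line in handle:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # not a default six-column report row
        clade_reads, rank_code, tax_id = int(cols[1]), cols[3], cols[4].strip()
        yield tax_id, rank_code, clade_reads, lineages.get(tax_id, "NA")


# Illustrative stand-ins for the two files (shortened lineage, invented counts).
sample_dmp = io.StringIO(
    "9606\t|\tHomo sapiens\t|\tcellular organisms; Eukaryota; \t|\n"
)
sample_report = io.StringIO(" 50.00\t10\t10\tS\t9606\t      Homo sapiens\n")

lineages = load_full_lineages(sample_dmp)
for row in annotate_report(sample_report, lineages):
    print("\t".join(str(value) for value in row))
```

For real data, replace the two StringIO objects with open("fullnamelineage.dmp") and open() on your report file. Taxids missing from the dump come back as "NA", which can happen if the dump and the kraken2 database were built from different taxonomy snapshots.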

rankedlineage.dmp would give you the ranks, but it also omits certain taxonomic levels, such as the clade "delta/epsilon subdivisions".