DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Only eukaryote reads are being classified with nr database #821

Open timghaly opened 2 months ago

timghaly commented 2 months ago

Hey kraken2 team, thanks for this tool.

I am trying to classify reads using the nr database, but am finding ~99% of reads are unclassified, and only eukaryote hits (~1%) are being classified.

I built the nr database using the following command:

unset OMP_NUM_THREADS

kraken2-build \
  --build \
  --protein \
  --db nr \
  --threads 48

I am then attempting to classify with the following:

kraken2 \
  --threads 48 \
  --use-names \
  --report "kraken2/G2677.k2_nr_report.txt" \
  --db ~/databases/kraken2/nr \
  --gzip-compressed \
  --paired \
  "fastp/G2677.trimmed.R1.fq.gz" "fastp/G2677.trimmed.R2.fq.gz" \
  > "kraken2/G2677.k2_nr_out.txt"

I have also run kraken2 on an ONT metagenome with the same command (minus the --paired argument) and get the same results.

I am presuming something went wrong during the kraken2-build stage. Any help would be greatly appreciated.

timghaly commented 2 months ago

Sorry, I forgot to include the commands I used to download the taxonomy and reference library before building the kraken2 nr database. They were the following:

kraken2-build --download-taxonomy --db nr --threads 1

and

kraken2-build --download-library nr --protein --db nr --threads 1

Ayala-Ruan-CesarM commented 2 months ago

@timghaly have you classified your samples with a less complex database? Maybe use the RefSeq indexes and see whether the percentage of unclassified reads is the same. Additionally, to diagnose whether the database build had any errors, it would be worth running "kraken2-inspect --db /path_to_your_db" to see what is actually in there.
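As a rough sketch of that check, the helper below filters a kraken2-inspect report down to its domain-level rows, which makes it easy to see whether bacteria and archaea are in the database at all. It assumes the report is tab-separated with the percentage in column 1, the rank code in column 4 and the indented taxon name in column 6, and that domains carry the rank code D. The function name list_domains is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kraken 2): print the domain-level rows
# of a kraken2-inspect report as "name<TAB>percent".
# Assumed layout: tab-separated, percentage in field 1, rank code in
# field 4, indented taxon name in field 6, header lines starting with '#'.
list_domains() {
  awk -F'\t' '
    /^#/ { next }                  # skip header/comment lines
    $4 == "D" {
      pct = $1
      name = $6
      gsub(/ /, "", pct)           # drop padding around the percentage
      sub(/^ +/, "", name)         # drop the indentation that encodes depth
      printf "%s\t%s%%\n", name, pct
    }
  ' "$@"
}
```

One way to use it would be to save the report first with kraken2-inspect --db /path_to_your_db > inspect.txt and then run list_domains inspect.txt. If only Eukaryota shows up, the problem is in the database rather than in the reads.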

callAgene commented 1 month ago

Hahaha, I just had the exact same problem as you. It bothered me for a week before I found out why, and the reason is extremely stupid!

In the index folder you created, you will see that your hash.k2d file is only 137GB, but unmapped.txt is a whopping 12GB.

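So the problem is obvious: unmapped.txt lists the sequences that could not be matched to a taxid, so a file that large means a big share of the nr proteins never made it into the database.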
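A minimal sketch of that check, assuming unmapped.txt holds one accession per line and sits next to hash.k2d in the database folder. The function name check_unmapped is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kraken 2): count the accessions listed
# in unmapped.txt and show the sizes of the files mentioned above.
# Assumes one accession per line in unmapped.txt.
check_unmapped() {
  local db="$1"
  local f n
  if [ ! -f "$db/unmapped.txt" ]; then
    echo "no unmapped.txt in $db"
    return 0
  fi
  n=$(wc -l < "$db/unmapped.txt")
  echo "unmapped accessions: $((n))"
  # Print human-readable sizes for whichever of the two files exist.
  for f in "$db/hash.k2d" "$db/unmapped.txt"; do
    if [ -f "$f" ]; then
      du -h "$f"
    fi
  done
  return 0
}
```

Running check_unmapped ~/databases/kraken2/nr on the database from this thread should make a multi-gigabyte unmapped.txt hard to miss.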

You need to add the --protein option when downloading the taxonomy (example: kraken2-build --download-taxonomy --db nr --threads 1 --protein).

After adding --protein, the software downloads prot.accession2taxid.gz instead of nucl_gb.accession2taxid and nucl_wgs.accession2taxid.
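To see which case an existing database is in before rebuilding, something like the sketch below could help. It assumes the accession maps live in the taxonomy subfolder of the database and keep their NCBI file names (possibly without the .gz suffix once unpacked). The function name check_accession_maps is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kraken 2): report which accession2taxid
# maps are present under DB/taxonomy.
# Assumes the maps keep their NCBI names, with or without a .gz suffix.
check_accession_maps() {
  local tax="$1/taxonomy"
  if ls "$tax"/prot.accession2taxid* >/dev/null 2>&1; then
    echo "protein map found"
  elif ls "$tax"/nucl_*.accession2taxid* >/dev/null 2>&1; then
    echo "only nucleotide maps found: re-download the taxonomy with --protein"
  else
    echo "no accession2taxid maps found"
  fi
}
```

For the database in this thread, check_accession_maps nr would be expected to report only nucleotide maps, which matches the explanation above.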

timghaly commented 1 month ago

Thanks @callAgene , that is exactly what has happened. Thanks for pointing that out for me!