DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

how many genomes have been included in kraken2 database #749

Closed ZhangDengwei closed 6 months ago

ZhangDengwei commented 10 months ago

Hi,

I have downloaded and built the kraken2 database with kraken2-build: first --download-taxonomy, then the library downloads as follows:

kraken2-build --db krakendb --download-library bacteria 
kraken2-build --db krakendb --download-library archaea 
kraken2-build --db krakendb --download-library viral 
kraken2-build --db krakendb --download-library fungi 

Taking bacteria as an example, the following files were generated in the bacteria folder:

-r--r--r--. 1 dwzhang dwzhang     81768478 Nov 11  2022 assembly_summary.txt
-rw-rw-r--. 1 dwzhang dwzhang     15676553 Jun 13 16:55 library.dict
lrwxrwxrwx. 1 dwzhang dwzhang           11 Jun 13 16:33 library.fa -> library.fna
-rw-rw-r--. 1 dwzhang dwzhang      4776687 Jun 13 16:47 library.fa.fai
-rw-rw-r--. 1 dwzhang dwzhang 146797204613 Nov 12  2022 library.fna
-rw-rw-r--. 1 dwzhang dwzhang            0 Nov 12  2022 library.fna.masked
-rw-rw-r--. 1 dwzhang dwzhang      3179159 Nov 11  2022 manifest.txt
-rw-rw-r--. 1 dwzhang dwzhang      3692818 Nov 12  2022 prelim_map.txt

I have gone through some previous posts and reviewed the kraken2 paper, and found that only complete genomes are downloaded this way. assembly_summary.txt lists 264,821 genomes in total, but only 29,967 are "Complete Genome". I understand that draft genomes might be contaminated, as noted in the kraken paper. Still, I wonder whether including only complete genomes would strongly influence the taxonomic annotation, since some bacteria are still uncultured, especially in the human gut.

Besides, I checked manifest.txt, which contains 34,573 genomes. Are the downloaded genomes the ones listed in manifest.txt? If so, some of them are "Chromosome" rather than "Complete Genome", for instance:

GCF_001278805.1 PRJNA224116     SAMN04009150            na      1705566 1705566 Bacillus sp. FJAT-18017 strain=FJAT-18017               latest  Chromosome      Major   Full    2015/09/04      ASM127880v1     Fujian Academy of Agricultural Sciences    GCA_001278805.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/278/805/GCF_001278805.1_ASM127880v1                    na

Lastly, I attempted to count the genomes in library.fna, but one genome may contain multiple contigs, which makes it hard to count the genomes directly.
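
One way to check these counts without parsing library.fna is to tally the assembly_level column of assembly_summary.txt and to collect the distinct assembly accessions named in manifest.txt. The sketch below is only an illustration under assumptions: it assumes assembly_summary.txt is tab-separated with '#' header lines and assembly_level as the 12th field (as in the "Chromosome" row quoted above), and that each manifest.txt line contains an accession such as GCF_001278805.1 somewhere in its path. The function names are hypothetical and not part of kraken2.

```python
import re
import sys
from collections import Counter

# Assumption: assembly_level is the 12th tab-separated field (0-based index 11).
ASSEMBLY_LEVEL_COL = 11

# Assumption: accessions look like GCF_001278805.1 or GCA_001278805.1.
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def count_assembly_levels(lines):
    """Tally assembly_level values, skipping '#' header lines and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > ASSEMBLY_LEVEL_COL:
            counts[fields[ASSEMBLY_LEVEL_COL]] += 1
    return counts


def manifest_accessions(lines):
    """Return the set of distinct assembly accessions found in manifest lines."""
    accessions = set()
    for line in lines:
        match = ACCESSION_RE.search(line)
        if match:
            accessions.add(match.group(0))
    return accessions


# Usage: python count_genomes.py assembly_summary.txt manifest.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as summary:
        for level, n in count_assembly_levels(summary).most_common():
            print(f"{n}\t{level}")
    with open(sys.argv[2], encoding="utf-8") as manifest:
        print(f"{len(manifest_accessions(manifest))}\tgenomes in manifest")
```

If the number of "Complete Genome" plus "Chromosome" rows comes out close to the manifest count, that would suggest both assembly levels are downloaded, which would be consistent with the 34,573 versus 29,967 gap described above. This is an inference to verify, not a documented guarantee.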

jenniferlu717 commented 6 months ago

Only complete genomes are included by default. If you want to download non-complete genomes, you can try using the krakenuniq-download scripts: https://github.com/fbreitwieser/krakenuniq

However, yes, we only suggest using complete genomes due to contamination. The genome representation is sufficient for most sample sets. If something is in your sample that is not represented, you will likely get a large fraction of unclassified reads.
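
To see whether a sample is poorly represented in the database, the unclassified fraction can be read from the report written by kraken2 --report. The following is a minimal sketch, assuming the standard six-column tab-separated report (percentage, clade reads, taxon reads, rank code, taxid, name) in which the unclassified line carries rank code "U". The function name is hypothetical, and reports produced with extra columns (for example minimizer data) would need the column index adjusted.

```python
import sys


def unclassified_percent(report_lines):
    """Return the percentage of unclassified reads from a kraken2 report.

    Assumes the standard six-column layout, where the rank code is the
    4th tab-separated field and 'U' marks the unclassified line.
    Returns 0.0 if no such line is present.
    """
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and fields[3].strip() == "U":
            return float(fields[0])
    return 0.0


# Usage: python unclassified.py sample.kreport
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as report:
        print(f"unclassified: {unclassified_percent(report):.2f}%")
```

A high value here is only a hint that some organisms in the sample lack a close reference; it does not say which ones.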