how many genomes have been included in kraken2 database

Hi,

I have downloaded and built the kraken2 database, starting with kraken2-build --download-taxonomy and then downloading the libraries as follows:

kraken2-build --db krakendb --download-library bacteria 
kraken2-build --db krakendb --download-library archaea 
kraken2-build --db krakendb --download-library viral 
kraken2-build --db krakendb --download-library fungi

Taking bacteria as an example, the following files were generated in the bacteria folder:

-r--r--r--. 1 dwzhang dwzhang     81768478 Nov 11  2022 assembly_summary.txt
-rw-rw-r--. 1 dwzhang dwzhang     15676553 Jun 13 16:55 library.dict
lrwxrwxrwx. 1 dwzhang dwzhang           11 Jun 13 16:33 library.fa -> library.fna
-rw-rw-r--. 1 dwzhang dwzhang      4776687 Jun 13 16:47 library.fa.fai
-rw-rw-r--. 1 dwzhang dwzhang 146797204613 Nov 12  2022 library.fna
-rw-rw-r--. 1 dwzhang dwzhang            0 Nov 12  2022 library.fna.masked
-rw-rw-r--. 1 dwzhang dwzhang      3179159 Nov 11  2022 manifest.txt
-rw-rw-r--. 1 dwzhang dwzhang      3692818 Nov 12  2022 prelim_map.txt

I have gone through some previous posts and the kraken2 paper, and found that only complete genomes are downloaded this way. assembly_summary.txt lists 264,821 genomes in total, but only 29,967 of them are "Complete Genome". I understand that draft genomes may be contaminated, as noted in the kraken paper. Still, I wonder whether including only complete genomes substantially affects the taxonomic annotation, since some bacteria are still uncultured, especially those in the human gut.

In addition, I checked manifest.txt, which contains 34,573 genomes. Are the downloaded genomes the ones listed in manifest.txt? If so, some of them are labeled "Chromosome" rather than "Complete Genome", for instance:

GCF_001278805.1 PRJNA224116     SAMN04009150            na      1705566 1705566 Bacillus sp. FJAT-18017 strain=FJAT-18017               latest  Chromosome      Major   Full    2015/09/04      ASM127880v1     Fujian Academy of Agricultural Sciences    GCA_001278805.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/278/805/GCF_001278805.1_ASM127880v1                    na
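For anyone who wants to reproduce these numbers, here is a minimal sketch that tallies the assembly levels in assembly_summary.txt. It assumes the usual tab-separated NCBI layout in which the 12th column is the assembly level (as in the example line above) and comment lines start with "#". The function name `count_assembly_levels` is just an illustrative choice, not part of kraken2.

```python
import sys
from collections import Counter


def count_assembly_levels(lines, level_col=11):
    """Tally the assembly-level column of an assembly_summary.txt file.

    Assumes tab-separated records with the assembly level in the
    12th column (0-based index 11) and '#'-prefixed comment lines.
    """
    counts = Counter()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > level_col:
            counts[fields[level_col]] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_levels.py assembly_summary.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for level, n in count_assembly_levels(handle).most_common():
            print(f"{level}\t{n}")
```

Comparing the per-level counts with the number of entries in manifest.txt should show which assembly levels would have to be included to reach that total.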

Lastly, I tried to count the genomes in library.fna, but a single genome may consist of multiple contigs, which makes it hard to count the genomes directly.
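To illustrate the problem, here is a rough sketch that counts sequence records and distinct taxids in library.fna. It assumes the headers have the form ">kraken:taxid|<taxid>|<accession> description", which is an assumption about the file layout rather than something confirmed here, and `summarize_library` is a hypothetical helper name. Neither number is the genome count: the record count is inflated by plasmids and contigs, while the taxid count is only a lower bound because several genomes can share one taxid.

```python
import sys


def summarize_library(lines):
    """Return (number of sequence records, number of distinct taxids).

    Assumes FASTA headers shaped like
    '>kraken:taxid|<taxid>|<accession> description'.
    Headers that do not match are still counted as records.
    """
    n_sequences = 0
    taxids = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        n_sequences += 1
        tokens = line[1:].split()
        if not tokens:
            continue  # bare '>' with no identifier
        parts = tokens[0].split("|")
        if len(parts) >= 3 and parts[0] == "kraken:taxid":
            taxids.add(parts[1])
    return n_sequences, len(taxids)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_library.py library.fna
    with open(sys.argv[1], encoding="utf-8") as handle:
        n_seq, n_tax = summarize_library(handle)
    print(f"sequence records: {n_seq}")
    print(f"distinct taxids:  {n_tax}")
```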

DerrickWood / kraken2

how many genomes have been included in kraken2 database #749