DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Missing species in inspect file after database build #853

Open ksavhughes opened 1 month ago

ksavhughes commented 1 month ago

Hi Kraken2 developers/community,

I recently built a large Kraken2 database from genomes in the NCBI RefSeq database. I included genomes regardless of assembly level and limited the set to one assembly per species. After some testing, I found that some species appear to be missing from the database. Can someone tell me why this is?

To make sure everything was added correctly after the build, I ran the inspect command and compared the taxids in the seqid2taxid.map file to those in the inspect file. There are 897 taxids in seqid2taxid.map that are not present in the inspect file. I did some digging, and it seems those species were not added because they don't have any unique minimizers. Can anyone confirm this?
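The comparison described above can be sketched in Python roughly as follows. This is a minimal sketch, not the exact procedure used for the numbers in this post. It assumes that seqid2taxid.map has the sequence ID and taxid as its first two whitespace-separated columns, and that the inspect file is tab-separated with the taxid in the fifth column and header lines starting with `#`. The function names and file paths are hypothetical.

```python
def taxids_from_map(path):
    """Collect taxids from the second column of a seqid2taxid.map file."""
    taxids = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                taxids.add(fields[1])
    return taxids


def taxids_from_inspect(path):
    """Collect taxids from an inspect file.

    Assumed layout (tab-separated): percentage, clade minimizer count,
    taxon minimizer count, rank code, taxid, name.
    """
    taxids = set()
    with open(path) as handle:
        for line in handle:
            # Skip assumed '#' header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 6:
                taxids.add(fields[4].strip())
    return taxids


def missing_taxids(map_path, inspect_path):
    """Return taxids present in the map file but absent from the inspect file."""
    missing = taxids_from_map(map_path) - taxids_from_inspect(inspect_path)
    return sorted(missing, key=int)
```

Calling `missing_taxids("seqid2taxid.map", "inspect.txt")` with your own paths returns the taxids to investigate, sorted numerically.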

This is the number of species missing per NCBI division:

- 598 Bacteria
- 37 Invertebrates
- 16 Phages
- 206 Plants and Fungi
- 1 Rodents
- 36 Vertebrates
- 3 Viruses

Notes from further investigating particular missing species

slw287r commented 1 month ago

You can rerun kraken2's inspect subcommand with the --report-zero-counts option so that taxids without unique minimizers are written to the inspect file along with the others.
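Once the inspect file has been regenerated with --report-zero-counts, the zero-count entries can be pulled out with a short script. The following is a rough sketch under the assumption that the file is tab-separated with the clade minimizer count in the second column, the taxid in the fifth, and the name in the sixth, and that header lines start with `#`. The function name and file path are hypothetical.

```python
def zero_minimizer_taxa(inspect_path):
    """Return (taxid, name) pairs whose clade-level minimizer count is 0.

    Assumed layout (tab-separated): percentage, clade minimizer count,
    taxon minimizer count, rank code, taxid, name.
    """
    result = []
    with open(inspect_path) as handle:
        for line in handle:
            # Skip assumed '#' header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue
            if int(fields[1]) == 0:
                # Names are indented by taxonomic depth, so strip them.
                result.append((fields[4].strip(), fields[5].strip()))
    return result
```

The taxids returned by `zero_minimizer_taxa("inspect_zero_counts.txt")` can then be checked against the list of taxids that were missing from the original inspect file.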