Hi Kraken2 developers/community,
I recently built a large Kraken2 database from genomes in the NCBI RefSeq database. I added genomes regardless of assembly level and limited it to one assembly per species. After some testing, I discovered that some species seem to be missing from the database. Can someone tell me why this is?
To make sure everything was added correctly after the build, I ran the inspect command and compared the taxids in the seqid2taxid.map file with those in the inspect output. There are 897 taxids in seqid2taxid.map that are not present in the inspect output. After some digging, it seems these species were not added because they have no unique minimizers. Can anyone confirm this?
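For anyone wanting to reproduce the comparison, here is a minimal Python sketch. It assumes that seqid2taxid.map is tab-separated with the taxid in the second column, and that the inspect output uses the six-column tab-separated report layout with the taxid in the fifth column and optional header lines starting with `#`. The function names are hypothetical and the toy data below is made up for illustration only.

```python
import io


def taxids_from_map(handle):
    """Collect taxids from a seqid2taxid.map-style file (assumed: seqid<TAB>taxid)."""
    taxids = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip():
            taxids.add(fields[1].strip())
    return taxids


def taxids_from_inspect(handle):
    """Collect taxids from inspect output (assumed: taxid is the 5th tab-separated column)."""
    taxids = set()
    for line in handle:
        # Skip blank lines and '#' header lines, if present.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            taxids.add(fields[4].strip())
    return taxids


def missing_taxids(map_handle, inspect_handle):
    """Return taxids listed in the map file but absent from the inspect output."""
    missing = taxids_from_map(map_handle) - taxids_from_inspect(inspect_handle)
    return sorted(missing, key=int)


# Toy data only; real files would be opened with open(path).
toy_map = io.StringIO("seqA\t10\nseqB\t20\nseqC\t20\nseqD\t30\n")
toy_inspect = io.StringIO(
    "# hypothetical header line\n"
    "100.00\t50\t0\tR\t1\troot\n"
    " 60.00\t30\t30\tS\t10\t  species A\n"
    " 40.00\t20\t20\tS\t30\t  species C\n"
)
print(missing_taxids(toy_map, toy_inspect))
```

With real files, the two `io.StringIO` objects would be replaced by handles from `open("seqid2taxid.map")` and `open("inspect.txt")` (the second file name is an assumption).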
This is the number of species missing per NCBI division:
598 Bacteria
37 Invertebrates
16 Phages
206 Plants and Fungi
1 Rodents
36 Vertebrates
3 Viruses
Notes from further investigation of particular missing species:
Invertebrates - Acropora genus
6 missing species in genus (all mitos)
16 species in database (2 wg, 14 mito)
Rodents - Mus musculus domesticus
added to seqid2taxid.map properly
missing 1 mito (domesticus)
the mito in the inspect file with the fewest minimizers has 19 minimizers
In DB: Mus musculus has wg and 3 mitos from subspecies other than domesticus
Vertebrates:
Kali and Betta genera
kept 1 mitogenome from 1 species in genus (Betta - also 1 wg)
Some hybrids removed
Vipera berus - mito removed - no other genomes in genus, but other genomes in family
You can rerun Kraken2's inspect command with the --report-zero-counts option so that taxids without unique minimizers are written to the inspect output along with the others.
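Once the report has been regenerated with --report-zero-counts, the taxids of interest can be filtered out of it. The following is a minimal Python sketch, not an official tool. It assumes the six-column tab-separated report layout, with the number of minimizers assigned directly to the taxon in the third column, the taxid in the fifth, and the name in the sixth. The function name and toy data are hypothetical.

```python
import io


def zero_minimizer_taxa(inspect_handle, taxids_of_interest):
    """List (taxid, name) pairs with zero directly assigned minimizers.

    Assumed column layout: percent, clade minimizers, direct minimizers,
    rank code, taxid, name (tab-separated).
    """
    hits = []
    for line in inspect_handle:
        # Skip blank lines and '#' header lines, if present.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        direct_minimizers = int(fields[2])
        taxid = fields[4].strip()
        if direct_minimizers == 0 and taxid in taxids_of_interest:
            hits.append((taxid, fields[5].strip()))
    return hits


# Toy report only; a real one would come from the rerun inspect output.
toy_report = io.StringIO(
    "100.00\t50\t0\tR\t1\troot\n"
    " 60.00\t30\t30\tS\t10\t  species A\n"
    "  0.00\t0\t0\tS\t20\t  species B\n"
)
print(zero_minimizer_taxa(toy_report, {"10", "20"}))
```

Passing the set of taxids from seqid2taxid.map as `taxids_of_interest` restricts the result to taxa that were actually added to the library, so internal nodes such as the root are not reported.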