DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Low minimizer count in prebuilt NT Database leading to unclassified reads #841

Open remondulos opened 2 weeks ago

remondulos commented 2 weeks ago

I am using the prebuilt NT database (source) to classify paired reads with Kraken2. The output shows many unclassified reads for a specific genome (Clostridium phage C2, NCBI link). However, when I use a custom-built database (bacterial + viral + the custom genome from NCBI), many reads match this species.

Using kraken2-inspect, I noticed that the prebuilt NT database has only one minimizer mapped to this species:

  0.00        1        1  S   2999103  Clostridium phage C2

In contrast, my custom-built database has 16830 minimizers mapped to the same species:

  0.00    16830    16830  S   2999103  Clostridium phage C2
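For anyone who wants to reproduce this comparison, here is a minimal Python sketch that pulls the minimizer counts for one taxid out of kraken2-inspect output. It assumes the default six-column layout (percentage, clade minimizers, direct minimizers, rank, taxid, name); the function name `minimizers_for_taxid` is hypothetical, and the two input lines are the ones quoted above.

```python
def minimizers_for_taxid(lines, taxid):
    """Return (clade_minimizers, direct_minimizers) for taxid, or None if absent.

    Assumes each line has six whitespace-separated columns:
    percent, clade count, direct count, rank, taxid, name.
    """
    for line in lines:
        # maxsplit=5 keeps the taxon name (which contains spaces) in one field.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        _pct, clade, direct, _rank, tid, _name = fields
        if tid == str(taxid):
            return int(clade), int(direct)
    return None


# The two kraken2-inspect lines from this issue.
nt_lines = ["0.00 1 1 S 2999103 Clostridium phage C2"]
custom_lines = ["0.00 16830 16830 S 2999103 Clostridium phage C2"]

nt_counts = minimizers_for_taxid(nt_lines, 2999103)
custom_counts = minimizers_for_taxid(custom_lines, 2999103)
print("NT database:    ", nt_counts)
print("custom database:", custom_counts)
```

In practice the `lines` argument could be an open file holding the saved kraken2-inspect output for each database.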

This is the report from the NT database:

 90.51  3793657    18286  R   1        root
 69.67  2919920     1862  D   10239      Viruses
 69.58  2916377        0  D1  2731341      Duplodnaviria
 69.58  2916377        0  K   2731360        Heunggongvirae
 69.58  2916377        0  P   2731618          Uroviricota
 69.58  2916377       82  C   2731619            Caudoviricetes
 69.55  2915125  2021266  C1  2788787              unclassified Caudoviricetes
  0.00        3        3  S   2999103                Clostridium phage C2

And this is the report from the custom database:

 86.87  3640865      985  R   1        root
 66.93  2805234      247  D   10239      Viruses
 66.92  2804659        0  D1  2731341      Duplodnaviria
 66.92  2804659        7  K   2731360        Heunggongvirae
 66.91  2804480        0  P   2731618          Uroviricota
 66.91  2804480        6  C   2731619            Caudoviricetes
 66.88  2803116        0  C1  2788787              unclassified Caudoviricetes
 66.87  2802792  2802792  S   2999103                Clostridium phage C2
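To make the difference between the two reports easier to see, here is a rough Python sketch that parses the last three rows of each report and prints, per taxid, how many reads were assigned directly at that node. It assumes the standard six-column `--report` layout (percentage, clade reads, direct reads, rank, taxid, name); `parse_report` is a hypothetical helper name, and the embedded rows are copied from the reports above.

```python
NT_REPORT = """\
 69.58  2916377 82  C   2731619           Caudoviricetes
 69.55  2915125 2021266 C1  2788787             unclassified Caudoviricetes
  0.00  3   3   S   2999103               Clostridium phage C2
"""

CUSTOM_REPORT = """\
 66.91  2804480 6   C   2731619           Caudoviricetes
 66.88  2803116 0   C1  2788787             unclassified Caudoviricetes
 66.87  2802792 2802792 S   2999103               Clostridium phage C2
"""


def parse_report(text):
    """Map taxid -> dict with clade reads, direct reads, rank and name."""
    rows = {}
    for line in text.splitlines():
        # maxsplit=5 keeps multi-word taxon names in a single field.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        _pct, clade, direct, rank, taxid, name = fields
        rows[int(taxid)] = {
            "clade": int(clade),
            "direct": int(direct),
            "rank": rank,
            "name": name.strip(),
        }
    return rows


nt = parse_report(NT_REPORT)
custom = parse_report(CUSTOM_REPORT)

# Compare direct assignments at the species and at its parent node.
for taxid in (2788787, 2999103):
    print(
        f"{nt[taxid]['name']:<30} "
        f"NT direct={nt[taxid]['direct']:>8}  "
        f"custom direct={custom[taxid]['direct']:>8}"
    )
```

With the numbers from this issue, the NT run assigns most of these reads directly to the parent node (unclassified Caudoviricetes) rather than to the species, while the custom run assigns them directly to Clostridium phage C2.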

Could the low number of minimizers be causing the many unclassified reads for this genome when using the NT database? Also, why is there only one minimizer mapping to the genome?