I am using the prebuilt NT database (source) to classify paired-end reads with Kraken2. The output shows many unclassified reads for a specific genome (Clostridium phage C2, NCBI Link).
However, when using a custom-built database (bacterial + viral + the custom genome from NCBI), many reads match this species.
Using kraken2-inspect, I noticed that the prebuilt NT database has only one minimizer mapping for this species:
0.00 1 1 S 2999103 Clostridium phage C2
In contrast, my custom-built database has 16830 minimizers mapped to the same species:
0.00 16830 16830 S 2999103 Clostridium phage C2
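To make the comparison concrete, here is a minimal Python sketch that parses the two kraken2-inspect lines quoted above and compares their minimizer counts. It assumes the six whitespace-separated columns are percentage, clade-level count, directly assigned count, rank code, taxid, and name; the function name `parse_line` is only for illustration.

```python
def parse_line(line):
    """Split one kraken2-inspect (or report) line into its six fields."""
    # maxsplit=5 keeps a taxon name that contains spaces in one piece
    pct, clade, direct, rank, taxid, name = line.split(None, 5)
    return {
        "pct": float(pct),
        "clade": int(clade),
        "direct": int(direct),
        "rank": rank,
        "taxid": int(taxid),
        "name": name.strip(),
    }


# The two lines quoted in this post
nt = parse_line("0.00 1 1 S 2999103 Clostridium phage C2")
custom = parse_line("0.00 16830 16830 S 2999103 Clostridium phage C2")

# How many times more minimizers the custom database holds for this taxid
ratio = custom["clade"] / nt["clade"]
print(f"{nt['name']} (taxid {nt['taxid']}): "
      f"NT={nt['clade']}, custom={custom['clade']}, ratio={ratio:.0f}x")
```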
This is the report from the NT database:
90.51 3793657 18286 R 1 root
69.67 2919920 1862 D 10239 Viruses
69.58 2916377 0 D1 2731341 Duplodnaviria
69.58 2916377 0 K 2731360 Heunggongvirae
69.58 2916377 0 P 2731618 Uroviricota
69.58 2916377 82 C 2731619 Caudoviricetes
69.55 2915125 2021266 C1 2788787 unclassified Caudoviricetes
0.00 3 3 S 2999103 Clostridium phage C2
And this is the report from the custom database:
86.87 3640865 985 R 1 root
66.93 2805234 247 D 10239 Viruses
66.92 2804659 0 D1 2731341 Duplodnaviria
66.92 2804659 7 K 2731360 Heunggongvirae
66.91 2804480 0 P 2731618 Uroviricota
66.91 2804480 6 C 2731619 Caudoviricetes
66.88 2803116 0 C1 2788787 unclassified Caudoviricetes
66.87 2802792 2802792 S 2999103 Clostridium phage C2
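To see where the reads end up in each case, the two reports can be compared by their directly assigned read counts (third column). This is a rough sketch that assumes the same six-column layout as above; `parse_report` and `top_direct` are illustrative helper names, not part of Kraken2.

```python
NT_REPORT = """\
90.51 3793657 18286 R 1 root
69.67 2919920 1862 D 10239 Viruses
69.58 2916377 0 D1 2731341 Duplodnaviria
69.58 2916377 0 K 2731360 Heunggongvirae
69.58 2916377 0 P 2731618 Uroviricota
69.58 2916377 82 C 2731619 Caudoviricetes
69.55 2915125 2021266 C1 2788787 unclassified Caudoviricetes
0.00 3 3 S 2999103 Clostridium phage C2
"""

CUSTOM_REPORT = """\
86.87 3640865 985 R 1 root
66.93 2805234 247 D 10239 Viruses
66.92 2804659 0 D1 2731341 Duplodnaviria
66.92 2804659 7 K 2731360 Heunggongvirae
66.91 2804480 0 P 2731618 Uroviricota
66.91 2804480 6 C 2731619 Caudoviricetes
66.88 2803116 0 C1 2788787 unclassified Caudoviricetes
66.87 2802792 2802792 S 2999103 Clostridium phage C2
"""


def parse_report(text):
    """Parse report lines into dicts; assumes six whitespace-separated columns."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade, direct, rank, taxid, name = line.split(None, 5)
        rows.append({
            "pct": float(pct),
            "clade": int(clade),
            "direct": int(direct),
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),
        })
    return rows


def top_direct(rows):
    """Return the taxon with the most directly assigned reads."""
    return max(rows, key=lambda r: r["direct"])


nt_top = top_direct(parse_report(NT_REPORT))
custom_top = top_direct(parse_report(CUSTOM_REPORT))

for label, top in (("NT", nt_top), ("custom", custom_top)):
    print(f"{label}: most direct reads at {top['name']} "
          f"(rank {top['rank']}, taxid {top['taxid']}): {top['direct']}")
```

In the excerpts shown here, the largest direct count sits at the C1 level (taxid 2788787) in the NT report and at the species level (taxid 2999103) in the custom report.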
Could the low number of minimizers be causing the many unclassified reads for this genome when using the NT database? Also, why is there only one minimizer mapping to the genome?