Closed rosscrowhurst closed 7 months ago
This seems to happen with every genome I have tested: problems with Kraken2
According to kraken2 the following contigs are classified as bacterial:
| Contig | Size (bp) | Key | NCBI BLASTn vs nt (discontinuous megablast) best hit species |
| --- | --- | --- | --- |
| contig_257_np12 | 2477 | P | Ilex aquifolium genome assembly, chromosome: 8 |
| contig_538_np12 | 112238 | P | Ilex aquifolium genome assembly, chromosome: 1 |
| contig_1022_np12 | 2095 | P | Gossypioides kirkii chromosome KI_11 |
| contig_259_np12 | 1185 | P | Gossypioides kirkii |
| contig_537_np12 | 1163 | P | Gossypium herbaceum |
| contig_69_np12 | 1004 | P | Gossypioides kirkii |
| contig_68_np12 | 1001 | N | No significant hit |
| contig_65_np12 | 546 | N | No significant hit |
| contig_861_np12 | 4855 | N | No significant hit |
| contig_254_np12 | 1614 | N | No significant hit |
| contig_119_np12 | 9447 | B | Klebsiella quasipneumoniae |
| contig_988_np12* | 2477 | B | Klebsiella quasipneumoniae |

KEY:
- N = No significant hit
- B = Bacterial
- P = Plant
* fcs_gx reported this as viral
For the sequences classified by kraken2 as bacterial:
Kraken2 classification success rate = 16.67%
Kraken2 classification failure rate = 83.33%
According to kraken2 the following 23 contigs were classified as having no hits:
| Contig | Size (bp) | Key | NCBI BLASTn vs nt (discontinuous megablast) best hit species |
| --- | --- | --- | --- |
| contig_138_np12 | 6919 | N | No significant hit |
| contig_422_np12 | 3997 | P | Fraxinus pennsylvanica genome assembly, chromosome: 22 |
| contig_184_np12 | 3222 | P | Buxus sempervirens genome assembly, chromosome: 4 |
| contig_79_np12 | 2689 | N | No significant hit |
| contig_955_np12 | 2050 | P | Camellia sinensis small nucleolar RNA Z101 (LOC114308532), ncRNA |
| contig_432_np12 | 1991 | P | Camellia sinensis small nucleolar RNA Z101 (LOC114308532), ncRNA |
| contig_400_np12 | 1586 | N | No significant hit |
| contig_388_np12 | 1543 | N | No significant hit |
| contig_418_np12 | 1461 | N | No significant hit |
| contig_407_np12 | 1287 | N | No significant hit |
| contig_185_np12 | 1189 | N | No significant hit |
| contig_236_np12 | 1185 | P | Actinidia chinensis DNA, Y-specific genomic marker third one |
| contig_864_np12 | 1004 | N | No significant hit |
| contig_58_np12 | 955 | P | Buxus sempervirens genome assembly, chromosome: 4 |
| contig_322_np12 | 892 | N | No significant hit |
| contig_865_np12 | 865 | N | No significant hit |
| contig_285_np12 | 739 | N | No significant hit |
| contig_325_np12 | 575 | N | No significant hit |
| contig_713_np12 | 558 | P | Actinidia chinensis var. chinensis cultivar 4x chromosome 1 mitochondrion, complete sequence |
| contig_711_np12 | 553 | P | Actinidia chinensis var. chinensis cultivar 4x chromosome 1 mitochondrion |
| contig_756_np12 | 531 | P | Actinidia chinensis DNA, Y-specific genomic marker eighth one |
| contig_937_np12 | 496 | P | Gossypioides kirkii chromosome KI_2_4 |
| contig_57_np12 | 496 | I | Ophion luteus genome assembly, chromosome: 7 |

KEY:
- N = No significant hit
- B = Bacterial
- I = Insect
- P = Plant
For the sequences classified by kraken2 as no hit:
Kraken2 classification success rate = 52.17% (12 of 23)
Kraken2 classification failure rate = 47.83% (11 of 23)
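
For anyone wanting to re-check the percentages, here is a minimal sketch of the arithmetic. It assumes "success" means the BLASTn key agrees with the Kraken2 call; the key counts are transcribed by hand from the two tables above, and the function name `agreement_rate` is made up for illustration.

```python
from collections import Counter


def agreement_rate(blast_keys, expected):
    """Percentage of contigs whose BLASTn key matches the Kraken2 call."""
    counts = Counter(blast_keys)
    return 100.0 * counts[expected] / len(blast_keys)


# BLASTn keys transcribed by hand from the two tables (assumption: no typos).
called_bacterial = ["P"] * 6 + ["N"] * 4 + ["B"] * 2   # 12 contigs
called_no_hit = ["N"] * 12 + ["P"] * 10 + ["I"] * 1    # 23 contigs

bacterial_rate = agreement_rate(called_bacterial, "B")
no_hit_rate = agreement_rate(called_no_hit, "N")

print(f"bacterial: {bacterial_rate:.2f}% agree, {100 - bacterial_rate:.2f}% disagree")
print(f"no hit:    {no_hit_rate:.2f}% agree, {100 - no_hit_rate:.2f}% disagree")
```

This gives 16.67% / 83.33% for the contigs called bacterial and 52.17% / 47.83% for the contigs called no hit.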
@GallVp Maybe the database Kraken2 is using does not include all the sequences it should?
@GallVp Incidentally, if the small contigs were culled, most of these anomalies would not be detected, so I wonder whether Kraken2 was set up and tested only with chromosome-level sequences.
// To select a DB, see https://benlangmead.github.io/aws-indexes/k2
// The pipeline automatically downloads the required DB if needed
//
// Using PlusPFP: archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa, fungi & plant
db_url = "https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20230314.tar.gz"
We are using their largest database: PlusPFP (108 GB).
Possible reason for the Homo sapiens calls: I took a couple of random contigs that Kraken2 classifies as "Homo sapiens" and ran BLASTn on them against an older copy of the GenBank genomes database that I have locally. So far, 3 randomly selected contigs return similar results: they hit the Homo sapiens genome, but the hits are very short regions (about 40 bases) and are all microsatellite-like sequences.
Such hits are meaningless in this context. A possible way to reduce spurious Kraken2 classifications would be to run it on a hard-masked genome in which the hypervariable microsatellite-like sequences are masked with N. So the suggestion is: before Kraken2, hard-mask the microsatellites, then run Kraken2 on the masked genome sequence.
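
To make the masking suggestion concrete, here is a rough sketch of what hard-masking could look like on a single sequence string. It is not part of the pipeline: the function name and the thresholds (`max_unit`, `min_copies`, `min_len`) are assumptions chosen for illustration, it only catches perfect tandem repeats in uppercase sequence, and a dedicated repeat finder would likely do a better job on real assemblies.

```python
import re


def hard_mask_microsatellites(seq, max_unit=6, min_copies=5, min_len=12):
    """Replace perfect tandem repeats (unit of 1..max_unit bases) with N.

    Hypothetical helper: the thresholds are illustrative, not tuned values.
    """
    # A short unit, captured lazily, followed by at least (min_copies - 1)
    # further exact copies of the same unit.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))

    def mask(match):
        run = match.group(0)
        # Only mask runs long enough to matter; leave shorter ones untouched.
        return "N" * len(run) if len(run) >= min_len else run

    return pattern.sub(mask, seq)


# Toy sequence: a 20-base (AG)n microsatellite between two unique flanks.
toy = "ACGTTGCC" + "AG" * 10 + "TCGATCGGAT"
masked = hard_mask_microsatellites(toy)
print(masked)  # ACGTTGCCNNNNNNNNNNNNNNNNNNNNTCGATCGGAT
```

The masked sequence keeps its original length, so contig coordinates stay valid when the Kraken2 results are compared against the unmasked assembly.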
@GallVp @CeciliaDeng @christinawu2008
Kraken2 uses a k-mer-based algorithm. Maybe someone has investigated how the accuracy of its results depends on the k-mer size parameter.
We may be able to create a taxonomic classification plot using Krona from the NCBI FCS GX results.
Assessment of CK6901M in assembly_qc with Kraken2 turned on identified 62 contigs as "Homo sapiens". A random selection of 5 contigs BLASTed at NCBI returned no hit or timed out for 3, and homology to Actinidia for 2. The ones with no hit should have been in Kraken2's no-hit cluster. The ones with homology to Actinidia should not have been identified as "Homo sapiens".
I will check more, but this throws doubt on Kraken2 as a reliable tool. A warning about its questionable results should be included (or it should be removed?).
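
As a toy illustration of why k-mer size matters for short repeat hits, the sketch below counts the distinct k-mers in a 40-base dinucleotide repeat at k = 35, which is the Kraken2 default k-mer length for nucleotide databases. It deliberately ignores minimizers, spaced seeds and canonical (reverse-complement) k-mers, so it only approximates what Kraken2 actually does; `distinct_kmers` is a made-up helper name.

```python
def distinct_kmers(seq, k=35):
    """Set of distinct k-length substrings of seq (no canonicalisation)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


microsat = "AG" * 20                    # 40-base (AG)n microsatellite
windows = len(microsat) - 35 + 1        # number of 35-mer windows
distinct = len(distinct_kmers(microsat))

print(windows)   # 6
print(distinct)  # 2
```

A 40-base microsatellite therefore contributes only two distinct 35-mers, and any genome in the database that carries the same repeat shares both of them, which is one plausible way such regions could pull a plant contig towards an unrelated taxon.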