Plant-Food-Research-Open / assemblyqc

A Nextflow pipeline for evaluating assembly quality
https://plant-food-research-open.github.io/assemblyqc/
MIT License

Include a warning about kraken2 reliability? #58

Closed rosscrowhurst closed 7 months ago

rosscrowhurst commented 1 year ago

Assessing CK6901M in assembly_qc with Kraken2 turned on identified 62 contigs as 'Homo sapiens'. I BLASTed a random selection of 5 of these contigs at NCBI: 3 returned no hit or timed out, and 2 showed homology to Actinidia. The ones with no hit should have been in Kraken2's No Hit cluster. The ones with homology to Actinidia should not have been identified as 'Homo sapiens'.

Will check more, but this casts doubt on Kraken2 as a reliable tool, and a warning should be included about its questionable results (or should it be removed?)

rosscrowhurst commented 1 year ago

https://doi.org/10.1099/mgen.0.000949

christinawu2008 commented 1 year ago

This seems to happen with every genome I have tested... problems with Kraken2.

rosscrowhurst commented 1 year ago

According to kraken2 the following contigs are classified as bacterial:

CONTIG              Size     Key   NCBI Blastn vs nt (discontinuous megablast) best hit species
contig_257_np12     2477     P     Ilex aquifolium genome assembly, chromosome: 8
contig_538_np12     112238   P     Ilex aquifolium genome assembly, chromosome: 1
contig_1022_np12    2095     P     Gossypioides kirkii chromosome KI_11
contig_259_np12     1185     P     Gossypioides kirkii
contig_537_np12     1163     P     Gossypium herbaceum
contig_69_np12      1004     P     Gossypioides kirkii
contig_68_np12      1001     N     No significant hit
contig_65_np12      546      N     No significant hit
contig_861_np12     4855     N     No significant hit
contig_254_np12     1614     N     No significant hit
contig_119_np12     9447     B     Klebsiella quasipneumoniae
contig_988_np12*    2477     B     Klebsiella quasipneumoniae

- N = No significant hit
- B = Bacterial
- P = Plant

* fcs_gx reported this as viral

For the sequences classified by kraken2 as bacterial:

Kraken2 classification success rate = 16.67%

Kraken2 classification failure rate = 83.33%
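
The two percentages above follow directly from the Key column of the table. A minimal sketch of that arithmetic, assuming "success" means the BLAST best hit agrees with Kraken2's call (the function name `agreement_rate` is only illustrative):

```python
from collections import Counter

# BLAST-derived keys for the 12 contigs Kraken2 called bacterial,
# copied from the table above: 6 plant, 4 no significant hit, 2 bacterial.
blast_keys = ["P"] * 6 + ["N"] * 4 + ["B"] * 2


def agreement_rate(keys, expected):
    """Percentage of contigs whose BLAST key matches the Kraken2 call."""
    counts = Counter(keys)
    return 100.0 * counts[expected] / len(keys)


success = agreement_rate(blast_keys, "B")
print(f"success = {success:.2f}%, failure = {100 - success:.2f}%")
# prints: success = 16.67%, failure = 83.33%
```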

rosscrowhurst commented 1 year ago

According to kraken2 the following 23 contigs were classified as having no hits:

CONTIG              Size     Key   NCBI Blastn vs nt (discontinuous megablast) best hit species
contig_138_np12     6919     N     No significant hit
contig_422_np12     3997     P     Fraxinus pennsylvanica genome assembly, chromosome: 22
contig_184_np12     3222     P     Buxus sempervirens genome assembly, chromosome: 4
contig_79_np12      2689     N     No significant hit
contig_955_np12     2050     P     Camellia sinensis small nucleolar RNA Z101 (LOC114308532), ncRNA
contig_432_np12     1991     P     Camellia sinensis small nucleolar RNA Z101 (LOC114308532), ncRNA
contig_400_np12     1586     N     No significant hit
contig_388_np12     1543     N     No significant hit
contig_418_np12     1461     N     No significant hit
contig_407_np12     1287     N     No significant hit
contig_185_np12     1189     N     No significant hit
contig_236_np12     1185     P     Actinidia chinensis DNA, Y-specific genomic marker third one
contig_864_np12     1004     N     No significant hit
contig_58_np12      955      P     Buxus sempervirens genome assembly, chromosome: 4
contig_322_np12     892      N     No significant hit
contig_865_np12     865      N     No significant hit
contig_285_np12     739      N     No significant hit
contig_325_np12     575      N     No significant hit
contig_713_np12     558      P     Actinidia chinensis var. chinensis cultivar 4x chromosome 1 mitochondrion, complete sequence
contig_711_np12     553      P     Actinidia chinensis var. chinensis cultivar 4x chromosome 1 mitochondrion
contig_756_np12     531      P     Actinidia chinensis DNA, Y-specific genomic marker eighth one
contig_937_np12     496      P     Gossypioides kirkii chromosome KI_2_4
contig_57_np12      496      I     Ophion luteus genome assembly, chromosome: 7

KEY:
- N = No significant hit
- B = Bacterial
- I = Insect
- P = Plant

For the sequences classified by kraken2 as no hit:

Kraken2 classification success rate = 52%

Kraken2 classification failure rate = 48%

rosscrowhurst commented 1 year ago

@GallVp Maybe the database Kraken2 is using does not include all the sequences it should?

rosscrowhurst commented 1 year ago

@GallVp - incidentally, if the small contigs were culled, most of the anomalies would not be detected, so I wonder if Kraken2 was set up and tested only with chromosome-level sequences.

GallVp commented 1 year ago

    // To select a DB, see https://benlangmead.github.io/aws-indexes/k2
    // The pipeline automatically downloads the required DB if needed
    //
    // Using PlusPFP: archaea, viral, plasmid, human, UniVec_Core, protozoa, fungi & plant
    db_url              = "https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20230314.tar.gz"

GallVp commented 1 year ago

We are using their largest database: PlusPFP (108 G)

rosscrowhurst commented 1 year ago

Possible reason for the Homo sapiens calls: I took a couple of random contigs that Kraken2 classifies as 'Homo sapiens' and BLASTed them with BLASTn against an older copy of the GenBank Genomes databases that I have locally. So far, 3 randomly selected contigs return similar results: they hit the Homo sapiens genome, but the hits cover very short regions (about 40 bases) and are all microsatellite-like sequences.

Such hits are meaningless in this context. A possible way to reduce Kraken2's spurious classifications would be to run it on a hard-masked genome in which the hypervariable microsatellite-like sequences are masked with N. So the suggestion is: before Kraken2, hard-mask the microsatellites, then run Kraken2 on the microsatellite-masked genome sequence.
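
To make the masking suggestion concrete, here is a rough Python sketch of hard-masking microsatellite-like runs with N. It is not part of the pipeline. The definition used (a 1 to 6 bp motif repeated in tandem for at least 20 bases) and the names `build_ssr_pattern` and `hard_mask_microsatellites` are assumptions for illustration. FASTA reading and writing are left out.

```python
import re


def build_ssr_pattern(min_run=20, max_motif=6):
    """Compile a regex matching tandem repeats of 1..max_motif bp motifs
    whose total length is at least min_run bases (illustrative thresholds)."""
    parts = []
    for k in range(1, max_motif + 1):
        copies = -(-min_run // k)  # ceiling division: copies needed to reach min_run
        # Group k captures the motif; \k then requires (copies - 1) or more further copies.
        parts.append("([ACGT]{%d})\\%d{%d,}" % (k, k, copies - 1))
    return re.compile("|".join(parts), re.IGNORECASE)


SSR = build_ssr_pattern()


def hard_mask_microsatellites(seq, pattern=SSR):
    """Replace every matched tandem-repeat run with the same number of Ns."""
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


contig = "ACGTTGCC" + "AT" * 11 + "CCGTAGGC"
print(hard_mask_microsatellites(contig))
# prints: ACGTTGCCNNNNNNNNNNNNNNNNNNNNNNCCGTAGGC
```

Replacing each run with the same number of Ns keeps contig coordinates unchanged, so Kraken2 results on the masked sequence can still be mapped back to the original contigs.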

@GallVp @CeciliaDeng @christinawu2008

GallVp commented 1 year ago

Kraken2 uses a k-mer-based algorithm. Maybe someone has investigated how the accuracy of its results depends on the k-mer size parameter.
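
As a rough illustration of why k-mer matching and microsatellites interact badly, the sketch below counts the distinct k-mers in a simple dinucleotide repeat. It only shows k-mer extraction, not Kraken2's actual classification. The value k = 35 is taken from the Kraken2 manual's default for nucleotide databases and should be treated as an assumption here. The helper `distinct_kmers` is hypothetical.

```python
def distinct_kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 35                 # assumed default k-mer length (see lead-in)
repeat = "AT" * 30     # a 60 bp dinucleotide microsatellite

n_windows = len(repeat) - K + 1
n_distinct = len(distinct_kmers(repeat, K))
print(f"{n_windows} windows, {n_distinct} distinct k-mers")
# prints: 26 windows, 2 distinct k-mers
```

Every window of the repeat reduces to one of just two k-mers, so any database genome containing the same repeat matches all of them at once, whatever its taxon.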

GallVp commented 1 year ago

We may be able to create a taxonomic classification plot using Krona from the NCBI FCS-GX results.