DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
707 stars 271 forks source link

Classification problems #115

Closed theo-allnutt-bioinformatics closed 5 years ago

theo-allnutt-bioinformatics commented 5 years ago

I have a gut metagenome that I have classified using kraken2 and a custom database of all genomes in NCBI RefSeq. Compared to other methods, e.g. Metaphlan2, kraken2 seems to identify far more species (>10 x). When I look at some of the hits for more obscure species, the classification does not seem to make sense. e.g. see the hit below for Leishmania:

C A00152:48:H3HYJDRXX:2:2215:15347:8061 Leishmania major strain Friedlin (taxid 347515) 151 347515:53

A00152:48:H3HYJDRXX:2:2215:15347:8061 1:N:0:TCGACGTC+GAGCCTTA CTACATACTTAGGCTCAAAGCAAGTGCCGTAGTAGTTGTCCGCTTCCTCCTCATCATCGTCAGAGCCGATGCCGTCCTCGTCCTCGTCCTCGTCGCCGTACTTCTCACGCATTCGCTCATTGACCTCGCCGTTGCACTCACCGAGCTTATA

When I retrieve this read and blast it, it matches different kingdoms, plants, bacteria, insects etc. with higher or similar scores to its match to Leishmania. So my question is why does kraken2 classify this sequence? Using the LCA algorithm, it should at least place it at the taxonomic root. This makes me doubt all the other results. Yes, this is a rare read, but it should still be unclassified.

Here are my commands:

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/archaea/archaea..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/invertibrate/invertibrate..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa/protozoa..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertibrate_mammalian/vertibrate_mammalian..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertibrate_other/vertibrate_other..1.genomic.fna.gz wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral..1.genomic.fna.gz gunzip .gz cat *.fna >refseq.fa

kraken2-build --add-to-library /stornext/HPCScratch/home/allnutt.t/db/refseq/refseq.fa --db refseq --threads 12

db limited to 60GB ram

kraken2-build --build --kmer-len 99 --minimizer-len 31 --max-db-size 60000000000 --db refseq --threads 12

kraken2 --db ~/db/kraken/refseq --threads 12 --classified-out 1hits.fa --confidence 0.9 --report 1.report --use-mpa-style --use-names -output 1.out

I had to make the db kmer large to minimise the db size but I have found similar problems with other dbs.

Thanks.

T.

ZhouHui0916 commented 5 years ago

hi, I have same problem with you. What is the basis for setting the confidence parameter?

theo-allnutt-bioinformatics commented 5 years ago

'confidence' in the assignment, as I have set it, is 90%. I have tried up to 99% and still get these spurious assignments. I am trying 'centrifuge' now to see if it does any better.

jenniferlu717 commented 5 years ago

I do agree that the confidence threshold should be set to around 0.9.

I also blasted the read provided and it resulted in 4 hits (all with identical identities/e-values), one of which is Leishmania major strain Friedlin complete genome, chromosome 21.

However, the other 3 hits are to predicted bee proteins, mRNA sequences, not hits to the actual genomes of those organisms:

PREDICTED: Apis mellifera ras-related protein Rab-37 (LOC408741), transcript variant X11, mRNA PREDICTED: Apis mellifera ras-related protein Rab-37 (LOC408741), transcript variant X1, mRNA PREDICTED: Polistes canadensis protein capicua homolog (LOC106791611), transcript variant X8, mRNA

While the database built will include the full genomes of the Apis mellifera and Polistes canadensis genomes (in the refseq invertebrate database). these predicted sequences will not be included. Therefore, Kraken2 is performing as expected by classifying the read at Leishmania.

Setting the confidence score will require more kmers to be classified for the read itself to receive a classification.

theo-allnutt-bioinformatics commented 5 years ago

Therefore Kraken can give incorrect classifications unless the entire nt database is used? Assuming that the 'bee' sequences are correctly assigned? Is there a way to increase the top hit margin, rather than confidence, to improve this situation? i.e. so that for spurious assignments like this, there is a better chance of the root nature of the sequence being revealed by the LCA. It doesn't seem like a goo didea to only accept 'equal' top hits into the LCA - there should be a margin of say 5% of the best hit score.

Thanks,

T.

jenniferlu717 commented 5 years ago

I would not say that the entire nt database is required for accurate classifications. Our standard database only draws from refseq complete genomes because these are the genomes NCBI and many others trust the most. Countless genomes are deposited in genbank daily but NCBI does not screen all of these for errors or completeness. Refseq is more curated however.

While refseq may not have all of the genomes in the world (and neither does nt for that matter), Kraken still maintains the ability to correctly classify or nearly correctly classify reads if the organisms are known.

However, any classification tool is only as good as its database and its not the size of the database that matters but the accuracy of the original database.

As an aside: I would not say the BLAST hits to predicted protein sequences are absolute truth that the read is definitely not Leishmania major. There were only 4 BLAST hit for the read and each match was about 25 base pairs long. The other three hits were predicted protein, transcriptional variants in mRNA, with no hits to the actual genome assemblies of those organisms. If your sequencing was RNA-seq-based, I may understand the concern/belief that the sequence is not Leishmania major, but there is something to be said about the fact that the read matched the genome of Leishmania and only transcriptional variants of the other three organisms.

Regarding "top hit margin": I don't understand what you're suggesting. The way the confidence parameter works is, with confidence of 0.9, 90% of the kmers in the read must be classified within the tree of taxon A, with at least one kmer classified as taxon A, for the read to get the classification A.

If a read had 50% of kmers at the root and 20% at species A, 10% at species B( for whatever strange reason), only a confidence of at most 70% would assign the read to A, otherwise, the read would get the assignment of "root".

theo-allnutt-bioinformatics commented 5 years ago

Ok, thanks for the explanation. So if 90% of the kmers from the read had to be assigned to Leishmania (and I accept that because it is from a metagenome, it could be real) why is there only one 25 bp hit? The reads were minimum 100 bp after quality filtering, so shouldn't that read show several other kmer hits to the refseq database, also to Leishmania? Or is is it 90% of assigned kmers that contribute to the confidence? If that is the case, should it not be 90% of all kmers, assigned or not?

Thanks.