problems with classification using nt database

ajaybabu27 commented 6 years ago

Hi Derrick,

I tried classifying my sample reads with kraken database containing only "nt" sequences and found that all my reads were classified at the root. Here is the output from the first few lines of the report file:

100.00% 14552 14535 R 1 root 0.12% 17 0 R1 131567 cellular organisms 0.12% 17 17 D 2 Bacteria

I did the same classification using the "standard database" and I got similar results that I found in Kraken1:

100.00% 14552 0 R 1 root 100.00% 14552 0 R1 131567 cellular organisms 100.00% 14552 0 D 2 Bacteria 44.38% 6458 0 D1 1783270 FCB group 41.62% 6057 0 P 1224 Proteobacteria 14.00% 2037 7 D1 1783272 Terrabacteria group

I wonder if there are some kmers that are found in some bacterial species, are also annotated in weird categories placed near the root of the taxonomy tree; specifically in the case of the "nt" database ?

tseemann commented 6 years ago

"with kraken database containing only "nt" sequences "

How did you build the database?

What does kraken2-inspect say? See http://ccb.jhu.edu/software/kraken/MANUAL.html#inspecting-a-kraken-2-databases-contents

ajaybabu27 commented 6 years ago

I built the "nt" database with the following commands,

kraken2-build --download-taxonomy --db kraken2_db_custom_20180718
kraken2-build --download-library nt --db kraken2_db_custom_20180718 kraken2-build --build --threads 36 --db kraken2_db_custom_20180718

Essentially the database only consisted of sequences from the "nt" library. If I am not wrong, "refseq" is a subset of "nt" library. Thus, I only used "nt" library to build my database.

I gave up running kraken2-inspect for my database (kraken2_db_custom_20180718) since it kept running continuously for 2 days (using around 300 Gb of memory and one core). Unfortunately, the script is not multi-threaded.

Have you tried classifying your metagenomic samples using the "nt" database (the way I have built it) ?

tseemann commented 6 years ago

I haven't built the nt database, but am considering it. I work in microbial genomics so I only want viruses -> protists + human. I'm happy for the rest to be unclassified. My main issue even with refseq is mistakes in taxonomy at species level.

DerrickWood commented 6 years ago

@ajaybabu27 Without knowing what your sample reads are (simulated? source genomes?), it's difficult to say why they are now classified at the root, beyond that the data in those sample reads looks to come from more than one kingdom (when comparing against the whole of nt).

ajaybabu27 commented 6 years ago

@DerrickWood It is a 16S (V4) amplicon gut microbiome sample (human). I am curious to test if Kraken2 with "nt" database can detect any contaminations in my sequencing library (barcoded+non-barcoded reads) in addition to classifying my reads to different taxa. I have used Kraken1 and Kraken2 (standard db) for classifying the above mentioned demultiplexed sample (in my first query) in the past and it gave expected results. I find it hard to believe that almost all of my reads in this sample also maps to additional kingdoms when using "nt" .

Is there a way I can get a lookup table of all the kmers annotated across different taxa in a given kraken db ? This way I could debug to see where the kmers from my reads would potentially map in the taxonomy tree given the annotations of LCA kmers to each taxa in the tree. Let me know if this would be possible.

Thanks, Ajay.

DerrickWood commented 6 years ago

@ajaybabu27 Unfortunately, Kraken 2 doesn't easily allow for looking up which k-mers are in the database. That is part of the tradeoff made to get the database size smaller in Kraken 2. You could classify the entire library and then use that information to approximate that, but that isn't a trivial (nor memory-light) task.

In terms of debugging, though, it might be helpful to look at the Kraken output for classifying your reads. The final tab-delimited column is a space-separated list indicating which k-mers were associated with specific LCAs. You could also try using something like BLAST (probably MEGABLAST, not BLASTN) to search some of these sequences against nt, and review the top several hits for the sequences. It's possible that nt has some poor taxonomic annotations that could confuse the assignment of these kinds of reads for Kraken.

tseemann commented 6 years ago

"It's possible that nt has some poor taxonomic annotations"

Do you have a feel for how bad it could be?

Even for Refseq there are species level mistakes, but nt is made of lots of very old sequences.

rwst commented 6 years ago

There are not only problems with taxonomic classifications of databases but also with contamination, see Lu/Salzberg, http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006277. I found out the hard way yesterday that protozoa gives false Toxoplasma and Plasmodium positives, just as described in the above paper with EuPathDB.

I think providing clean databases may need a concerted effort.

ajaybabu27 commented 6 years ago

@tseemann, I agree with @rwst. I was trying to classify non-barcoded reads from a sequencing run. It should at the least have 30% PhiX reads (estimated based on direct mapping of non-barcoded reads to PhiX genome). Yet, when I classified using RefSeq Database (Kraken 1 and Kraken 2), only 3% of reads were classified as PhiX. I suspect PhiX sequence contamination is pervasive in NCBI RefSeq database since its used as a spike-in control in all Illumina based sequencing runs. See, https://standardsingenomics.biomedcentral.com/articles/10.1186/1944-3277-10-18

jenniferlu717 commented 4 years ago

Given how old this thread is, I'm going to close this issue. If you continue to have questions/concerns, please open a new issue.

Regarding testing where kmers from the database will end up classifying. you can run the first couple steps of the bracken manual (Step 1a) https://ccb.jhu.edu/software/bracken/

DerrickWood / kraken2

problems with classification using nt database #34