DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

16S databases or full database? #770

Open AlexandreThibodeauUdM opened 8 months ago

AlexandreThibodeauUdM commented 8 months ago

Hey all, I am starting to integrate Kraken2 into my Mothur pipeline. I really love Mothur, but its taxonomic assignment is not optimal for my situation. This is still a work in progress: I am building a pipeline that takes the full set of sequences (not just the unique ones) cleaned by Mothur and split by sample, runs Kraken2 on each sample, runs Bracken, combines the reports, makes a BIOM file, and loads it into R (phyloseq) for my normal analysis code.
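The per-sample part of this plan could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the file names, the database path `k2_db`, the read length of 250, and the helper name `per_sample_commands` are all assumptions, and the commands are built as argument lists rather than executed. Bracken also expects a k-mer distribution file for the chosen read length inside the database directory.

```python
from pathlib import Path


def per_sample_commands(fasta, db, read_len=250, level="G", threads=4):
    """Build (not run) the Kraken2 and Bracken commands for one sample FASTA.

    All paths and defaults here are hypothetical; adjust them to your layout.
    """
    sample = Path(fasta).stem
    report = f"{sample}.kreport"
    kraken_cmd = [
        "kraken2", "--db", str(db), "--threads", str(threads),
        "--report", report, "--output", f"{sample}.kraken", str(fasta),
    ]
    bracken_cmd = [
        "bracken", "-d", str(db), "-i", report,
        "-o", f"{sample}.bracken", "-r", str(read_len), "-l", level,
    ]
    return kraken_cmd, bracken_cmd


# Print the two commands for one hypothetical sample file.
for cmd in per_sample_commands("sampleA.fasta", "k2_db"):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths are real; looping over the per-sample FASTA files then gives one report per sample to combine afterwards.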

So, for the Kraken2 part, this is what I have done so far. I downloaded the bacteria and archaea libraries and built the database, but had to cap it at 50 GB because of a RAM "issue" on my computer (the full database would be 67 GB and I have 65 GB, shame). I ran it on a test file (the final FASTA file from Mothur, which is not what I will use because it contains only unique sequences, but I used it just for fun) and I am relatively happy with the classification, especially at the genus level. My other option would be to use, as the library, the RDP or Silva databases already prepared by the kind people of Kraken2.
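For anyone wanting to reproduce a size-capped build like this, a minimal sketch of the `kraken2-build` calls is below. The database name `k2_db` and the helper `build_commands` are made up for illustration, 50 GB is taken here as 50 * 10^9 bytes (an assumption), and the commands are only assembled, not run.

```python
def build_commands(db, max_gb=50, libraries=("bacteria", "archaea")):
    """Return kraken2-build commands for a size-capped database (not executed)."""
    max_bytes = max_gb * 10**9  # --max-db-size takes a size in bytes
    cmds = [["kraken2-build", "--download-taxonomy", "--db", db]]
    for lib in libraries:
        cmds.append(["kraken2-build", "--download-library", lib, "--db", db])
    cmds.append(
        ["kraken2-build", "--build", "--db", db, "--max-db-size", str(max_bytes)]
    )
    return cmds


# Show the commands for a hypothetical database directory.
for cmd in build_commands("k2_db"):
    print(" ".join(cmd))
```

With `--max-db-size`, Kraken2 downsamples the reference minimizers to fit the cap, so some loss of sensitivity compared with the uncapped database is to be expected.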

So, questions:

Has anyone compared, for 16S, "full" databases vs. the special 16S databases? If yes, how does the performance of the two compare?

If one chooses the special 16S databases, is there a way to combine both RDP and Silva into one database for taxonomic assignment?

Thank you for your time. Have a nice day!

AlexandreThibodeauUdM commented 7 months ago

Here is an update:

- Used Mothur up to the classification step.
- Deuniqued the sequences using Mothur.
- Split by group using Mothur.
- Removed the "-" characters from the Mothur sequences using a sed command on Ubuntu.
- Ran Kraken2.
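The gap-removal step can also be done in a way that leaves the FASTA header lines alone, which matters if sequence names themselves contain "-". A minimal sketch, assuming the gaps come from a Mothur alignment and are written as "-" or "." (the function name `strip_gaps` is just for illustration):

```python
def strip_gaps(lines, gap_chars="-."):
    """Yield FASTA lines with gap characters removed from sequence lines only."""
    table = str.maketrans("", "", gap_chars)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header line: keep as-is, names may contain "-"
        else:
            cleaned = line.translate(table)
            if cleaned:  # skip lines that were empty or consisted only of gaps
                yield cleaned


example = [">seq1-A\n", "AC-GT--\n", "..ACGT\n"]
print(list(strip_gaps(example)))  # ['>seq1-A', 'ACGT', 'ACGT']
```

To process a real file, pass an open file handle as `lines` and write the yielded lines to a new FASTA file, one per line.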

While looking at the reports:

Tried Silva 16S with different k-mer sizes, lower than the default 31 or even higher. Results: bad classification of my positive control. I did find most of the corresponding taxa, but the classification percentages are off by a mile.
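One way to put a number on "off by a mile" is a Bray-Curtis dissimilarity between the expected composition of the positive control and the relative abundances read from the report. The sketch below assumes both are available as taxon-to-fraction mappings; the taxon names and values are placeholders, not results from this thread.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two taxon -> abundance mappings.

    0.0 means identical composition, 1.0 means no shared abundance.
    """
    taxa = set(expected) | set(observed)
    num = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    den = sum(expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa)
    return num / den if den else 0.0


# Placeholder values for illustration only.
expected = {"GenusA": 0.5, "GenusB": 0.5}
observed = {"GenusA": 0.9, "GenusB": 0.1}
print(round(bray_curtis(expected, observed), 3))
```

Computing this once per database or k-mer setting gives a single value to compare builds against each other.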

Still waiting to retry the bacteria + archaea "full" database if the download problem gets resolved.

Please update the RDP special database so I can also try it.

AlexandreThibodeauUdM commented 7 months ago

Silva 16S is unusable for classifying my positive control (16S, 2x250 bp, Illumina sequenced). I used Mothur up to OTU making, deuniqued the sequences, split the groups into separate FASTA files, removed the "-" characters (if I leave them in, nothing gets classified), and ran Kraken2 using different combinations of Silva 16S special database builds.
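For reference, trying different Silva builds amounts to varying the k-mer and minimizer options of `kraken2-build`. The sketch below only assembles such a command. The helper name and database name are hypothetical, and the defaults in the signature (k-mer length 35, minimizer length 31, 7 minimizer spaces) are what the Kraken 2 manual appears to list for nucleotide databases, so please check them against your installed version.

```python
def silva_build_command(db, kmer_len=35, minimizer_len=31, minimizer_spaces=7):
    """Return (not run) a kraken2-build command for the Silva 16S special database."""
    # A minimizer is taken from within a k-mer, so it cannot be longer than k.
    if minimizer_len > kmer_len:
        raise ValueError("minimizer length cannot exceed k-mer length")
    return [
        "kraken2-build", "--db", db, "--special", "silva",
        "--kmer-len", str(kmer_len),
        "--minimizer-len", str(minimizer_len),
        "--minimizer-spaces", str(minimizer_spaces),
    ]


# Example: a shorter k-mer needs a correspondingly shorter minimizer.
print(" ".join(silva_build_command("silva_k25", kmer_len=25,
                                   minimizer_len=21, minimizer_spaces=5)))
```

Lowering only the k-mer length below the minimizer length is rejected here, which is one thing to check when a build with a small k behaves unexpectedly.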

Bacteria download still not working. Well, I tried!

jenniferlu717 commented 6 months ago

I think we may have to remove the RDP support as RDP itself is no longer available/being supported.

Why are there "-" characters in your sequences?

Did you try any of the minikraken databases available here: https://benlangmead.github.io/aws-indexes/k2