DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Clarification on using Kraken2 for 16S amplicon sequencing #140

MonicaSteffi closed this issue 4 years ago

MonicaSteffi commented 4 years ago

Dear all, we have done V1-V9 Illumina sequencing for our amplicon sequence analysis. I ran the standard QIIME2 pipeline with the GG database, but the QIIME2 results did not give me any species-level information. So I tried Kraken2 with the minikraken database (ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v1_8GB_201904_UPDATE.tgz) and with the special GG database. However, the taxonomy I get from these two databases differs dramatically at the species level. For example, with the NCBI RefSeq database the top bacterium is Pseudomonas aeruginosa, whereas with GG it is Pseudomonas veronii. Which one is reliable? Kindly guide me.

pedrocr83 commented 4 years ago

Hey! You should have a look at the Kraken2 manual to see how the taxonomy is assigned. The algorithm splits the reads into k-mers of a fixed size and assigns each of these to a taxon, while QIIME2 uses the full reads, clusters them, and then assigns the taxa. Try running Kraken2 with the --confidence parameter and also using Bracken (https://ccb.jhu.edu/software/bracken/). Then you can compare the results and see whether that makes a difference. I would be interested in the outcome, by the way...
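To make the k-mer idea concrete, here is a deliberately simplified Python sketch. It is not Kraken2's actual implementation: the real tool uses minimizers and resolves hits along the taxonomy tree, whereas this toy version just takes a majority vote. The database contents and the value k=4 are made up for illustration.

```python
from collections import Counter


def kmers(read, k):
    """Return every overlapping k-mer of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def classify(read, kmer_to_taxon, k):
    """Label a read with the taxon hit by the most k-mers (None if no k-mer hits)."""
    hits = Counter(
        kmer_to_taxon[km] for km in kmers(read, k) if km in kmer_to_taxon
    )
    if not hits:
        return None  # no k-mer found in the database: unclassified
    return hits.most_common(1)[0][0]


# Toy database (hypothetical): 4-mers mapped to taxon names.
toy_db = {
    "ACGT": "Pseudomonas aeruginosa",
    "CGTA": "Pseudomonas aeruginosa",
    "GTAC": "Pseudomonas",
}

print(classify("ACGTAC", toy_db, k=4))
```

Because every k-mer votes on its own, a short read that shares only a few k-mers with a reference can still receive a label, which is one reason a k-mer classifier and a clustering approach can disagree on the same data.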

MonicaSteffi commented 4 years ago

Hi @pedrocr83, thank you for the suggestion. I'll try it, compare the outcome, and share it with you.

MonicaSteffi commented 4 years ago

> The algorithm splits the reads into k-mers of a fixed size and assigns each of these to a taxon, while QIIME2 uses the full reads, clusters them, and then assigns the taxa. Try running Kraken2 with the --confidence parameter and also using Bracken

Hi @pedrocr83, I have tried the --confidence parameter with 0.01, 0.1, 0.5 and 1. For 0.5 and 1, all the sequences are unclassified (100% unclassified). For 0.1, 99% are unclassified and only 1% is classified. For 0.01, 50-70% are unclassified and 30% are classified.
How do I decide on the confidence score? Also, with a confidence of 0.01 the result is almost the same as before.

pedrocr83 commented 4 years ago

Hi @selffi,

That is odd, but it seems to be related to your reads rather than to the Kraken algorithm. You say you are using the V1-V9 regions. Do you have forward and reverse reads?

I use a confidence score of 0.85 and get over 90% of reads classified. Even at 100% confidence I still get about 50% of reads classified. I use multiple regions and have forward and reverse reads. I also use the --paired parameter, after separating my fastq file into forward and reverse fastq files, to increase my classification rate.
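For reference, the Kraken2 manual describes the confidence score of a label as C/Q, where C is the number of k-mers mapped into the clade rooted at that label and Q is the number of k-mers in the read without an ambiguous nucleotide. If the score is below the threshold, the label moves up to its parent, and the read becomes unclassified if even the root fails. The following Python sketch mimics that rule on a toy taxonomy. The taxonomy, the hit counts, and the function name are made up for illustration.

```python
def apply_confidence(label, kmer_hits, parent, total_kmers, threshold):
    """Move a label up the tree until its clade's share of k-mers meets the threshold."""

    def in_clade(taxon, root):
        # Walk from `taxon` towards the root, looking for `root` on the way.
        while taxon is not None:
            if taxon == root:
                return True
            taxon = parent.get(taxon)
        return False

    while label is not None:
        clade_hits = sum(n for t, n in kmer_hits.items() if in_clade(t, label))
        if clade_hits / total_kmers >= threshold:
            return label
        label = parent.get(label)  # score too low: try the parent taxon
    return None  # even the root failed the threshold: unclassified


# Toy taxonomy (child -> parent); "Bacteria" acts as the root here.
parent = {
    "Pseudomonas aeruginosa": "Pseudomonas",
    "Pseudomonas veronii": "Pseudomonas",
    "Pseudomonas": "Bacteria",
}
# Made-up k-mer hits for one read with 100 unambiguous k-mers (20 hit nothing).
kmer_hits = {
    "Pseudomonas aeruginosa": 30,
    "Pseudomonas veronii": 10,
    "Pseudomonas": 20,
    "Bacteria": 20,
}

for threshold in (0.1, 0.5, 0.85):
    print(threshold, apply_confidence(
        "Pseudomonas aeruginosa", kmer_hits, parent, 100, threshold))
```

In this toy case the species call survives a threshold of 0.1, falls back to the genus at 0.5, and is lost entirely at 0.85, so a sharp drop in classified reads at higher thresholds suggests that only a small share of each read's k-mers is hitting the database.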

MonicaSteffi commented 4 years ago

Hey @pedrocr83, I am using V1-V9 sequences with forward and reverse reads. I used trimgalore to remove poor-quality reads and adaptors, and executed the following command:

./kraken2 --db minikraken2 --paired --confidence 0.01 --threads 4 --report 1.report --output GG.output R1_val_1.fq R2_val_2.fq
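The command above can be repeated at several confidence thresholds with a small helper. This Python sketch only builds the argument lists, using the flags already shown in this thread. The helper name and the per-threshold output file names are assumptions chosen for illustration.

```python
def kraken2_command(db, r1, r2, confidence, threads=4):
    """Build a kraken2 argument list for one paired-end run at a given threshold."""
    tag = f"conf{confidence}"  # hypothetical naming scheme for the output files
    return [
        "kraken2",
        "--db", db,
        "--paired",
        "--confidence", str(confidence),
        "--threads", str(threads),
        "--report", f"{tag}.report",
        "--output", f"{tag}.output",
        r1, r2,
    ]


# Print one command per threshold instead of running it.
for confidence in (0.01, 0.1, 0.5):
    print(" ".join(kraken2_command(
        "minikraken2", "R1_val_1.fq", "R2_val_2.fq", confidence)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once kraken2 is on the PATH. Writing a separate report per threshold makes it easier to compare the classified fraction between runs.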

pedrocr83 commented 4 years ago

Hey @selffi

The first issue I see is that you are using the minikraken db and not a 16S rRNA db such as GreenGenes 13_8. I suggest you change it; you can download it from the Kraken2 website. Because it is specific to 16S, it should greatly increase your number of classified reads.

It is also possible that you are removing too many bases when trimming the adaptors, which can mean that the algorithm cannot overlap the forward and reverse reads... I have never used trimgalore, so I'm unsure how it works. Can you try without the quality control?

Also try this way

./kraken2 --threads 4 --confidence 0.85 --db greengenes_db --paired R1_val_1.fq R2_val_2.fq --report 1.report

Are you using the GreenGenes db when using QIIME2?

MonicaSteffi commented 4 years ago

hi @pedrocr83

> The first issue I see is that you are using the minikraken db and not a 16S rRNA db such as GreenGenes 13_8. I suggest you change it; you can download it from the Kraken2 website. Because it is specific to 16S, it should greatly increase your number of classified reads.

I tried both databases: minikraken (ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v1_8GB_201904_UPDATE.tgz) and the specialized 16S GG database. As I mentioned in my issue, I got different results at the species level.

> It is also possible that you are removing too many bases when trimming the adaptors, which can mean that the algorithm cannot overlap the forward and reverse reads... I have never used trimgalore, so I'm unsure how it works. Can you try without the quality control?

The quality of my reverse reads is poor after 200 bp, and the sequences also have adaptor contamination. If I skip the quality processing steps, I may be led to the wrong conclusion, right?

> Are you using the GreenGenes db when using QIIME2?

Yes. In QIIME2, I tried both SILVA and GG, but unfortunately I could not get species-level information. To get species-level information, I switched to Kraken2.

pedrocr83 commented 4 years ago

Hi @selffi

When you used GG in Kraken, did you get a higher percentage of classified reads?
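One way to check this is to read the numbers straight from the file written by --report. The Python sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, reads in clade, reads assigned directly, rank code, taxonomy ID, indented name). The sample lines, including the taxonomy IDs, are made-up placeholders and not real results.

```python
def read_report(lines):
    """Parse standard 6-column Kraken2 report lines into dictionaries."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        rows.append({
            "percent": float(fields[0]),
            "clade_reads": int(fields[1]),
            "direct_reads": int(fields[2]),
            "rank": fields[3],
            "taxid": fields[4],
            "name": fields[5].strip(),  # names are indented by tree depth
        })
    return rows


def classified_percent(rows):
    """Percentage of reads that received any label (100 minus the 'U' line)."""
    unclassified = sum(r["percent"] for r in rows if r["rank"] == "U")
    return 100.0 - unclassified


def top_species(rows, n=3):
    """Species-level rows ('S' rank code), largest clade first."""
    species = [r for r in rows if r["rank"] == "S"]
    return sorted(species, key=lambda r: r["clade_reads"], reverse=True)[:n]


# Made-up report lines for illustration only (placeholder taxonomy IDs).
sample = [
    " 60.00\t600\t600\tU\t0\tunclassified",
    " 40.00\t400\t0\tR\t1\troot",
    " 30.00\t300\t300\tS\t1001\t        Pseudomonas aeruginosa",
    " 10.00\t100\t100\tS\t1002\t        Pseudomonas veronii",
]

rows = read_report(sample)
print(classified_percent(rows))
print([r["name"] for r in top_species(rows, 2)])
```

Running the same functions on the report from each database gives a side-by-side view of the classified percentage and the top species, which is the comparison being asked about here.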

I found that getting species-level information with Kraken is possible and accurate, but you need a good confidence level. As an example, I was getting Salmonella hits when the reads were either clones or E. coli. Only when I increased the confidence was I able to get the right hits.

Also, having 200 bp should be sufficient, as only the V4 region can go that far and beyond.

Furthermore, given that you have multiple regions, you should consider a pipeline like the one in this article (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4734828/), using Kraken as the classifier. This will solve two problems: it will definitely get you species-level accuracy, because your read length will be closer to 1400 bp (best-case scenario), and it will also give you more accurate relative frequencies for the bacteria in your files, as you are not getting hits for each individual region but for the whole(ish) 16S rRNA.

jenniferlu717 commented 4 years ago

Minikraken does not have quite as many bacterial genomes as the 16S databases, and it is not as focused on 16S sequences, hence the difference between the two. A confidence threshold of 0.1 should be sufficient, although this may lead to more unclassified reads. Otherwise, the information discussed above is accurate.