water control reads removal

DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system

MIT License


water control reads removal #290

Open dcm9123 opened 4 years ago

dcm9123 commented 4 years ago

Hello!

I am relatively new to metagenomics and have been using Kraken2 for my analysis. I was wondering whether Kraken2 has a way of removing DNA reads that do not belong to the sample being analyzed. For instance, I've been analyzing pulmonary tract samples from patients (n=6) along with a water control. In the water control I found a relatively low number of bacterial, archaeal, and viral reads, and a lot of human reads. Is there any way for Kraken2 to eliminate the water control reads from the clinical samples? I am guessing that the water control (WC) reads come from lab contamination and the handling of samples and materials.

Thanks in advance,

Daniel

AGalanis97 commented 4 years ago

As far as I know, that's not possible with a specific Kraken command, but you can remove contaminants either downstream or upstream of your analysis. You can look at decontam, which you can incorporate into your pipeline: https://github.com/benjjneb/decontam

jenniferlu717 commented 4 years ago

Currently, no, we do not have a script for that. Kraken-related scripts can be found at https://github.com/jenniferlu717/KrakenTools.

The extract_kraken_reads.py script lets you modify the samples by removing sequences (--exclude) that match a given set of taxonomy IDs (and, with --include-children and/or --include-parents, their child and/or parent taxids).

If you know the taxonomy IDs classified in the water control, you can provide them to that script.
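As a rough illustration of the suggestion above, the sketch below collects taxonomy IDs from a water control's Kraken2 report and assembles an extract_kraken_reads.py command line that excludes them. It assumes the standard six-column, tab-separated report (produced without --report-minimizer-data). The function names, the read-count threshold, and the file names are hypothetical, and this is not an official KrakenTools workflow.

```python
import shlex


def control_taxids(report_text, min_direct_reads=1):
    """Return taxids with at least min_direct_reads reads assigned directly.

    Assumes the standard 6-column Kraken2 report:
    percent, clade reads, direct reads, rank code, taxid, name.
    """
    taxids = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or other report layouts
        direct_reads = int(fields[2])
        taxid = fields[4].strip()
        if taxid == "0":
            continue  # taxid 0 is "unclassified", not a taxon to exclude
        if direct_reads >= min_direct_reads:
            taxids.append(taxid)
    return taxids


def build_exclude_command(kraken_out, reads_fastq, filtered_out, taxids):
    """Build a command that keeps only reads NOT assigned to the given taxids."""
    args = [
        "extract_kraken_reads.py",
        "-k", kraken_out,      # per-read Kraken2 output for the clinical sample
        "-s", reads_fastq,     # reads for the same sample
        "-o", filtered_out,    # where the remaining reads are written
        "--exclude",
        "--fastq-output",
        "-t", *taxids,
    ]
    return shlex.join(args)
```

Note that excluding every taxon seen in the control is blunt: in this thread the control contains many human reads, so taxid 9606 would be removed from the patient samples as well. Raising `min_direct_reads` or curating the list by hand may be preferable.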

Otherwise, pavian (a visualization tool, https://github.com/fbreitwieser/pavian) can be used to compare the samples, and then you can subtract the water control reads from the other sample reads. Let me know if you have any questions.
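For readers who want to try the subtraction idea outside pavian, here is a naive sketch that subtracts per-taxon direct read counts in the water control from those in a sample, clamping at zero. It is not how pavian works internally. It assumes standard six-column Kraken2 reports, the helper names are hypothetical, and it does not normalize for sequencing depth, so treat the result as a first look only.

```python
def direct_counts(report_text):
    """Map taxid -> reads assigned directly to that taxon (6-column report)."""
    counts = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or other report layouts
        counts[fields[4].strip()] = int(fields[2])
    return counts


def subtract_control(sample_counts, control_counts):
    """Subtract control counts from sample counts per taxid, never below zero."""
    return {
        taxid: max(n - control_counts.get(taxid, 0), 0)
        for taxid, n in sample_counts.items()
    }
```

Taxa present only in the control are ignored, and taxa absent from the control keep their original counts.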

promexjm commented 4 years ago

Is there a plan to also output a k-mer index (or unique ID) in the Kraken results (*.out file), so that k-mers present in the water control can be excluded from the real samples?