biocore / tcga

Microbial analysis in TCGA data
BSD 3-Clause "New" or "Revised" License

Kraken TCGA microbial-detection pipeline #24

Closed (nlawlor closed this issue 2 years ago)

nlawlor commented 2 years ago

Hello,

I am wondering where I can find the scripts/pipeline needed to map unaligned (non-human) reads (BAM files) against bacterial and viral genomes using Kraken.

My goal is to apply the same pipeline used for detecting microbial abundances (Kraken TCGA microbial-detection pipeline) in your Nature publication (https://www.nature.com/articles/s41586-020-2095-1) to different WGS/RNA-seq samples from another study.

Additionally, I am unable to access the reference set of 71,782 microbial genomes that were downloaded using RepoPhlan (https://bitbucket.org/nsegata/repophlan); the link does not work. Is this information provided somewhere else?

Thank you for your amazing research and for providing the code! Please let me know if my questions are unclear or if you need further info from me.

Best, Nathan

ekopylova commented 2 years ago

Hello @nlawlor,

The original database used for the paper (with 71,782 filtered genomes) was built ~5 years ago. The prokaryote database has grown considerably since then, so I would suggest rebuilding it for better accuracy. It seems the repophlan repository has moved to GitHub; there you can find all the instructions for generating a new database, and you can use the screen.py script (provided in the same repository) to filter for high-quality genomes.

The workflow itself for running Kraken was built on the Cancer Genomics Cloud, given that a full Kraken index for the genome database required ~420 GiB RAM and 1.5 TB disk space to execute. If you send me an e-mail, we can provide you with access to the pipeline.

Thanks! Jenya
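
For readers who want a rough picture of the steps discussed above (extract unaligned reads from a BAM file, then classify them with Kraken), here is a minimal Python sketch. It is not the authors' Cancer Genomics Cloud workflow, which is only available on request. It assumes that `samtools` and the Kraken 1 command-line tools (`kraken`, `kraken-report`) are installed, that a Kraken database has already been built, and that treating the unmapped reads as single-end is acceptable. The function names, file names, and paths are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(bam, db_dir, out_dir, threads=8):
    """Return (argv, stdout_path) pairs for each step.

    stdout_path is None when the tool writes its own output file.
    All output file names are illustrative, not taken from the paper.
    """
    out = Path(out_dir)
    stem = Path(bam).stem
    unmapped = out / f"{stem}.unmapped.bam"
    fastq = out / f"{stem}.unmapped.fastq"
    kraken_out = out / f"{stem}.kraken"
    report = out / f"{stem}.kraken_report.txt"
    return [
        # Keep only reads flagged as unmapped (SAM flag 0x4).
        (["samtools", "view", "-b", "-f", "4", "-o", str(unmapped), str(bam)], None),
        # Convert the unmapped reads to FASTQ; reads go to stdout here.
        (["samtools", "fastq", str(unmapped)], fastq),
        # Classify the reads against the Kraken database (Kraken 1 CLI assumed).
        (["kraken", "--db", str(db_dir), "--threads", str(threads),
          "--fastq-input", "--output", str(kraken_out), str(fastq)], None),
        # Summarise the classifications per taxon; the report goes to stdout.
        (["kraken-report", "--db", str(db_dir), str(kraken_out)], report),
    ]


def run(commands, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for argv, stdout_path in commands:
        if dry_run:
            print(" ".join(argv) + (f" > {stdout_path}" if stdout_path else ""))
            continue
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


if __name__ == "__main__":
    # Hypothetical inputs; dry_run=True only prints the commands.
    run(build_commands("sample.bam", "/path/to/kraken_db", "results"))
```

With `dry_run=True` nothing is executed, so the commands can be reviewed first. Given the memory footprint mentioned above (~420 GiB RAM for the full index), the classification step would need a correspondingly large machine or a smaller database.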

nlawlor commented 2 years ago

Hi Jenya,

Thank you so much for your quick reply! Understood, I will visit that repo and rebuild the bacterial reference database.

Ok, I will send you an email shortly. Thank you again for your help!

Best, Nathan

tmrnov commented 1 year ago

@ekopylova Hi Jenya, I am also interested in applying your CGC pipeline to my data. Would it still be possible to get access to the pipeline? Thanks