Feature addition request

methylnick commented 6 years ago

Thinking of adding a sample contamination check into the pipeline to get an assessment on sample purity.

This will become an increasing issue for those working in microbiome/host genomics, and also for xenograft experiments (human/mouse), to give two examples.

One tool I have used is FastQ Screen (https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/), as a suggestion. I am sure there are other equivalent tools.

serine commented 6 years ago

Adding another tool that relates to this thread, http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools

I think this tool can sample a FASTQ file and screen the reads against reference Bloom filters to see contamination from different species.

pansapiens commented 5 years ago

I've been experimenting with mash screen: https://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes

For all these tools you need some kind of reference database(s) to screen against, which can involve different amounts of mucking around to set up properly depending on the tool (e.g. Bowtie indices vs pre-computed Bloom filter databases vs pre-computed 'sketch' indices).

mash screen

Pros: a single, relatively small reference database (RefSeq genomes) is provided, which makes it simple to set up a pretty comprehensive screening db. Pretty fast.

Cons: no MultiQC plugin (yet)
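
Until there is a MultiQC plugin, the mash screen output could be summarised with a few lines of Python. This is only a minimal sketch: it assumes the six tab-separated columns described in the Mash docs (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment). The function names, the 0.9 identity cutoff and the example rows are all made up for illustration, not taken from a real run.

```python
from typing import List, NamedTuple


class ScreenHit(NamedTuple):
    identity: float
    shared_hashes: int
    total_hashes: int
    median_multiplicity: float
    p_value: float
    query_id: str
    comment: str


def parse_mash_screen(text: str) -> List[ScreenHit]:
    """Parse tab-separated mash screen output (assumed six-column layout)."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        shared, total = fields[1].split("/")  # e.g. "2950/3000"
        hits.append(ScreenHit(
            identity=float(fields[0]),
            shared_hashes=int(shared),
            total_hashes=int(total),
            median_multiplicity=float(fields[2]),
            p_value=float(fields[3]),
            query_id=fields[4],
            comment=fields[5],
        ))
    return hits


def top_hits(hits: List[ScreenHit], min_identity: float = 0.9) -> List[ScreenHit]:
    """Keep hits at or above an (arbitrary) identity cutoff, best first."""
    kept = [h for h in hits if h.identity >= min_identity]
    return sorted(kept, key=lambda h: h.identity, reverse=True)


# Made-up rows for illustration only; not output from a real sample.
EXAMPLE = (
    "0.9136\t450/3000\t2\t1e-50\tgenome_B.fna.gz\thypothetical organism B\n"
    "0.9992\t2950/3000\t12\t0\tgenome_A.fna.gz\thypothetical organism A\n"
    "0.8031\t30/3000\t1\t0.002\tgenome_C.fna.gz\thypothetical organism C\n"
)

if __name__ == "__main__":
    for hit in top_hits(parse_mash_screen(EXAMPLE)):
        print(f"{hit.identity:.4f}\t{hit.query_id}\t{hit.comment}")
```

A table like this could then be written out per sample and compared across a run, which would be a rough stand-in for what a MultiQC module would show.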

fastq_screen

Pros: MultiQC support. Database download is now simple (but huge and slow) since they've added the --get_genomes option (it used to involve more mucking around, which is why I previously felt RNAsik should explore another option). The reference databases provided by fastq_screen are probably better than the mash RefSeq database for routine screening, since the fastq_screen databases are built for the task and include common contaminants, adapters, etc. as well as model organisms.

Cons: The precomputed reference database might not be as comprehensive as the mash RefSeq database for detecting more obscure organisms. Bowtie / BWA dependency, though that's not a big issue now that we are recommending conda as the supported deployment method.
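
For the human/mouse xenograft case mentioned earlier in the thread, the fastq_screen setup might look roughly like the sketch below. It assumes the config file uses tab-separated `DATABASE <name> <index prefix>` lines plus a `THREADS` line, and that the `--conf`, `--aligner`, `--subset` and `--outdir` options behave as documented. The index paths, file names and helper functions are hypothetical, so check them against the example config that ships with fastq_screen before relying on this.

```python
from typing import Dict, List


def make_config(databases: Dict[str, str], threads: int = 4) -> str:
    """Render a fastq_screen config (assumed layout: THREADS + DATABASE lines)."""
    lines = [f"THREADS\t{threads}"]
    for name, index_prefix in databases.items():
        lines.append(f"DATABASE\t{name}\t{index_prefix}")
    return "\n".join(lines) + "\n"


def make_command(conf_path: str, fastq: str, outdir: str,
                 subset: int = 100000) -> List[str]:
    """Build the argument list for a fastq_screen run using Bowtie2."""
    return [
        "fastq_screen",
        "--conf", conf_path,
        "--aligner", "bowtie2",
        "--subset", str(subset),  # number of reads to sample
        "--outdir", outdir,
        fastq,
    ]


if __name__ == "__main__":
    # Hypothetical Bowtie2 index prefixes; replace with real locations.
    config_text = make_config({
        "Human": "/refs/human/GRCh38",
        "Mouse": "/refs/mouse/GRCm38",
    })
    print(config_text, end="")
    print(" ".join(make_command("xenograft.conf", "sample_R1.fastq.gz", "screen_out")))
```

The command is returned as a list so it can be passed straight to `subprocess.run` without shell quoting problems.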

biobloom

Pros: MultiQC support.

Cons: As far as I can tell, no precomputed reference databases are provided.

pansapiens commented 5 years ago

Another option to consider (designed more for human data with potential microbial contamination):

PathSeq

http://software.broadinstitute.org/pathseq/ https://software.broadinstitute.org/gatk/documentation/article?id=10913

Doesn't appear to be in bioconda, so probably a non-starter :/

MonashBioinformaticsPlatform / RNAsik-pipe