bhattlab / bhattlab_workflows

Computational workflows for metagenomics tasks, by the Bhatt lab
http://www.bhattlab.com

Demultiplexing tip for undetermined fastq files #22

Open bfremin opened 5 years ago

bfremin commented 5 years ago

We have been getting data back as a giant fastq file of undetermined reads (instead of bcl) with the barcode in the read name. Most tools that demultiplex from fastq were very slow, could not be parallelized, and/or failed. This is just a pre-preprocessing tip.

You need two files: a file that lists your barcodes, and a script.

barcodes.txt:
samplenameA GGACTCCT+AGAGGATA
samplenameB TAGGCATG+AGAGGATA
samplenameC CTCTCTAC+AGAGGATA
...all your samples
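Because each line of barcodes.txt is later split into a sample name and a barcode, a malformed line would pass the wrong arguments to the script. A minimal sketch of a sanity check, assuming whitespace-separated columns; the function name `check_barcodes` is hypothetical and not part of the original tip:

```shell
#!/bin/bash
# Hypothetical helper: report any non-empty line that does not have
# exactly two whitespace-separated fields (sample name, barcode).
check_barcodes() {
    awk 'NF > 0 && NF != 2 {
             printf "line %d: expected 2 fields, got %d\n", NR, NF
             bad = 1
         }
         END { exit bad + 0 }' "$1"
}

# Demo on a made-up file whose second line is missing its barcode.
tmp=$(mktemp)
printf 'samplenameA GGACTCCT+AGAGGATA\nsamplenameB\n' > "$tmp"
check_barcodes "$tmp" || echo "barcodes file needs fixing"
rm -f "$tmp"
```

The function returns a non-zero status when it finds a bad line, so it can gate the submission step.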

demultiplex.sh:

#!/bin/bash

module load sickle/1.33

# demultiplex samples
# $1 = sample name, $2 = barcode

grep -A3 --no-group-separator -i $2 {giant_UndeterminedFile_1.fq} | gzip > $1_1.fq.gz &
grep -A3 --no-group-separator -i $2 {giant_UndeterminedFile_2.fq} | gzip > $1_2.fq.gz &
wait

# remove instances that do not have pairs (trimming will fail if you do not)

sickle pe -f $1_1.fq.gz -r $1_2.fq.gz -t sanger -o paired_$1_1.fq -p paired_$1_2.fq -s $1_single.fq

Run:
cat barcodes.txt | xargs -l bash -c 'sbatch ..... demultiplex.sh $0 $1'
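The `$0 $1` in the run command works because `bash -c` assigns the first extra argument to `$0` and the second to `$1`, so each barcodes.txt line arrives as sample name, then barcode. A dry-run sketch that only prints what would be dispatched, without calling `sbatch`; the function name `dry_run_dispatch` is hypothetical, and `-L 1` is the POSIX spelling of the `-l` option used above:

```shell
#!/bin/bash
# Hypothetical helper: show the demultiplex.sh call for each line of a
# two-column file (sample name, barcode) without submitting anything.
dry_run_dispatch() {
    xargs -L 1 bash -c 'echo "demultiplex.sh $0 $1"' < "$1"
}

# Demo using the first two sample lines from barcodes.txt.
tmp=$(mktemp)
printf 'samplenameA GGACTCCT+AGAGGATA\nsamplenameB TAGGCATG+AGAGGATA\n' > "$tmp"
dry_run_dispatch "$tmp"
# prints:
#   demultiplex.sh samplenameA GGACTCCT+AGAGGATA
#   demultiplex.sh samplenameB TAGGCATG+AGAGGATA
rm -f "$tmp"
```

Checking the printed commands first is a cheap way to confirm the argument order before submitting one job per sample.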

This will save you a lot of time compared with trying existing tools.
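One caveat with `grep -A3`: it matches the barcode on any line, not just the read name. With dual-index barcodes containing `+` this is unlikely to matter, but a single-index barcode could also occur inside a sequence line and pull in misaligned records. A sketch of an alternative that only inspects header lines, assuming strict 4-line FASTQ records; this swaps in awk for grep and is not the script from the tip, and the function name `demux_by_header` and the example reads are made up:

```shell
#!/bin/bash
# Hypothetical helper: print the FASTQ records whose header line contains
# the barcode (case-insensitive, like grep -i). Assumes 4 lines per record.
demux_by_header() {
    local barcode="$1" fastq="$2"
    # NR % 4 == 1 selects the header line of each record. index() is a
    # literal substring search, so "+" in a barcode needs no escaping.
    awk -v bc="$barcode" '
        NR % 4 == 1 { keep = (index(toupper($0), toupper(bc)) > 0) }
        keep
    ' "$fastq"
}

# Demo: read2 carries the barcode only in its sequence, so it is skipped.
tmp=$(mktemp)
printf '%s\n' \
    '@read1 1:N:0:GGACTCCT' 'ACGT' '+' 'IIII' \
    '@read2 1:N:0:TAGGCATG' 'GGACTCCT' '+' 'IIII' > "$tmp"
demux_by_header 'GGACTCCT' "$tmp"
rm -f "$tmp"
```

The output can be piped to `gzip` exactly as in the grep version. This sketch has not been benchmarked against grep on large files.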

elimoss commented 5 years ago

It would be extremely useful to incorporate this into the workflow in some automated fashion.

bfremin commented 5 years ago

Yeah I can try something. It is only 2 commands though.

elimoss commented 5 years ago

If you feel like tackling this, by all means do it and submit a pull request. It'll need the dependency taken care of with either conda or a container, and the new input will have to be integrated into the config, workflow, and docs.