Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International

A rule for FASTQ merging #19

Closed · pwl closed this 6 years ago

pwl commented 6 years ago

The data I'm analyzing comes in the form of several fastq files, like so:

sample1_S11_L001_R1_001.fastq.gz
sample1_S11_L001_R2_001.fastq.gz
sample1_S11_L002_R1_001.fastq.gz
sample1_S11_L002_R2_001.fastq.gz
sample1_S11_L003_R1_001.fastq.gz
sample1_S11_L003_R2_001.fastq.gz
sample1_S11_L004_R1_001.fastq.gz
sample1_S11_L004_R2_001.fastq.gz

The current version of dropSeqPipe only allows one fastq file per sample per read, so I'm basically merging all the files by hand into sample1_R{1,2}.fastq.gz before running your pipeline. Do you think it would make sense to add a rule for merging several fastq files with the same root name, like sample1 in the case above? I'm fine merging them by hand, but adding a rule would allow the merging to run in parallel with, say, index generation.
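
For what it's worth, here is a minimal sketch of what such a rule might look like. This is a hypothetical Snakefile fragment, not part of dropSeqPipe; the input/output directories and wildcard names are made up and would need adapting:

    # Hypothetical rule: merge all lane-split files for one sample/read into a
    # single gzip. Plain concatenation works because gzip members can be joined
    # as-is (see the discussion further down).
    import glob

    wildcard_constraints:
        read="R[12]"

    rule merge_lanes:
        input:
            # collect every lane file for this sample/read, sorted so lanes are
            # concatenated in a deterministic order
            lambda wc: sorted(glob.glob(
                "input/{s}_*_{r}_001.fastq.gz".format(s=wc.sample, r=wc.read)))
        output:
            "data/{sample}_{read}.fastq.gz"
        shell:
            "cat {input} > {output}"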

seb-mueller commented 6 years ago

This looks like NextSeq-generated data. In my opinion it is best to pool those independently, following the Unix philosophy (do one thing and do it well). But that's just my opinion.

As for your example: if you have control over the sequencing process, you can do this pooling automatically in the BCL conversion step using bcl2fastq:

bcl2fastq --no-lane-splitting
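
For reference, a fuller invocation might look something like this (the run folder, output directory, and sample sheet paths are placeholders):

    # re-run the BCL conversion so each sample yields a single R1/R2 pair
    # instead of one pair per lane
    bcl2fastq --runfolder-dir /path/to/runfolder \
              --output-dir fastq/ \
              --sample-sheet /path/to/SampleSheet.csv \
              --no-lane-splitting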

If you don't have control over that step, I've written a small script that might help you with this.

pooling_nextseq500_multisample.sh PATH

pwl commented 6 years ago

I'm already merging the files like so:

        # decompress every lane file for this sample and read, then recompress
        # into a single fastq.gz
        find /input -name "${sample}_*${r}*.fastq.gz" \
            | sort \
            | xargs zcat \
            | gzip --fast \
                   > "$data/${sample}_${r}.fastq.gz"

and no, I've got no control over how the reads are pooled.

I agree with you; perhaps that sort of pooling is best left to the user, as there are many possible naming conventions and ways to merge this data. Thanks again for the comment!

pwl commented 6 years ago

By the way, I just stumbled onto this post: https://stackoverflow.com/a/26739957. It looks like you can simply concatenate gzip files, which is way faster than unpacking, concatenating, and recompressing them. So now I simply do:

        # concatenate the gzip members directly; no decompression or
        # recompression needed
        find /input \
             -type f \
             -name "${sample}_*${r}*.fastq.gz" \
            | sort \
            | xargs cat \
                    > "$data/${sample}_${r}.fastq.gz"

It's orders of magnitude faster, and basically purely IO bound.

EDIT: I forgot the sort and added brackets around sample.
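
As a quick sanity check that concatenated gzip members really do decompress as a single stream:

    # two independent gzip members...
    printf 'A\n' | gzip > a.gz
    printf 'B\n' | gzip > b.gz
    # ...concatenated and read back as one stream: prints A then B
    cat a.gz b.gz | zcat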

seb-mueller commented 6 years ago

I think this has been discussed a lot, and just concatenating would indeed save a lot of effort. However, I remember seeing reports of some programs not coping with this properly; this is all I could pull out quickly. Maybe that has changed; I'd love to see an actual test showing it's sufficient (which may well be the case). I guess keeping an eye on read counts should be enough anyway.
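
A simple way to do that check, using the file layout and variables from the snippets above (FASTQ stores four lines per read):

    # compare read counts in the merged file vs. the lane files it came from;
    # the two totals should match
    merged=$(zcat "$data/${sample}_${r}.fastq.gz" | wc -l)
    lanes=$(zcat /input/"${sample}"_*_"${r}"_001.fastq.gz | wc -l)
    echo "merged: $((merged / 4)) reads, lanes: $((lanes / 4)) reads"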

Hoohm commented 6 years ago

I was gonna say the same thing: just cat the files, but be careful about the order!

pwl commented 6 years ago

I'm happy to report that dropSeqPipe works just fine with concatenated gzip files.