pwl closed this issue 6 years ago.
This looks like NextSeq-generated data. In my opinion it is best to pool those independently, following the Unix philosophy (do one thing and do it well). But that's just my opinion.
As for your example: if you have control over the sequencing process, you can do this automatically in the bcl conversion step using bcl2fastq:
bcl2fastq --no-lane-splitting
If you don't, I've written a small script that might help you with this.
pooling_nextseq500_multisample.sh PATH
I'm already merging the files like so
find /input -name "${sample}_*${r}*.fastq.gz" \
| sort \
| xargs zcat \
| gzip --fast \
> "$data/${sample}_${r}.fastq.gz"
and no, I've got no control over how the reads are pooled.
I agree with you, perhaps that sort of pooling is best left to the user, as there are many possible naming conventions and ways to merge this data. Thanks again for the comment!
By the way, I just stumbled upon this post: https://stackoverflow.com/a/26739957. It looks like you can simply concatenate gzip files, which is way faster than unpacking, concatenating and recompressing them. So now I simply do
find /input \
-type f \
-name "${sample}_*${r}*.fastq.gz" \
| sort | xargs cat \
> "$data/${sample}_${r}.fastq.gz"
It's orders of magnitude faster, and basically purely IO-bound.
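To illustrate why plain cat works here: the gzip format allows multiple compressed members per file, and decompressors read them back-to-back. A minimal sketch (file names and read contents are made up for illustration):

```shell
# Concatenating .gz files yields a valid .gz file whose decompressed
# content is the concatenation of the parts (gzip supports multiple
# members per file). File names below are hypothetical examples.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$tmp/part1.fastq.gz"
printf '@read2\nTTTT\n+\nIIII\n' | gzip > "$tmp/part2.fastq.gz"
cat "$tmp/part1.fastq.gz" "$tmp/part2.fastq.gz" > "$tmp/merged.fastq.gz"
zcat "$tmp/merged.fastq.gz"   # prints both records, in order
rm -r "$tmp"
```

No recompression happens at any point, which is why the merge is limited only by disk throughput.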
EDIT: I forgot the sort, and added brackets around sample.
I think this has been discussed a lot, and just concatenating would indeed save a lot of effort. However, I remember having seen reports that not all programs cope with this properly, though that is all I could pull out quickly. Maybe this has changed; I'd love to see an actual test proving this is sufficient (which might well be true). I guess keeping an eye on read counts should be sufficient anyway.
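One quick way to keep that eye on read counts: compare the number of FASTQ records (4 lines each) in the parts against the merged file. A sketch with hypothetical file names following the NextSeq lane convention:

```shell
# Sanity-check sketch: after merging, the merged file should contain
# exactly as many FASTQ records (4 lines each) as the per-lane parts
# combined. File names and contents are made-up examples.
tmp=$(mktemp -d)
printf '@r1\nA\n+\nI\n' | gzip > "$tmp/sample1_L001_R1.fastq.gz"
printf '@r2\nC\n+\nI\n' | gzip > "$tmp/sample1_L002_R1.fastq.gz"
cat "$tmp"/sample1_L00?_R1.fastq.gz > "$tmp/sample1_R1.fastq.gz"
parts=$(( $(zcat "$tmp"/sample1_L00?_R1.fastq.gz | wc -l) / 4 ))
merged=$(( $(zcat "$tmp/sample1_R1.fastq.gz" | wc -l) / 4 ))
echo "parts: $parts reads, merged: $merged reads"
rm -r "$tmp"
```

If the two numbers disagree, something went wrong with the merge or a downstream tool stopped at the first gzip member.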
I was gonna say the same thing, just cat the files, although be careful of the order!
I'm happy to report that dropSeqPipe works just fine with concatenated gzip files.
The data I'm analyzing comes in the form of several fastq files, like so
The current version of dropSeqPipe only allows for one fastq file per sample per read, so I'm basically merging all the files by hand into sample1_R{1,2}.fastq.gz before running your pipeline. Do you think it would make sense to add a rule for merging several fastq files with the same root name, like sample1 in the above case? I'm fine merging them by hand, but adding a rule would allow the merging to run in parallel to, say, index generation.
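If it helps, such a merge step could be sketched as a Snakemake rule along these lines (the rule name, wildcard names, and directory layout here are made up for illustration, not dropSeqPipe's actual configuration):

```
import glob

rule merge_fastq:
    # Collect all per-lane parts for one sample/read and concatenate
    # them; sorted() keeps the lane order deterministic.
    input:
        lambda wc: sorted(
            glob.glob(f"input/{wc.sample}_*{wc.read}*.fastq.gz"))
    output:
        "data/{sample}_{read}.fastq.gz"
    shell:
        "cat {input} > {output}"
```

Because the rule has no dependency on the index files, the scheduler would be free to run it in parallel with index generation.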