LieberInstitute / SPEAQeasy

SPEAQeasy: portable LIBD RNA-seq pipeline using Nextflow. Check http://research.libd.org/SPEAQeasy-example/ for an example on how to use this pipeline and analyze the resulting output files.
http://lieberinstitute.github.io/SPEAQeasy
MIT License

improve storage usage, minimize duplication of FASTQ data #109

Open gpertea opened 11 months ago

gpertea commented 11 months ago

Just had a local run of SPEAQeasy on a 30-sample dataset where the compressed raw data (fastq.gz) total about 175 GB. Running on a fast SSD with about 1.8 TB of available storage, the SSD filled up quickly and the pipeline aborted after running out of space. This seems unreasonable.

The main space hog seems to be the uncompressed FASTQ files used internally, in the working directories. This could and should be avoided: most (all?) programs in the pipeline can take fastq.gz as input, and where they cannot, the FASTQ can be decompressed on the fly.
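
To illustrate what "on the fly" means here, a minimal Python sketch that streams records straight out of a fastq.gz file without ever writing an uncompressed copy to disk. This is not code from SPEAQeasy; the function names `read_fastq` and `count_reads` are hypothetical, and it assumes simple four-line FASTQ records.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a plain or gzipped FASTQ file.

    Hypothetical helper: assumes each record is exactly four lines.
    """
    # gzip.open in text mode decompresses as it reads, so no
    # uncompressed copy of the file is ever stored on disk.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # empty string means end of file
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality


def count_reads(path: str) -> int:
    """Count records while holding only one record in memory at a time."""
    return sum(1 for _ in read_fastq(path))
```

Calling `count_reads("sample_1.fastq.gz")` (a made-up file name) reads the compressed file directly. For command-line tools that only accept plain FASTQ, the same idea applies by piping the decompressed stream into the tool instead of staging an uncompressed file in the working directory.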

gpertea commented 11 months ago

A few notes on this issue:

gpertea commented 10 months ago

Quick update: I implemented most of the above in the gjhpce_run branch, but that branch also has some other dirty patches for local SPEAQeasy execution etc. I will try to make a pull request against the main branch later, after some more testing.