improve storage usage, minimize duplication of FASTQ data

gpertea commented 11 months ago

I just ran SPEAQeasy locally on a 30-sample dataset whose compressed raw data (fastq.gz) total about 175 GB. The run used a fast SSD with about 1.8 TB of available storage, yet the SSD filled up quickly and the pipeline aborted after running out of space. Needing roughly ten times the compressed input size in working storage seems unreasonable.

The main space hog seems to be the uncompressed FASTQ files used internally, in the working directories. This could and should be avoided: most (all?) programs in the pipeline accept fastq.gz as input, and where they do not, the FASTQ can be decompressed on the fly.
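To illustrate what "on the fly" means here, below is a minimal Python sketch (not taken from SPEAQeasy) that streams a fastq.gz file through the standard `gzip` module without ever writing an uncompressed copy to disk. The function name `count_fastq_reads` is hypothetical, and the sketch assumes the usual four-lines-per-record FASTQ layout.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count FASTQ records, decompressing .gz input as a stream."""
    path = Path(path)
    # gzip.open decompresses while reading; nothing is written to disk.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    # Assumption: each record spans exactly 4 lines (no wrapped sequences).
    return n_lines // 4
```

The same idea applies to any step that only needs to read the data once: the compressed file stays the single copy on disk.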

gpertea commented 11 months ago

A few notes on this issue:

- merging the compressed FASTQ data per sample, from multiple lanes/flowcells, is not needed for the tools in the pipeline: hisat2, star, kallisto, salmon etc. can take the per-lane files and combine them on the fly.
- it might be a good idea to keep fastqc running on the original unmerged FASTQ files, as the lanes may have different error profiles etc.
- we could also run the hisat2/star aligners on the original unmerged FASTQ files for the same reason, so that we can collect QC metrics and expression counts per lane/flowcell (e.g. with featureCounts at gene level). samtools merge can then be used on the fly whenever a downstream tool needs a single merged stream of sorted alignment records, or just to save a merged BAM/CRAM file once the per-lane QC has been collected. RG tags should be assigned to keep track of the original FASTQ files being merged, and the per-lane BAM files can be deleted after the merge.
- the final alignment data can be further compressed into CRAM format, now fully supported by samtools; CRAM can be down to 1/3 of the BAM file size (when unaligned reads are also stored, which we should do).
- trimming is the only step where an additional (trimmed) copy of the original sequence data may be created, but those trimmed FASTQ files should be kept compressed. Users may often choose to skip trimming completely (since all the aligners involved allow soft-clipping), which avoids any duplication of the FASTQ data.
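As a rough sketch of the first note, the snippet below builds a hisat2 command line that passes all lanes of a sample as comma-separated lists to `-1` and `-2`, so no merged FASTQ copy is needed. This is an illustration only, not the code in SPEAQeasy: the helper name `hisat2_lane_command`, the index prefix and the file names are all hypothetical, and it assumes paired-end reads with paths that contain no commas.

```python
def hisat2_lane_command(index_prefix, r1_lanes, r2_lanes, out_sam):
    """Build a hisat2 argument list that reads all lanes without merging them."""
    if not r1_lanes or len(r1_lanes) != len(r2_lanes):
        raise ValueError("need the same, non-zero number of R1 and R2 lane files")
    if any("," in p for p in list(r1_lanes) + list(r2_lanes)):
        raise ValueError("lane paths must not contain commas")
    return [
        "hisat2",
        "-x", index_prefix,
        # hisat2 takes comma-separated file lists for each mate.
        "-1", ",".join(r1_lanes),
        "-2", ",".join(r2_lanes),
        "-S", out_sam,
    ]


# Hypothetical sample with two lanes; the files stay compressed on disk.
cmd = hisat2_lane_command(
    "genome_index",
    ["s1_L001_R1.fastq.gz", "s1_L002_R1.fastq.gz"],
    ["s1_L001_R2.fastq.gz", "s1_L002_R2.fastq.gz"],
    "s1.sam",
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)`. Per-lane alignment, as suggested in the third note, would instead call the helper once per lane with single-element lists.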

gpertea commented 10 months ago

Quick update: I implemented most of the above in the gjhpce_run branch, but that branch also carries some other dirty patches for local SPEAQeasy execution etc. I will try to make a pull request for the main branch later, after some more testing.

LieberInstitute / SPEAQeasy, issue #109