Open gpertea opened 11 months ago
A few notes on this issue:
`fastqc` running on the original unmerged FASTQ, as the multiple lanes may have different error profiles etc.
`samtools merge` can then be used on the fly if we ever need a single merged stream of sorted alignment records for any other downstream tools, or just to save into a merged BAM/CRAM file (after the per-lane QC has been collected), with `RG` tags assigned to keep track of the original FASTQ files being merged (and then deleting the per-lane BAM files).

Quick update: implemented most of the above in the `gjhpce_run` branch, but that branch has some other dirty patches for local SPEAQeasy execution etc. Will try to make a pull request for the main branch later after some more testing.
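
For reference, the `samtools merge` idea from the notes above could look roughly like the sketch below. This is not what the `gjhpce_run` branch does; the function name `merge_command` and the lane file names are hypothetical. It relies on the `-r` option of `samtools merge`, which attaches an RG tag to each record with the value inferred from the input file name, and on `-` as the output name to write the merged stream to stdout. The per-lane BAM files are assumed to be sorted already.

```python
def merge_command(lane_bams, out_bam="-"):
    """Build a `samtools merge` argument list for per-lane BAM files.

    -r     attach an RG tag to each record, inferred from the input file name
    out_bam  "-" sends the merged stream to stdout (for piping into a
             downstream tool); a file name saves a merged BAM instead
    """
    lane_bams = list(lane_bams)
    if len(lane_bams) < 2:
        raise ValueError("need at least two per-lane BAM files to merge")
    return ["samtools", "merge", "-r", out_bam] + lane_bams


if __name__ == "__main__":
    # Hypothetical per-lane files for one sample.
    cmd = merge_command(["sampleA_L001.bam", "sampleA_L002.bam"])
    print(" ".join(cmd))
    # To actually run it (requires samtools on PATH):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the command as a list keeps the per-lane QC step independent: the lane BAM files only need to exist until the merge has finished.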
Just had a run of SPEAQeasy (locally) on a 30-sample dataset where the compressed raw data (fastq.gz) is about 175 GB in total. The run was on a fast SSD with about 1.8 TB of available storage, yet the SSD filled up quickly and the pipeline aborted after running out of space. This seems unreasonable.
It seems the main space hog is the use of uncompressed FASTQ files internally, in the working directories. This could and should be avoided, as most (all?) programs in the pipeline can take fastq.gz as input; alternatively, FASTQ decompression can be performed on the fly where needed.
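
To illustrate the on-the-fly decompression point, here is a minimal sketch that reads a fastq.gz as a stream, so no uncompressed copy is ever written to the working directory. The function name `count_reads` and the demo file are hypothetical, and the sketch assumes the usual 4-lines-per-record FASTQ layout.

```python
import gzip
import os
import tempfile


def count_reads(path):
    """Count FASTQ records, decompressing .gz input as it is read.

    Assumes 4 lines per record (no multi-line sequences).
    """
    opener = gzip.open if path.endswith(".gz") else open
    n_lines = 0
    with opener(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    return n_lines // 4


if __name__ == "__main__":
    # Write a tiny compressed FASTQ to a temporary directory and count it.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.fastq.gz")
        with gzip.open(demo, "wt") as fh:
            fh.write("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
        print(count_reads(demo))
```

The same pattern applies to any pipeline step that only needs a single pass over the reads: the decompressed data exists only in memory, a buffer at a time.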