UPHL-BioNGS / Cecret

Reference-based consensus creation
MIT License
49 stars 26 forks

Don't save sam files #119

Closed erinyoung closed 1 year ago

erinyoung commented 2 years ago

They're very large (especially for MPX)

fanninpm commented 2 years ago

It may be a good idea to compress the SAM output from bwa and minimap2 using gzip. For example,

# for short reads
bwa mem -t !{task.cpus} !{reference_genome} !{reads} 2>> $err_file | gzip -4 > bwa/!{sample}.sam.gz
# or, for long reads
minimap2 !{params.minimap2_options} -ax sr -t !{task.cpus} !{reference_genome} !{reads} 2>> $err_file | gzip -4 > aligned/!{sample}.sam.gz

As a comparison, the Mad River workflow uses BBMap, and I've noticed that its default gzip compression setting is 4 (which is why I used gzip -4 in the commands above). Running gzip -l on some freshly mapped hMPXV .sam.gz files shows a space savings of just under 75%.
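
For anyone who wants to check the savings on their own data, here is a minimal sketch of how the gzip -l measurement could be reproduced. The file it compresses is a synthetic stand-in (the read names, the reference name "ref", and the record contents are all made up for illustration), so the ratio it prints says nothing about real hMPXV alignments; point it at an actual .sam file to get a meaningful number.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for aligner output: a short header plus repetitive
# SAM-like records. Real SAM files will compress differently.
workdir=$(mktemp -d)
sam="$workdir/example.sam"

printf '@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:ref\tLN:200000\n' > "$sam"
for i in $(seq 1 2000); do
    printf 'read%d\t0\tref\t%d\t60\t20M\t*\t0\t0\tACGTACGTACGTACGTACGT\tIIIIIIIIIIIIIIIIIIII\n' "$i" "$i" >> "$sam"
done

# Compress at level 4; -c writes to stdout so the original file is kept.
gzip -4 -c "$sam" > "$sam.gz"

# gzip -l lists compressed size, uncompressed size, ratio, and file name.
gzip -l "$sam.gz"

# The same figure computed by hand from the byte counts.
raw_bytes=$(wc -c < "$sam")
gz_bytes=$(wc -c < "$sam.gz")
savings=$(( 100 * (raw_bytes - gz_bytes) / raw_bytes ))
echo "space savings: ${savings}%"
```

The hand-computed percentage is only there as a cross-check on the ratio column that gzip -l reports.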

erinyoung commented 2 years ago

All the sam files are converted to bam files in the next process, so I'm not sure there's utility in keeping both the bam and sam files.

fanninpm commented 2 years ago

Ah, I wasn't exactly aware of that. (No wonder Cecret's output directories take up so much space!) Mad River doesn't publish any of the raw SAM output. Even so, compressing the raw SAM output will, at the very least, save scratch space.