broadinstitute / viral-ngs

Viral genomics analysis pipelines

use *.bam intermediary files rather than *.sam #763

Open tomkinsc opened 6 years ago

tomkinsc commented 6 years ago

In a few places we use .sam intermediary files where we could use .bam files instead. The latter take a bit more CPU time to write, with the advantage of being compressed (SAM is plain text), so they use far less disk space. One such instance is here: https://github.com/broadinstitute/viral-ngs/blob/master/tools/bwa.py#L228

We should consider switching these occurrences across the codebase to .bam by piping the SAM output to samtools view with the -b flag.
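To make the proposal concrete, here is a minimal sketch of piping an aligner's SAM output straight into `samtools view -b`, so no .sam file is written to disk. It is not the code in tools/bwa.py; the helper names `bam_view_cmd` and `run_piped` and the file names in the usage comment are hypothetical, and it assumes `bwa` and `samtools` are on the PATH.

```python
import subprocess


def bam_view_cmd(out_bam, samtools="samtools"):
    """Build a `samtools view` command that reads SAM on stdin and writes BAM.

    Hypothetical helper for illustration; `-` tells samtools to read stdin.
    """
    return [samtools, "view", "-b", "-o", out_bam, "-"]


def run_piped(producer_cmd, consumer_cmd):
    """Run `producer_cmd | consumer_cmd` without a shell or an on-disk intermediate."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    try:
        consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    except Exception:
        # The consumer could not be started, so stop the producer as well.
        producer.kill()
        producer.wait()
        raise
    finally:
        # Close our copy of the pipe so the producer gets SIGPIPE
        # if the consumer exits early.
        producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    for cmd, rc in ((producer_cmd, producer_rc), (consumer_cmd, consumer_rc)):
        if rc != 0:
            raise subprocess.CalledProcessError(rc, cmd)


# Hypothetical usage; the file names are placeholders:
# run_piped(["bwa", "mem", "ref.fasta", "reads.fastq"], bam_view_cmd("aligned.bam"))
```

Keeping the command construction separate from the pipe plumbing means the same `run_piped` helper could be reused wherever a tool currently writes a temporary .sam file.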

yesimon commented 5 years ago

At this stage we can probably use .cram files.

dpark01 commented 5 years ago

I think this issue was mostly about the ephemeral intermediates, not anything we present to the outside world. In that regard, the only reason to use bam over sam is so that we don't always require a VM instance to have a ton of local disk space for handling large sets of reads (say, from big sequencers). But going for cram is probably CPU overkill for a file that we're just going to delete anyway. In fact, for this particular issue, I was thinking that we should just use the -1 compression level flag on samtools, which optimizes for speed (you really don't want this step to be the bottleneck) while still reducing unnecessary waste on the local temp disk.
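The speed-versus-size trade-off behind the `-1` suggestion can be illustrated without samtools. BAM stores its data in deflate-compressed (BGZF) blocks, so the sketch below uses Python's `zlib` as a stand-in and compares compression levels on synthetic SAM-like text. This is an assumption-laden illustration, not a benchmark of samtools itself: the records are randomly generated, the function names are hypothetical, and real reads will compress differently.

```python
import random
import time
import zlib


def make_sam_like_text(n_reads, read_len=100, seed=0):
    """Generate synthetic, SAM-like tab-separated records (not real alignments)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHIJ") for _ in range(read_len))
        fields = [f"read{i}", "0", "ref", str(rng.randint(1, 10000)), "60",
                  f"{read_len}M", "*", "0", "0", seq, qual]
        lines.append("\t".join(fields))
    return ("\n".join(lines) + "\n").encode("ascii")


def compare_levels(data, levels=(1, 6, 9)):
    """Return {level: (compressed_size_in_bytes, seconds_taken)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        results[level] = (len(compressed), time.perf_counter() - start)
    return results


if __name__ == "__main__":
    data = make_sam_like_text(20000)
    print(f"raw: {len(data)} bytes")
    for level, (size, seconds) in compare_levels(data).items():
        print(f"level {level}: {size} bytes in {seconds:.3f}s")
```

For a temp file that is deleted shortly afterwards, the fastest level is usually the sensible choice, since most of the disk savings come from compressing at all rather than from compressing harder.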