Open tomkinsc opened 6 years ago
At this stage we can probably use .cram files.
I think this issue was mostly about the ephemeral intermediates, not anything we present to the outside world. In that regard, the only reason to use bam over sam is so that we don't always require a VM instance to have a ton of local disk space for handling large sets of reads (say, from big sequencers). But going for cram is probably CPU overkill on a file that we're just going to delete anyway. In fact, for this particular issue, I was thinking that we should just be using the -1 compression level flag on samtools, which optimizes for speed (you really don't want this part to be the bottleneck) while still cutting down on wasted space on the local temp disk.
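To make the -1 suggestion concrete, here is a minimal sketch of how the samtools command line could be assembled for a throwaway intermediate. The helper name intermediate_bam_cmd and its parameters are hypothetical and not part of viral-ngs; it assumes a samtools version whose view subcommand accepts -1 for fast BAM compression.

```python
def intermediate_bam_cmd(out_bam, fast=True, threads=1):
    """Build a `samtools view` command that reads SAM on stdin and writes BAM.

    Hypothetical helper for illustration only. With fast=True the -1 flag is
    added, which asks samtools for its fastest BAM compression level.
    """
    cmd = ["samtools", "view", "-b"]   # -b: emit BAM instead of SAM
    if fast:
        cmd.append("-1")               # favor speed over compression ratio
    if threads > 1:
        cmd += ["-@", str(threads)]    # extra compression threads
    cmd += ["-o", out_bam, "-"]        # "-" means read input from stdin
    return cmd


if __name__ == "__main__":
    print(" ".join(intermediate_bam_cmd("tmp_reads.bam")))
```

The idea is that intermediates we delete anyway get fast=True, while anything we keep can fall back to the default compression level.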
In a few places we use .sam intermediary files where we could use .bam files. The latter take a bit more IO/CPU time with the advantage of a better compression ratio. One such instance is here: https://github.com/broadinstitute/viral-ngs/blob/master/tools/bwa.py#L228

We should consider switching these occurrences across the codebase to use .bam by piping to samtools with the -b flag.
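As a rough sketch of the piping approach, the aligner's SAM output can be streamed straight into samtools so that no .sam file ever lands on disk. This is not the existing code in tools/bwa.py; the function names, the choice of bwa mem, and the argument layout are assumptions made for illustration.

```python
import subprocess


def bwa_to_bam_cmds(ref_fasta, reads_fastq, out_bam, threads=1):
    """Return the two command lines for `bwa mem ... | samtools view -b ...`.

    Hypothetical helper: the arguments are illustrative placeholders.
    """
    bwa_cmd = ["bwa", "mem", "-t", str(threads), ref_fasta, reads_fastq]
    # -b writes BAM; the trailing "-" tells samtools to read SAM from stdin.
    samtools_cmd = ["samtools", "view", "-b", "-o", out_bam, "-"]
    return bwa_cmd, samtools_cmd


def run_piped(producer_cmd, consumer_cmd):
    """Run producer | consumer without writing an intermediate file."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so the producer gets SIGPIPE if the
    # consumer exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            "pipeline failed: producer=%d consumer=%d" % (producer_rc, consumer_rc)
        )
```

Calling run_piped(*bwa_to_bam_cmds("ref.fasta", "reads.fastq", "out.bam")) would need both bwa and samtools on the PATH. Checking both exit codes matters here, since a failed aligner can otherwise leave behind a truncated but plausible-looking BAM.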