bespin-workflows / exomeseq-gatk4

Whole Exome Sequencing in CWL using GATK4
MIT License

Use samtools multithreading capability #17

Closed: johnbradley closed this issue 4 years ago

johnbradley commented 4 years ago

User Request:

Samtools is multithreaded, so we should be able to speed up converting to a BAM file and sorting.

We use samtools view in the gitc-bwa-mem-samtools tool: https://github.com/bespin-workflows/exomeseq-gatk4/blob/44d83f93cd8fa7cfdcf2ddfa6435239a4eeb4c27/tools/gitc-bwa-mem-samtools.cwl#L11-L15

Documentation for samtools mentions a flag to use more threads for compression: https://www.htslib.org/doc/samtools-view.html

-@ INT Number of BAM compression threads to use in addition to main thread [0].

dleehr commented 4 years ago

We could add additional BAM compression threads to the bwa | samtools script above, but as this tool and script are already optimized to maximize the number of threads allocated to bwa, I'd like to see some evidence that re-allocating some of them from bwa to samtools (in this specific tool/script) would have a benefit. Is the compressor not able to keep up with the aligner? That should be easy to test. The mapping step certainly takes a significant (though highly variable depending on platform) amount of time in the workflow.

Secondarily, the above script would get a little more complicated (e.g. either hard-code a universal value for samtools compression threads or shift an argument out). That's not a strong reason not to do it, but might not be a great first-time issue.
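To make the "hard-code a universal value" option concrete, here is a minimal sketch of how a thread budget could be split between bwa and samtools. It is not the script from the tool linked above: the function names, the default of 2 compression threads, and the file names are all hypothetical, and the bwa arguments are reduced to the bare minimum. Only `bwa mem -t` and `samtools view -b -@ -o` are real flags.

```python
import shlex


def split_threads(total_threads, samtools_threads=2):
    """Split a thread budget between bwa and samtools (hypothetical helper).

    Returns (bwa_threads, samtools_extra_threads). The samtools value is what
    would be passed to -@, i.e. threads in addition to its main thread.
    """
    if total_threads < 1:
        raise ValueError("total_threads must be at least 1")
    # Never take the last thread away from the aligner.
    samtools_threads = max(0, min(samtools_threads, total_threads - 1))
    return total_threads - samtools_threads, samtools_threads


def build_pipeline(total_threads, reference, fastq, out_bam, samtools_threads=2):
    """Build a 'bwa mem | samtools view' command line as a string."""
    bwa_threads, st_threads = split_threads(total_threads, samtools_threads)
    bwa_cmd = ["bwa", "mem", "-t", str(bwa_threads), reference, fastq]
    # "-" tells samtools view to read the SAM stream from stdin.
    samtools_cmd = ["samtools", "view", "-b", "-@", str(st_threads),
                    "-o", out_bam, "-"]
    return " | ".join(shlex.join(cmd) for cmd in (bwa_cmd, samtools_cmd))


if __name__ == "__main__":
    print(build_pipeline(8, "ref.fa", "reads.fq", "out.bam"))
```

Running the same input once with `samtools_threads=0` and once with a small positive value, and comparing wall-clock time, would be one way to check whether the compressor is actually the bottleneck.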

For sorting, we use gatk SortSam and not samtools. We only expect single-threaded performance. Perhaps that could be multithreaded, or swapped with a samtools command that is multithreaded. It looks like the SortSam step takes 15-20 minutes, about 9% of the preprocessing time of a sample.
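As a rough illustration of the "swap in samtools" option, the sketch below builds a multithreaded `samtools sort` command. The helper name, thread count, per-thread memory, and file names are assumptions for illustration and are not taken from the workflow. `-@`, `-m`, and `-o` are documented `samtools sort` options, and its default output order is by coordinate.

```python
import shlex


def samtools_sort_command(in_bam, out_bam, threads=4, mem_per_thread="1G"):
    """Return a coordinate-sort command line (hypothetical helper).

    threads        -> samtools sort -@ (sorting and compression threads)
    mem_per_thread -> samtools sort -m (approximate memory per thread)
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    cmd = ["samtools", "sort", "-@", str(threads), "-m", mem_per_thread,
           "-o", out_bam, in_bam]
    return shlex.join(cmd)


if __name__ == "__main__":
    print(samtools_sort_command("aligned.bam", "sorted.bam"))
```

Whether this is a drop-in replacement for the SortSam step would still need checking against what the downstream GATK steps expect from the sorted file.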

johnbradley commented 4 years ago

I informed the requesting user that we are currently using gatk SortSam instead of samtools. Their response was:

I’m not sure how the speed of it compares to a multithreaded samtools sort. The speed increase is likely going to be marginal relative to the whole pipeline.
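The "marginal" estimate can be sanity-checked with Amdahl's law, using the figure quoted earlier in this thread that SortSam is about 9% of a sample's preprocessing time. The sketch below is only a back-of-the-envelope bound: the 4x sort speedup is an assumed value, not a measurement, and the function name is hypothetical.

```python
def overall_speedup(fraction, step_speedup):
    """Amdahl's law: speedup of the whole when one step gets faster.

    fraction     -- share of total time spent in the step (0..1)
    step_speedup -- how many times faster that step becomes (>= 1)
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if step_speedup < 1.0:
        raise ValueError("step_speedup must be at least 1")
    return 1.0 / ((1.0 - fraction) + fraction / step_speedup)


if __name__ == "__main__":
    sort_fraction = 0.09  # from the thread: sorting is ~9% of preprocessing
    # Assumed 4x faster sort, then the limit of an infinitely fast sort.
    print(round(overall_speedup(sort_fraction, 4.0), 3))
    print(round(overall_speedup(sort_fraction, float("inf")), 3))
```

Even an infinitely fast sort caps the preprocessing speedup at 1 / 0.91, roughly 10%, and the gain relative to the whole pipeline would be smaller still.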

The requesting user said we could close this ticket.