biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Sambamba sort is inefficient compared to samtools. #447

Closed: rhpvorderman closed this issue 4 years ago

rhpvorderman commented 4 years ago

The problem.

On an HPC cluster or in the cloud, a certain number of CPUs and a certain amount of memory must be reserved. On our own HPC (SLURM), the amount of resources requested determines how fast a job gets scheduled (the less, the better). In the cloud, too, the fewer resources you request and the less time you need, the better.

I am currently investigating which tools are best suited for whole-genome sequencing. Whilst sorting a 160GB BAM file, I noticed that sambamba sort was slower than samtools sort.

Versions

Tested sambamba version: 0.7.1
Tested samtools version: 1.9
Installation method: conda

Reproducing

I took a 768MB chunk of the aforementioned BAM file. To mimic the fact that a whole-genome BAM cannot be kept in memory, I tuned the settings of samtools and sambamba and capped the maximum memory at 128MB. For a fair comparison, I also set the number of additional threads to 0.
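As a rough illustration of why the memory cap matters, here is a minimal sketch of a generic external merge sort: records are sorted in chunks that fit the limit, each sorted chunk ("run") is spilled to a temporary file, and the runs are then merged. This is not sambamba's or samtools' actual code. The (reference id, position) keys and the record-count limit `max_in_memory` are hypothetical; the limit is closer in spirit to Picard's MAX_RECORDS_IN_RAM than to the byte-based `-m` option.

```python
import heapq
import os
import pickle
import tempfile
from typing import Iterable, Iterator, List, Tuple

# Hypothetical coordinate-sort key: (reference id, position).
# Real BAM records carry far more data than this.
Key = Tuple[int, int]


def _write_run(sorted_chunk: List[Key], tmpdir: str) -> str:
    """Spill one sorted chunk (a "run") to a temporary file; return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".run")
    with os.fdopen(fd, "wb") as fh:
        for key in sorted_chunk:
            pickle.dump(key, fh)
    return path


def _read_run(path: str) -> Iterator[Key]:
    """Stream keys back from a run file one at a time."""
    with open(path, "rb") as fh:
        while True:
            try:
                yield pickle.load(fh)
            except EOFError:
                return


def external_sort(keys: Iterable[Key], max_in_memory: int) -> Iterator[Key]:
    """Sort keys while holding at most max_in_memory of them in a chunk."""
    with tempfile.TemporaryDirectory() as tmpdir:
        runs: List[str] = []
        chunk: List[Key] = []
        for key in keys:
            chunk.append(key)
            if len(chunk) >= max_in_memory:
                # Chunk is full: sort it in memory and spill it to disk.
                chunk.sort()
                runs.append(_write_run(chunk, tmpdir))
                chunk = []
        if chunk:
            chunk.sort()
            runs.append(_write_run(chunk, tmpdir))
        # k-way merge of all runs; only one key per run is held at a time.
        yield from heapq.merge(*(_read_run(path) for path in runs))
```

The smaller the limit, the more runs are written and merged, which is why a low memory cap puts pressure on a tool's temporary-file and merge handling.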

Benchmarking was performed with hyperfine. Two warmup runs were performed to cache as much I/O as possible, for a fair comparison. Testing was performed on a Ryzen 5 3600 system with 32GB of DDR4-3200 RAM and an NVMe SSD.

$ hyperfine -w 2 -r 5 'sambamba sort -t 0 -l 1 -m 128M -o test.bam unsorted.bam'
Benchmark #1: sambamba sort -t 0 -l 1 -m 128M -o test.bam unsorted.bam
  Time (mean ± σ):     84.841 s ±  0.272 s    [User: 83.198 s, System: 1.483 s]
  Range (min … max):   84.529 s … 85.211 s    5 runs
$ hyperfine -w 2 -r 5 'samtools sort -@ 0 -l 1 -m 128M -o test.bam unsorted.bam && samtools index test.bam'
Benchmark #1: samtools sort -@ 0 -l 1 -m 128M -o test.bam unsorted.bam && samtools index test.bam
  Time (mean ± σ):     39.257 s ±  0.358 s    [User: 37.501 s, System: 0.952 s]
  Range (min … max):   38.922 s … 39.864 s    5 runs

I also tested Picard for context and comparison:

$ hyperfine -w 2 -r 5 'picard SortSam O=test.bam I=unsorted.bam MAX_RECORDS_IN_RAM=300000 SORT_ORDER=coordinate CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT'
Benchmark #1: picard SortSam O=test.bam I=unsorted.bam MAX_RECORDS_IN_RAM=300000 SORT_ORDER=coordinate CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT
  Time (mean ± σ):     80.947 s ±  0.358 s    [User: 104.765 s, System: 1.456 s]
  Range (min … max):   80.639 s … 81.482 s    5 runs

Sambamba uses less CPU time (user time) than Picard, but is more than twice as slow as samtools for the same task.
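The single-threaded comparison boils down to a few ratios of the hyperfine means. A minimal sketch that recomputes them, using only the numbers reported above (the dictionary layout and the `ratio` helper are illustrative):

```python
# Mean wall-clock and user CPU times in seconds, copied from the
# single-threaded hyperfine runs above.
single_thread = {
    "sambamba": {"wall": 84.841, "user": 83.198},
    "samtools": {"wall": 39.257, "user": 37.501},  # includes `samtools index`
    "picard": {"wall": 80.947, "user": 104.765},
}


def ratio(a: str, b: str, metric: str) -> float:
    """How many times larger tool a's metric is than tool b's."""
    return single_thread[a][metric] / single_thread[b][metric]


for tool in ("sambamba", "picard"):
    print(
        f"{tool} vs samtools: "
        f"wall x{ratio(tool, 'samtools', 'wall'):.2f}, "
        f"user x{ratio(tool, 'samtools', 'user'):.2f}"
    )
```

With these numbers, sambamba's wall-clock time is about 2.16 times that of samtools, while its user time stays below Picard's.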

But does sambamba make up for this inefficiency with its multithreaded architecture?

$ hyperfine -w 2 -r 5 'sambamba sort -t 3 -l 1 -m 128M -o test.bam unsorted.bam'
Benchmark #1: sambamba sort -t 3 -l 1 -m 128M -o test.bam unsorted.bam
  Time (mean ± σ):     33.669 s ±  0.534 s    [User: 98.762 s, System: 1.904 s]
  Range (min … max):   32.856 s … 34.266 s    5 runs

This uses 3 additional threads, for 4 threads in total. Dividing user time by wall-clock time shows that sambamba keeps about 3 threads fully busy on average. With four threads it is quicker than single-threaded samtools (33.7 s vs. 39.3 s), but not by much.

$ hyperfine -w 2 -r 5 'samtools sort -@ 3 -l 1 -m 128M -o test.bam unsorted.bam && samtools index test.bam'
Benchmark #1: samtools sort -@ 3 -l 1 -m 128M -o test.bam unsorted.bam && samtools index test.bam
  Time (mean ± σ):     15.057 s ±  0.023 s    [User: 39.388 s, System: 1.311 s]
  Range (min … max):   15.032 s … 15.086 s    5 runs

Samtools is much quicker when using the same number of threads (15.1 s vs. 33.7 s). Its user time / wall-clock time ratio is somewhat lower, at only about 2.6 threads utilized on average. So sambamba manages to split up its workload more efficiently than samtools. Unfortunately, that workload is much bigger.
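The thread-utilization argument can be checked directly from the hyperfine means: user CPU time divided by wall-clock time approximates the average number of busy cores, and the single-thread wall time divided by the 4-thread wall time gives the speedup. A minimal sketch using only the numbers reported in this issue (the `Run` class and `busy_threads` property are illustrative names; system time is ignored, as in the text):

```python
from dataclasses import dataclass


@dataclass
class Run:
    wall: float  # mean wall-clock seconds
    user: float  # mean user CPU seconds

    @property
    def busy_threads(self) -> float:
        """Approximate average number of busy cores: user time / wall time."""
        return self.user / self.wall


# (tool, total threads) -> hyperfine means from the runs above.
runs = {
    ("sambamba", 1): Run(wall=84.841, user=83.198),
    ("sambamba", 4): Run(wall=33.669, user=98.762),
    ("samtools", 1): Run(wall=39.257, user=37.501),
    ("samtools", 4): Run(wall=15.057, user=39.388),
}

for tool in ("sambamba", "samtools"):
    one, four = runs[(tool, 1)], runs[(tool, 4)]
    speedup = one.wall / four.wall
    print(
        f"{tool}: {four.busy_threads:.2f} busy threads, "
        f"{speedup:.2f}x speedup over 1 thread"
    )
```

This gives roughly 2.93 busy threads for sambamba and 2.62 for samtools, with speedups over one thread of about 2.52 and 2.61 respectively. Note that the samtools timings also include the `samtools index` step, which was run without `-@`.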

Solution?

I wonder if there is an oversight in the sorting algorithm that can explain this speed difference.

pjotrp commented 4 years ago

The consensus is that sambamba sort is faster on high-memory machines. Otherwise use samtools.

rhpvorderman commented 4 years ago

@pjotrp Thanks for the clarification.

Also, thank you for developing sambamba. We have recently switched from Picard MarkDuplicates to sambamba markdup, and it has a much better wall-clock time.