biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Why does `sambamba sort` give me a different sort order across multiple runs (on the same BAM file)? Isn't `sort` by coordinates deterministic? #490

Closed. etrh closed this issue 2 years ago.

etrh commented 2 years ago

Running sambamba sort on the same BAM file (input.bam) gives me differing results from one run to another. Is this expected behavior? I'm not sure what's causing this, but the order of queries seems to be flipped when I run a diff.

I'm using sambamba 0.8.2 and compare the BAM files by looking at their MD5 hashes via:

samtools view sorted_input.bam | md5sum

The amount of memory (-m) is seemingly playing a role here as well. I ran this 2 times:

sambamba sort -m 3500MB -o input_sorted_with_3500MB.bam input.bam

and the resulting MD5 hashes from both runs were the same.

Then I changed the amount of memory slightly and ran:

sambamba sort -m 3200MB -o input_sorted_with_3200MB.bam input.bam

These two runs with 3200MB memory shared the same MD5 sums. However, those MD5s were different from the prior runs with 3500MB memory.

Strangely enough, I also tried running sambamba sort on the already sambamba-sorted BAM file, and to my surprise the MD5 sums were again different between the sorted and twice-sorted BAM files. (This does not happen when I run samtools sort on a BAM file that has already been sorted by sambamba. That is to say, the MD5 hash remains the same after running samtools sort on an already sambamba-sorted BAM file)

I'm really confused why this is happening. Isn't sambamba sort supposed to be deterministic?

I have confirmed that all of these BAM files with differing MD5 sums are in fact identical when sorted by read name (i.e. with samtools sort -n). The coordinate-sorted ones, however, have swapped rows in them. These swaps seem to happen only within a shared RNAME:POS group, so the order of records (as seen in QNAME and FLAG) varies among records that share the same key (the key here being the RNAME and POS combination).
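
One way to check that two coordinate-sorted files hold the same records and differ only in record order is to hash a canonical ordering in which ties on RNAME and POS are broken by the remaining fields. The following is a minimal sketch, not part of sambamba or samtools: `tie_insensitive_md5` is a hypothetical helper name, and it assumes header-less, tab-separated SAM records on stdin (as printed by samtools view without -h). It orders RNAME lexicographically rather than by the reference order in the BAM header, so it is only meant for comparison.

```shell
# Hypothetical helper: read header-less SAM records on stdin and print an MD5
# that does not depend on the order of records sharing the same RNAME and POS.
tie_insensitive_md5() {
    # Sort by RNAME (field 3), POS (field 4, numeric), then QNAME (field 1)
    # and FLAG (field 2, numeric). Without -s, sort falls back to comparing
    # the whole line, so the output depends only on the set of records.
    LC_ALL=C sort -t "$(printf '\t')" -k3,3 -k4,4n -k1,1 -k2,2n \
        | md5sum \
        | cut -d' ' -f1
}
```

For example, if `samtools view input_sorted_with_3500MB.bam | tie_insensitive_md5` and `samtools view input_sorted_with_3200MB.bam | tie_insensitive_md5` print the same hash, the two files contain the same records and differ only in their order.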

mschilli87 commented 2 years ago

TL;DR: Sorting is probably deterministic, but compressing the result is not necessarily.


If you want to compare the content of the files, you should convert your BAMs to SAMs via sambamba view prior to comparing them. You are likely picking up differences in how the same information is compressed. Without looking at the code, I am sure compression happens in some sort of block fashion via a buffer of a certain size.
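
To illustrate the general point that identical content can produce different compressed bytes: BAM files are BGZF-compressed, which is a gzip-compatible block format, so plain gzip can stand in for it here. This is only a sketch using generic tools, not sambamba's actual compression path; `content_md5` is a hypothetical helper name and the input is made-up data from seq.

```shell
# Hypothetical helper: hash the decompressed content of a gzip stream on stdin.
content_md5() {
    gzip -dc | md5sum | cut -d' ' -f1
}

# Compress the same made-up content with two different compression levels.
tmp=$(mktemp -d)
seq 1 20000 | gzip -n -1 > "$tmp/fast.gz"
seq 1 20000 | gzip -n -9 > "$tmp/best.gz"

# Hashes of the compressed files (these normally differ between levels).
md5sum "$tmp/fast.gz" "$tmp/best.gz"

# Hashes of the decompressed content (these are the same).
content_md5 < "$tmp/fast.gz"
content_md5 < "$tmp/best.gz"

rm -rf "$tmp"
```

Hashing the output of samtools view or sambamba view, rather than the BAM file itself, compares content in the same way.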

etrh commented 2 years ago

@mschilli87 I just converted two of the BAM files in question into SAM, and the resulting MD5 sums (md5sum converted_input.sam) were exactly the same as those from samtools view input.bam | md5sum. In other words, the BAM-to-SAM converted file and the output of samtools view input.bam are identical.