biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

depth region -m: meanCoverage influenced by preceding BED target #290

Open melferink opened 7 years ago

melferink commented 7 years ago

Hi Artem,

I think I've found a bug in sambamba depth (region). It seems that the meanCoverage output for a target in a BED file is influenced by the location of the surrounding targets (in this example, the target before the target of interest). Below are the results for 3 BED files, each with a different location for the first target. The second target has three different values for meanCoverage even though its region remained the same (1:138429-139409).

BED file 1:
chrom  chromStart  chromEnd  readCount  meanCoverage
1      138424      138428    34         34
1      138429      139409    716        106.022
1      139610      139800    132        52.4

BED file 2:
chrom  chromStart  chromEnd  readCount  meanCoverage
1      138324      138328    21         20.75
1      138429      139409    716        106.503
1      139610      139800    132        52.4

BED file 3:
chrom  chromStart  chromEnd  readCount  meanCoverage
1      138224      138228    8          7.5
1      138429      139409    716        106.154
1      139610      139800    132        52.4
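To make the discrepancy concrete, here is a small sketch that compares the meanCoverage reported for the unchanged target across the three runs. The values are copied from the output above. The variable names are only illustrative, and the target length assumes the usual half-open BED convention.

```python
# meanCoverage reported for target 1:138429-139409 in each run,
# keyed by the chromStart of the preceding target (copied from the output above).
reported = {138424: 106.022, 138324: 106.503, 138224: 106.154}

# Assumption: BED intervals are half-open, so length = chromEnd - chromStart.
target_len = 139409 - 138429

for prev_start, mean_cov in reported.items():
    # Implied sum of per-base depths over the target = mean * length.
    implied_total = round(mean_cov * target_len)
    print(f"preceding target at {prev_start}: mean={mean_cov}, implied total={implied_total}")

# readCount is 716 in all three runs, so only the per-base sum differs.
spread = max(reported.values()) - min(reported.values())
print(f"spread in meanCoverage: {spread:.3f}")
```

Since readCount is identical in all three runs, the difference must come from how bases are summed within the target, not from which reads are assigned to it.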

See attached file for more details, including BAM, BED, and commands used. I've used sambamba v0.6.5, but can also reproduce it with 0.6.6. Haven't tried any old versions.

Any idea what could cause this?

Thanks!

target_test.tar.gz

lomereiter commented 7 years ago

Thanks for all the details. The calculation of the number of bases per region in the mate-overlap scenario is apparently buggy, which is not really surprising, as it's quite a complex bit of code. It seems to fail when a mate overlap falls on the border of a region from the BED file.
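For reference, the expected behaviour can be sketched with a brute-force per-base computation. This is not sambamba's implementation, only a minimal model that assumes overlapping mates should contribute one count per base and that regions are half-open. The function name `mean_coverage` and the toy coordinates are hypothetical.

```python
from collections import defaultdict

def mean_coverage(regions, pairs):
    """Mean per-base depth for each half-open (start, end) region.

    pairs: one list of aligned (start, end) intervals per read pair;
    bases covered by both mates are counted once.
    """
    depth = defaultdict(int)
    for mates in pairs:
        covered = set()
        for start, end in mates:
            covered.update(range(start, end))  # union, so overlap counts once
        for pos in covered:
            depth[pos] += 1
    return [sum(depth[p] for p in range(s, e)) / (e - s) for s, e in regions]

# One pair whose mates overlap on 100..104 and straddle the start of the second region.
pairs = [[(95, 105), (100, 110)]]

# Moving the first region must not change the result for the second one.
print(mean_coverage([(90, 98), (100, 120)], pairs)[1])
print(mean_coverage([(80, 88), (100, 120)], pairs)[1])
```

Because depth is computed per base before any region is considered, the value for a region cannot depend on where its neighbours are, which is the property the reported output violates.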

melferink commented 7 years ago

Ah, thanks. Indeed, without the -m flag the differences disappear (although that's not what I want).

An additional remark: calculations for -T are correct (similar) with or without -m. So apparently the bug is only within the mean-coverage calculation.