biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0
565 stars 105 forks source link

Mark duplicates in files with many contigs #361

Closed dpryan79 closed 6 years ago

dpryan79 commented 6 years ago

This is a resolution to #326 and also something we observed internally (https://github.com/maxplanck-ie/snakepipes/issues/233 ) where markdup is limited to files with <=16383 contigs. This PR modifies the size of the structures such that all BAM files should be supported, regardless of the number of contigs/scaffolds.

In my tests this produces identical results on files already supported by sambamba markdup, while requiring a small (10-20% peak) increase in RAM. There's no appreciable run-time difference that I've seen.

dpryan79 commented 6 years ago

BTW, you can get around the compiler issue by using LLVM 6.0.0 and LDC 1.10.0.

pjotrp commented 6 years ago

Thanks for the fix @dpryan79! I'll take a careful look. And yes, we are updating to LLVM and LDC latest.

dpryan79 commented 6 years ago

Thanks @pjotrp! Can you ping me when you tag a new release? I'd like to update our pipeline with the new version then :)

pjotrp commented 6 years ago

Will do on the mailing list

pjotrp commented 6 years ago

I just realised there are one or two problems with this PR. The first one is the coordinate comparison at line 765. It may be the correct thing to do, but it will change results. We have to check what samtools is doing here.

The second problem is that to read the library_id and ref_id we fetch them from the samtools-style index file which are defined as the shorter versions. I am not sure what samtools/htslib is doing there right now. Be good to check the sizes they are promoting. We may diverge with the index format, but that is obviously something we have to do carefully.

wdyt?

dpryan79 commented 6 years ago

The addition on line 765 doesn't change anything, it was part of *cast(ulong*)(&e1) == *cast(ulong*)(&e2) previously, but since the relevant part of e1 and e2 no longer fit in a ulong the comparison needs to be made to ensure that duplicates aren't over called. In my tests with real files this doesn't change any results.

ref_id is 32 bit in htslib, though since it's signed only half of that is actually usable. We rarely need to use read groups, so I'm not overly familiar with how that's handled in htslib.

pjotrp commented 6 years ago

Ok I'll check the index format. Added a reminder #365.

I just managed to create a reproducable build with ldc 1.10 and LLVM 6. Will push a release soon.