Closed dpryan79 closed 6 years ago
BTW, you can get around the compiler issue by using LLVM 6.0.0
and LDC 1.10.0
.
Thanks for the fix @dpryan79! I'll take a careful look. And yes, we are updating to LLVM and LDC latest.
Thanks @pjotrp! Can you ping me when you tag a new release? I'd like to update our pipeline with the new version then :)
Will do on the mailing list
I just realised there are one or two problems with this PR. The first one is the coordinate comparison at line 765. It may be the correct thing to do, but it will change results. We have to check what samtools is doing here.
The second problem is that to read the library_id and ref_id we fetch them from the samtools-style index file which are defined as the shorter versions. I am not sure what samtools/htslib is doing there right now. Be good to check the sizes they are promoting. We may diverge with the index format, but that is obviously something we have to do carefully.
wdyt?
The addition on line 765 doesn't change anything, it was part of *cast(ulong*)(&e1) == *cast(ulong*)(&e2)
previously, but since the relevant part of e1
and e2
no longer fit in a ulong
the comparison needs to be made to ensure that duplicates aren't over called. In my tests with real files this doesn't change any results.
ref_id
is 32 bit in htslib, though since it's signed only half of that is actually usable. We rarely need to use read groups, so I'm not overly familiar with how that's handled in htslib.
Ok I'll check the index format. Added a reminder #365.
I just managed to create a reproducable build with ldc 1.10 and LLVM 6. Will push a release soon.
This is a resolution to #326 and also something we observed internally (https://github.com/maxplanck-ie/snakepipes/issues/233 ) where markdup is limited to files with <=16383 contigs. This PR modifies the size of the structures such that all BAM files should be supported, regardless of the number of contigs/scaffolds.
In my tests this produces identical results on files already supported by
sambamba markdup
, while requiring a small (10-20% peak) increase in RAM. There's no appreciable run-time difference that I've seen.