broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

Rip out the indel recalibration code from BaseRecalibrationEngine #1056

Closed droazen closed 8 years ago

droazen commented 8 years ago

After doing this, do another comparison run against GATK3 BQSR to see how much eliminating this code bought us in terms of performance.

akiezun commented 8 years ago

To clarify, this means that the disable_indel_quals parameter will always be true, right @eitanbanks ?

eitanbanks commented 8 years ago

That's correct, @akiezun. However, it's not just stripping out that code. There are a ton of optimizations that can then be made to the code to simplify it afterwards. These classes were made very bulky to accommodate the indel calibration, and we should really remove that bulk. I can help with that.

akiezun commented 8 years ago

agreed, just wanted to know the top level starting point.

akiezun commented 8 years ago

CEUTrio.HiSeq.WEx.b37.NA12892.chr10.bam (1.2Gb on local drive) Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17;

initial optimized run on 1.2Gb file after ripping out some of the indel code: 4.pre-alpha-71-g50b094b

real    4m58.795s
user    5m3.401s

For reference, here are times for pre-optimization GATK3 and GATK4 (some data from #1033)

GATK4 master branch, with indels

real    9m7.691s
user    9m2.250s

GATK4 master branch, the -DIQ option (ie skip indel quals)

real    5m19.914s
user    5m14.710s

(which is already faster than GATK3.4.46 - numbers listed below)

GATK3.4.46 with indels

real    11m21.538s
user    17m24.320s

GATK3.4.46 with the -DIQ option (ie skip indel quals)

real    6m3.662s
user    10m34.859s

For reference, the best possible bottom line (for bqsr optimizations, not reading/writing itself) is established by PrintReads on the same data:

real    2m48.873s
user    2m27.752s
eitanbanks commented 8 years ago

@akiezun it's really important to make sure that the results match between the original and newly optimized versions (and if they don't, to make sure we understand why). Some of the "indel stuff" should probably stay, e.g. the masking of sites using known indels. Since we aren't running Indel Realigner anymore, there may be some "errors" that we do want to mask because they are alignment artifacts and not sequencing errors. Pinging @yfarjoun for his thoughts. We can discuss this at a future methods meeting if you like.

akiezun commented 8 years ago

for now this is just the apply step.

vdauwera commented 8 years ago

Hey folks, based on this effort, at what point do we start telling users to disable indel quals (or do it for them by default) in the output of BQSR in GATK3? People complain about the file sizes and this would alleviate some of that pain. I know that that's how it's done in production.

akiezun commented 8 years ago

a little bit shaved off. working on more

real    4m40.226s
user    4m42.279s
eitanbanks commented 8 years ago

@vdauwera we stopped using indel quals a while back...

akiezun commented 8 years ago

removed boxing:

real    4m35.791s
user    4m38.627s
akiezun commented 8 years ago

don't even compute indel covariates on applyBQSR

real    4m21.122s
user    4m23.733s
akiezun commented 8 years ago

reduce memory allocation at calls using varargs on applyBQSR:

real    4m11.298s
user    4m17.046s
akiezun commented 8 years ago

pre-compute the platform from the header rather than recompute it for every single read

real    4m7.124s
user    4m12.170s
akiezun commented 8 years ago

@droazen looking into the ripping out a bit more, I think it's too disruptive for alpha. The recalibration table will change and it will require a more thorough validation. As this is a potentially results-changing change, I vote to move this past alpha. I have removed the indel calculations from the ApplyBQSR because that does not change any semantics. (i propose instead to nominate https://github.com/broadinstitute/gatk/issues/1078 for alpha - I can put that in very quickly)

droazen commented 8 years ago

@akiezun For alpha-1 we should either do this or decide that it's not worth doing and close the ticket.

akiezun commented 8 years ago

let's push past alpha-1. I'd like to focus on non-disruptive speedups and on eval. @droazen ok?

droazen commented 8 years ago

@akiezun Yes, agreed -- moving the milestone for this one.

akiezun commented 8 years ago

Ongoing analyses in gsa6:/local/akiezun/gatk4_bqsr_deleteIndels_v2 branch with indels removed origin/ak_removeIndelsFromBQSR_take2

The analysis is to run with and without indels and compare recalibrated quals with and without binning

akiezun commented 8 years ago

Update1: Results are qual-by-qual identical on our test data src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam

Update2: Results are qual-by-qual identical on a 30GB exome file on gsa6 /local/akiezun/data/CEUTrio.HiSeq.WEx.b37.NA12892.bam

runtime of BaseRecalibrators (on the 30GB)

- with indels 115.0 minutes    100%
- without indels 102.8 minutes  89%

runtime of ApplyBQSR

- with indels 43.7 minutes (100%)
- without indels 51.7 minutes  <-- (109% weird, no idea why. rerunning to re-measure)
akiezun commented 8 years ago

rerun on 30GB

BaseRecalibrator with indels 119.7 minutes (100%)
BaseRecalibrator no indels    96.1 minutes  (80%)
ApplyBQSR with indels 48.5 minutes 100%
ApplyBQSR no indels 48.4 minutes   100%
droazen commented 8 years ago

what was the cause of the previous ApplyBQSR anomaly?

akiezun commented 8 years ago

how do i know? NFS or other people using gsa6 maybe

akiezun commented 8 years ago

for comparison, GATK3 on same file BaseRecalibrator 129.29 min PrintReads 232.25 min

akiezun commented 8 years ago

this one is done