Intel-HLS / GKL

Accelerated kernel library for genomics
MIT License

Potential memory leak observed in unusual HaplotypeCaller behavior #171

Open jamesemery opened 2 years ago

jamesemery commented 2 years ago

This is an expansion of the issue reported here: https://github.com/broadinstitute/gatk/issues/7693.

In summary, we observed that when running the standard GATK WARP inputs on some of our gold-standard test data (the task allocates 8000MB machines and gives ~6974MB of that to java -Xmx), we start getting machine failures with no error message once the job has run for ~6-7 hours. The failures all follow the same pattern: java terminates abruptly without any error output, and the crashes recur on the same ~6 shards. Given that there is no typical Java error (and given the results of one of our tests below), we suspect a memory failure occurring outside of Java. That doesn't definitively point to the GKL, but it is suspect. We discovered this while experimenting with sharding the HaplotypeCaller 10 ways rather than the default 50 that we typically run with. In other words, the same sample succeeds when sharded 50 ways (where each shard takes ~2 hours), but when we increase the runtime and genomic territory per shard by 5x we get persistent, repeatable memory failures on over half the shards.

Some experiments we have tried to track this down, and their results:

- Doubling the machine memory clears up the problem. This corresponds to allocating ~15GB to java -Xmx, which HaplotypeCaller may or may not actually attempt to use over a run.
- Doubling the non-Java memory also evidently clears up the crash (i.e. allocating ~5974MB to java -Xmx and leaving ~2GB unallocated). This more or less rules out Java heap memory issues.
- We have tested across GATK 4.2.2.0 and 4.2.0.0 and the crashes affect both versions (for context, we updated to GKL 0.8.8 in 4.2.1.0).
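To make the budget explicit: ~6974MB of heap on an 8000MB machine leaves only about 1GB for everything outside the Java heap (native buffers from the GKL, metaspace, thread stacks, etc.), and the second experiment above roughly doubles that headroom. A minimal sketch of how we are reasoning about the headroom (the 8000MB figure is just the machine size from our WARP task, not anything the JVM reports):

```java
// Sketch: print the JVM's configured heap ceiling next to the machine memory
// limit to see how much headroom is left for native allocations (GKL buffers,
// metaspace, thread stacks, ...). Adjust machineBytes to your environment.
public class MemoryBudget {
    public static void main(String[] args) {
        long machineBytes = 8000L * 1024 * 1024;               // machine allocation from the WARP task
        long heapMaxBytes = Runtime.getRuntime().maxMemory();  // roughly the -Xmx value
        long headroomBytes = machineBytes - heapMaxBytes;

        System.out.printf("heap max (-Xmx): %,d MB%n", heapMaxBytes / (1024 * 1024));
        System.out.printf("machine limit:   %,d MB%n", machineBytes / (1024 * 1024));
        System.out.printf("native headroom: %,d MB%n", headroomBytes / (1024 * 1024));
        // With ~6974 MB of heap on an 8000 MB machine, only ~1 GB remains for
        // everything outside the Java heap; a native leak of that size is enough
        // to kill the process with no Java stack trace.
    }
}
```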

I am running with the standard WARP WDLs on some standard data, and I can share the exact tasks being run if need be. Any suggestions for techniques to track down whether it really is the GKL (which we know often does take a significant amount of memory), or what else might be causing the problem, would be appreciated. We do expect memory usage across sites to be spiky in GATK in general, but it is unexpected that running over the same sites, after the process has already been running for ~6 hours, should make a significant difference to the memory used.
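One technique we could try, sketched below (Linux-only, since it reads VmRSS from /proc/self/status): sample the Java heap usage next to the whole-process resident set size over the life of a shard. If RSS climbs steadily while heap usage stays flat, the growth is happening in native allocations rather than inside -Xmx. This is a rough sketch, not anything currently wired into GATK:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of a background sampler that logs Java heap usage next to the
// whole-process RSS (VmRSS from /proc/self/status, Linux only). Steadily
// rising RSS with flat heap usage points at off-heap (native) growth.
public class RssVsHeapSampler {

    public static void start(long intervalMillis) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                long heapUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
                System.out.printf("heap used: %,d MB, process RSS: %,d MB%n",
                        heapUsed / (1024 * 1024), readRssKb() / 1024);
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "rss-sampler");
        t.setDaemon(true);
        t.start();
    }

    // Parse VmRSS (in kB) out of /proc/self/status; returns -1 if unavailable.
    private static long readRssKb() {
        try {
            for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
                if (line.startsWith("VmRSS:")) {
                    return Long.parseLong(line.replaceAll("[^0-9]", ""));
                }
            }
        } catch (IOException e) {
            // not on Linux, or /proc unavailable
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        start(60_000);                // log once a minute
        Thread.sleep(Long.MAX_VALUE); // placeholder for the real workload
    }
}
```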

droazen commented 2 years ago

We are going to do some more testing on our end to narrow down the list of possible culprits -- in particular, we'll do a run with the GKL PairHMM turned off and see if that passes.
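For reference, the isolation idea is roughly this: drive the same amount of work through the code path under suspicion and watch whether the process footprint diverges between the native-backed and pure-Java configurations. A minimal sketch of that pattern follows; `LikelihoodEngine` and `computeBatch` are hypothetical stand-ins for whichever implementation is under test, not the GATK or GKL API:

```java
// Sketch of a leak-isolation harness: run the same batch of work many times
// through a candidate implementation, force GC so heap noise doesn't mask
// native growth, and compare heap usage against externally observed RSS
// (e.g. the sampler above, or `ps`). Flat heap + rising RSS => native leak.
public class NativeLeakProbe {

    // Hypothetical stand-in for the code path being exercised (e.g. one PairHMM batch).
    interface LikelihoodEngine {
        void computeBatch();
    }

    public static void probe(LikelihoodEngine engine, int iterations) {
        for (int i = 0; i < iterations; i++) {
            engine.computeBatch();
            if (i % 100 == 0) {
                System.gc(); // reclaim heap garbage before measuring
                long heapUsedMb = (Runtime.getRuntime().totalMemory()
                        - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
                System.out.printf("iter %d: heap used %,d MB%n", i, heapUsedMb);
            }
        }
    }

    public static void main(String[] args) {
        // No-op engine just so the sketch compiles and runs; swap in the real workload.
        probe(() -> { }, 1_000);
    }
}
```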