tfenne opened this issue 4 years ago (status: Open)
@tfenne We've had a couple of instances of non-determinism in the HaplotypeCaller in the past (https://github.com/broadinstitute/gatk/pull/6195, https://github.com/broadinstitute/gatk/pull/6104), but these resulted only in very minor differences in the output, and were patched in version 4.1.4.0. "A hardware issue that intermittently affects only AVX operations" is theoretically possible, I suppose, but seems unlikely. Could you try re-running a bunch of times with DEBUG logging verbosity (`--verbosity DEBUG`), as well as the diagnostic arguments `--bam-output`, `--assembly-region-out`, and `--debug-assembly-region-state` (if that last argument exists in 4.1.4.1; I'm not sure about that)? If you get lucky and replicate the issue with these debug arguments on, the extra logging and output files would help us attempt to diagnose the issue.
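For reference, a sketch of what that invocation might look like; everything other than the diagnostic flags themselves (reference, input and output paths, intervals, GVCF mode) is a placeholder rather than something taken from the original report:

```bash
# Hypothetical debug invocation; paths and -ERC mode are placeholders.
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -O sample.debug.g.vcf.gz \
  -ERC GVCF \
  --verbosity DEBUG \
  --bam-output sample.debug.bamout.bam \
  --assembly-region-out sample.debug.assembly_regions.igv \
  --debug-assembly-region-state   # omit if not recognized by 4.1.4.1
```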
@jamesemery Could you provide your thoughts on this one? What could intermittently cause the allele likelihoods to be null in the annotation code?
@tfenne Given that you're seeing this 25% of the time (in this small sample), it's at least common enough that we can investigate efficiently. Have you tried running without the AVX-accelerated PairHMM? The Java LOGLESS_CACHING implementation is not hardware-accelerated, so if you are having AVX issues I would expect that one to succeed 100% of the time. (It is significantly slower. By a lot.) You can toss in `-pairHMM LOGLESS_CACHING` to give it a shot.
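A minimal sketch of that, with all paths and the GVCF mode again being placeholders:

```bash
# Same hypothetical invocation, but forcing the pure-Java PairHMM so the
# AVX-accelerated native library is taken out of the picture entirely.
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -O sample.logless.g.vcf.gz \
  -ERC GVCF \
  -pairHMM LOGLESS_CACHING
```

If the Java implementation never reproduces the bad output while the AVX path occasionally does, that would point at the native library or the hardware rather than the downstream annotation code.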
Thanks @droazen & @ldgauthier. I can certainly run a bunch more iterations of the same HC run on the same data, though I'm not super hopeful it will turn anything up. I can also try a bunch of the different PairHMM implementations. I can't share too much, but this issue turned up in a very high-throughput (1000s of samples a day) clinical pipeline. We're going back and looking for other instances where we see an excess of the `Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null` message, and re-running those samples to see whether, on re-run, they generate different outputs.
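One rough way to flag candidate samples for re-running would be to count that warning in each run log; the log location and the threshold below are illustrative assumptions, not details from our pipeline:

```bash
# Count occurrences of the suspicious warning per run log and flag runs with
# an unusually high count; the glob and the cutoff are placeholders.
for log in /path/to/run_logs/*.log; do
  n=$(grep -c "alleleLikelihoodMap is null" "$log")
  if [ "$n" -gt 100 ]; then
    echo "$log: $n warnings"
  fi
done
```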
I realize the AVX-specific hardware issue is perhaps a little far-fetched, but given the volume of the pipeline and the fact that it runs in a cloud environment, I think it's entirely reasonable to expect that we'll run into hardware/instance issues occasionally. And there are AVX-specific (or at least SIMD-specific) registers, so if one of those were to develop problems, it could cause the PairHMM issues without causing issues in other software that doesn't leverage SIMD/AVX instructions.
My main question really is this: is anyone familiar enough with the Intel PairHMM implementation and interface that they could weigh in on whether or not unexpected hardware errors could result in the return of empty likelihoods from the PairHMM instead of some kind of error, exception or segfault?
@tfenne I recommend asking that last question in the GKL repo: https://github.com/Intel-HLS/GKL -- I imagine that @mepowers or another member of her team could give you a more educated opinion than we could.
This is more of a question than an outright bug report. I've observed something very strange today, that I cannot reproduce, and am looking for some help figuring out what's going on.
The long and the short of it is that I'm operating on a commercial platform where I've run the same job 4 times. I can see logs and confirm that a) the exact same docker image is used for all four runs, b) the exact same GATK command is used for all four runs, and c) the exact same inputs are provided to each of the four runs. I can't share details (yet) but I'm 99.99% confident that I'm executing the exact same code on the exact same input data and getting quite different results.
Specifically, the first job produces output that is different from the remaining three jobs, which are all identical to each other (except for datetimes in the headers of the VCFs).
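For what it's worth, the check that the three matching runs are identical apart from header datetimes was essentially a diff over the record lines only; a sketch, with placeholder file names:

```bash
# Compare gVCF records while ignoring the ## header lines, where the
# run-specific datetimes live; file names are placeholders.
diff <(zcat run1.g.vcf.gz | grep -v '^##') \
     <(zcat run2.g.vcf.gz | grep -v '^##')
```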
The outlier run misses a number of variants (about 10% vs. the other three runs). And the entries in the gVCF where the variants are missed are weird. E.g. there'll be a gVCF entry for a single base where, if you believed the data in the gVCF, there would be no reason to emit a separate block. And that entry will have high coverage (e.g. DP=800), assign all the coverage to the REF allele (the site is clearly about a 50/50 het in IGV) and emit GQ=0 for GT=0/0.
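If it's useful, those suspect entries can be pulled out for inspection with something like the bcftools query below; this assumes bcftools is available, and the DP cutoff is just an illustrative threshold chosen to match the description above:

```bash
# List sites emitted as hom-ref with GQ=0 despite very high depth;
# the DP cutoff is illustrative, not a value from the actual pipeline.
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t%DP\t%GQ]\n' \
  -i 'FMT/GQ==0 && FMT/DP>500 && GT="0/0"' run1.g.vcf.gz
```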
One very noticeable difference is that the three "good" runs complete traversal without any warnings, but that "bad" run emits the following warning once:
and the following warning many times (~350):
I have a theory about what's going on, and I'm hoping someone who is more knowledgeable can tell me whether my theory is sensible or impossible, and whether there's anything I can do to confirm it. My theory is this: a) the one bad job got run on a compute instance that has a hardware issue that intermittently affects only AVX operations, b) the Intel native PairHMM doesn't handle that situation gracefully but instead returns an empty likelihoods map, and c) that's what's causing the warnings I'm seeing and the discrepancies in the gVCFs.
I'm at a bit of a loss for what to do here, since I've tried multiple times to reproduce the issue and cannot, and therefore also can't try running with different GATK versions or options, etc. But at the same time, if it's possible for a hardware issue to cause these problems without crashing GATK, that's very scary.
The following is the logging prior to traversal so you can see which versions of various things are in use:
Any insight into what's going on and how to diagnose it would be greatly appreciated.