tfenne opened this issue 4 years ago (status: Open)
@tfenne We've had a couple of instances of non-determinism in the HaplotypeCaller in the past (https://github.com/broadinstitute/gatk/pull/6195, https://github.com/broadinstitute/gatk/pull/6104), but these resulted only in very minor differences in the output, and were patched in version 4.1.4.0. "A hardware issue that intermittently affects only AVX operations" is theoretically possible, I suppose, but seems unlikely. Could you try re-running a bunch of times with DEBUG logging verbosity (`--verbosity DEBUG`), as well as the diagnostic arguments `--bam-output`, `--assembly-region-out`, and `--debug-assembly-region-state` (if that last argument exists in 4.1.4.1; I'm not sure about that)? If you get lucky and replicate the issue with these debug arguments on, the extra logging and output files would help us attempt to diagnose the issue.
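For reference, a sketch of what that invocation might look like; everything other than the diagnostic flags themselves (reference, input and output paths, intervals, GVCF mode) is a placeholder rather than something taken from the original report:

```bash
# Hypothetical debug invocation; paths and -ERC mode are placeholders.
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -O sample.debug.g.vcf.gz \
  -ERC GVCF \
  --verbosity DEBUG \
  --bam-output sample.debug.bamout.bam \
  --assembly-region-out sample.debug.assembly_regions.igv \
  --debug-assembly-region-state   # omit if not recognized by 4.1.4.1
```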
@jamesemery Could you provide your thoughts on this one? What could intermittently cause the allele likelihoods to be null in the annotation code?
@tfenne Given that you're seeing this 25% of the time (in this small sample), it's at least common enough that we can investigate efficiently. Have you tried running without the AVX-accelerated PairHMM? The Java LOGLESS_CACHING implementation is not hardware-accelerated, so if you are having AVX issues I would expect that one to succeed 100% of the time. (It is significantly slower. By a lot.) You can toss in `-pairHMM LOGLESS_CACHING` to give it a shot.
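A minimal sketch of that, with all paths and the GVCF mode again being placeholders:

```bash
# Same hypothetical invocation, but forcing the pure-Java PairHMM so the
# AVX-accelerated native library is taken out of the picture entirely.
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -O sample.logless.g.vcf.gz \
  -ERC GVCF \
  -pairHMM LOGLESS_CACHING
```

If the Java implementation never reproduces the bad output while the AVX path occasionally does, that would point at the native library or the hardware rather than the downstream annotation code.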
Thanks @droazen & @ldgauthier. I can certainly run a bunch more iterations of the same HC run on the same data, though I'm not super hopeful it will turn anything up. I can also try a bunch of the different PairHMM implementations. I can't share too much, but this issue turned up in a very high-throughput (1000s of samples a day) clinical pipeline. We're going back and looking for other instances where we see an excess of the `Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null` message, and re-running those samples to see whether, on re-run, they generate different outputs.
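One rough way to flag candidate samples for re-running would be to count that warning in each run log; the log location and the threshold below are illustrative assumptions, not details from our pipeline:

```bash
# Count occurrences of the suspicious warning per run log and flag runs with
# an unusually high count; the glob and the cutoff are placeholders.
for log in /path/to/run_logs/*.log; do
  n=$(grep -c "alleleLikelihoodMap is null" "$log")
  if [ "$n" -gt 100 ]; then
    echo "$log: $n warnings"
  fi
done
```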
I realize the AVX-specific hardware issue is perhaps a little far-fetched, but given the volume of the pipeline and the fact that it runs in a cloud environment, I think it's entirely reasonable to expect that we'll run into hardware/instance issues occasionally. And there are AVX-specific (or at least SIMD-specific) registers, so if one of those were to develop problems, it could cause the PairHMM issues without causing issues in other software that doesn't leverage SIMD/AVX instructions.
My main question really is this: is anyone familiar enough with the Intel PairHMM implementation and interface that they could weigh in on whether or not unexpected hardware errors could result in the return of empty likelihoods from the PairHMM instead of some kind of error, exception or segfault?
@tfenne I recommend asking that last question in the GKL repo: https://github.com/Intel-HLS/GKL -- I imagine that @mepowers or another member of her team could give you a more educated opinion than we could.
This is more of a question than an outright bug report. I've observed something very strange today, that I cannot reproduce, and am looking for some help figuring out what's going on.
The long and the short of it is that I'm operating on a commercial platform where I've run the same job 4 times. I can see logs and confirm that a) the exact same docker image is used for all four runs, b) the exact same GATK command is used for all four runs, and c) the exact same inputs are provided to each of the four runs. I can't share details (yet) but I'm 99.99% confident that I'm executing the exact same code on the exact same input data and getting quite different results.
Specifically, the first job produces output that is different from the remaining three jobs, which are all identical to each other (except for datetimes in the headers of the VCFs).
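For what it's worth, the check that the three matching runs are identical apart from header datetimes was essentially a diff over the record lines only; a sketch, with placeholder file names:

```bash
# Compare gVCF records while ignoring the ## header lines, where the
# run-specific datetimes live; file names are placeholders.
diff <(zcat run1.g.vcf.gz | grep -v '^##') \
     <(zcat run2.g.vcf.gz | grep -v '^##')
```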
The outlier run misses a number of variants (about 10% vs. the other three runs). And the entries in the gVCF where the variants are missed are weird. E.g. there'll be a gVCF entry for a single base where, if you believed the data in the gVCF, there would be no reason to emit a separate block. And that entry will have high coverage (e.g. DP=800), assign all the coverage to the REF allele (the site is clearly about a 50/50 het in IGV) and emit GQ=0 for GT=0/0.
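If it's useful, those suspect entries can be pulled out for inspection with something like the bcftools query below; this assumes bcftools is available, and the DP cutoff is just an illustrative threshold chosen to match the description above:

```bash
# List sites emitted as hom-ref with GQ=0 despite very high depth;
# the DP cutoff is illustrative, not a value from the actual pipeline.
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t%DP\t%GQ]\n' \
  -i 'FMT/GQ==0 && FMT/DP>500 && GT="0/0"' run1.g.vcf.gz
```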
One very noticeable difference is that the three "good" runs complete traversal without any warnings, but that "bad" run emits the following warning once:
and the following warning many times (~350):
I have a theory about what's going on, and I'm hoping someone who is more knowledgeable can tell me whether my theory is sensible or impossible, and whether there's anything I can do to confirm it. My theory is this: a) the one bad job got run on a compute instance that has a hardware issue that intermittently affects only AVX operations, b) the Intel native PairHMM doesn't handle that situation gracefully but instead returns an empty likelihoods map, and c) that's what's causing the warnings I'm seeing and the discrepancies in the gVCFs.
I'm at a bit of a loss for what to do here, since I've tried multiple times to reproduce the issue and cannot, and therefore also can't try running with different GATK versions or options, etc. But at the same time, if it's possible for a hardware issue to cause these problems without crashing GATK, that's very scary.
The following is the logging prior to traversal so you can see which versions of various things are in use:
Any insight into what's going on and how to diagnose it would be greatly appreciated.