broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

make newQual the default for GenotypeGVCFs #4614

Closed lbergelson closed 5 years ago

lbergelson commented 6 years ago

People don't know to use new qual with GenotypeGVCFs, so they're wasting a lot of time running the less efficient old qual. There are also people encountering bugs in old qual (see https://github.com/broadinstitute/gatk/issues/4544). We should consider making new qual the default and deprecating old qual.

@ldgauthier @davidbenjamin Thoughts?

davidbenjamin commented 6 years ago

@lbergelson We were just talking about this. Thanks to @skwalker's HaplotypeCaller tie-out I think it's close. Question for the engine team: this is going to break several integration tests. What would you suggest doing?

ldgauthier commented 6 years ago

The other issue is that making newQual the default is going to introduce batch effects. @vdauwera can correct me if I'm wrong, but the party line used to be "use the same version for all of your samples or we're not responsible". I do want to do it, I just don't know how we make it as obvious as possible. Would this justify a minor release? Or we could put one of those scary red warnings in the GenotypeGVCFs log for a few releases?

@davidbenjamin is still going to look into one last spanning deletion issue and we'll run a bunch of samples through HaplotypeCaller (since they share the same AlleleFrequencyCalculator) to make sure nothing chokes.

droazen commented 6 years ago

@davidbenjamin I do think we should throw this switch soon, perhaps even for the next release, provided we keep the old code path intact for now. There are two things that could affect the exact timing of the PR:

  1. I have a major HaplotypeCaller testing branch still in flight that I'd prefer to get merged before we throw the switch. I'd expect this to hit master sometime next week.

  2. There has been some talk of a supplemental evaluation to the Palantir HC tie-out for the exomes on the cloud pipeline. I defer to @ldgauthier as to whether this change should wait until after that.

ldgauthier commented 6 years ago

If we keep the old code intact as you propose, then the first phase could be to swap the default from old to new and make -oldQual an argument; then it won't affect any potential HC tie-outs. David B. is going to do the 1000-exome scale test -- is that what you were thinking of as the supplemental evaluation? Or do we still need something for FE?

davidbenjamin commented 6 years ago

Update: the 1000 (okay, 856 to be precise) exomes test passed without a hiccup. After David R's big HC test PR goes in I will address this.

ldgauthier commented 6 years ago

What happened to the other 144?

davidbenjamin commented 6 years ago

They didn't exist -- there were 856 bams in the list.

sooheelee commented 6 years ago

Another option to consider is to force people to consciously choose one qual model or the other by making the parameter default null.

lbergelson commented 6 years ago

I'm against making people think when the right answer is "use newQual".

sooheelee commented 6 years ago

Could be a temporary solution, as folks segue to newQual.

droazen commented 6 years ago

@davidbenjamin Out of curiosity, how were the 856 outputs evaluated? Did you check concordance against GATK3.x? If so, which fields were included in the concordance check? Did you keep track of runtimes? (I'd be curious to know if any particular inputs were outliers in overall runtime).

Also, would you mind pasting the full HaplotypeCaller command you ran with here? (including -Xmx, etc.). Thanks!

ldgauthier commented 6 years ago

@droazen are you worried that the Palantir HC 3v4 evaluation didn't include enough samples?

GATK 3.6(?) GVCFs for comparison are here: /seq/tng/methodsdev/data/sample_vcf//Exome/Homo_sapiens_assembly19/.

It would probably be relatively quick to re-run @skwalker 's comparison from https://github.com/broadinstitute/palantir/tree/master/Analysis/519_v2_hc on those.

ldgauthier commented 6 years ago

@davidbenjamin did you use /seq/references/Homo_sapiens_assembly19/v1/variant_calling/exome_calling_regions.v1.interval_list for your calling intervals? I'm seeing some discrepancies in really high QUAL variants. And your genotypes don't have DP for some reason.

davidbenjamin commented 6 years ago

@ldgauthier I used /seq/references/Homo_sapiens_assembly19/v1/variant_calling/exome_evaluation_regions.v1.interval_list.

@droazen My command line was

gatk HaplotypeCaller -I ${bam} -R ${ref_fasta} ${"-L " + intervals} \
            -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation -new-qual \
            -O ${vcf}

Here's a histogram of total runtime in minutes, minus pairHMM time and Smith-Waterman time: [histogram: times]

davidbenjamin commented 6 years ago

@ldgauthier I have no idea about the DP. All I did was build master and run with new-qual!

ldgauthier commented 6 years ago

The DP discrepancy is on me: I forgot that it's in its own special group with the ClippingRankSum annotation. So the appropriate annotation args are actually -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation, but that's not a reason to rerun the data.
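
(For concreteness, a sketch of the corrected invocation, reusing the placeholders from David's command above:)

gatk HaplotypeCaller -I ${bam} -R ${ref_fasta} ${"-L " + intervals} \
            -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
            -new-qual -O ${vcf}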

ldgauthier commented 6 years ago

For my HaplotypeCaller runs, CHMI_CHMI3_Next1 took 2.71 CPU-hours; the mean over the largest callset was 11.8 CPU-hours and the median was 4.8 CPU-hours, so I probably have a long tail like David's (although that's from a superset of David's samples).

droazen commented 6 years ago

Thanks for the additional info! To clarify, though, was any validation done on the GVCF outputs themselves to check concordance vs. GATK3? If not, can I suggest that we do such a validation? Being able to say that we have 856 validated runs of the GATK4 HC on exome data would greatly bolster the case for including it in the new exome pipeline.

droazen commented 6 years ago

Also @davidbenjamin: would you mind sending me the 5 exomes with the longest runtimes? I want to do a few experiments to check runtime vs. GATK3 on those samples, and see if the downsampling needs to be made more aggressive in 4.

ldgauthier commented 6 years ago

No validation here. I was satisfied with the validation from the Palantir report and using this as a robustness test to show that GATK4 HC isn't going to fall over.

I have a matched list of GVCFs here: /humgen/gsa-hpprojects/dev/gauthier/scratch/newQualHC/check.list @skwalker could you adapt your analysis to run with this list? I'll need to give you a different jar for the GenotypeGVCFs step on my GVCFs since the annotation format is outdated.

davidbenjamin commented 6 years ago

@droazen Here are the five slowest bams according to non-pairHMM and non-Smith-Waterman time. With the exception of one, these are also the five slowest according to total HaplotypeCaller time.

| bam | time, excluding pairHMM and Smith-Waterman (minutes) | total HC time (minutes) |
| --- | --- | --- |
| /seq/picard_aggregation/C1468/09C82222/v1/09C82222.bam | 269 | 505 |
| /seq/picard_aggregation/C1710/10C109033/v1/10C109033.bam | 269 | 545 |
| /seq/picard_aggregation/C1710/12171/v1/12171.bam | 269 | 402 |
| /seq/picard_aggregation/C1710/08C73332/v1/08C73332.bam | 294 | 613 |
| /seq/picard_aggregation/C1710/CC940.201/v1/CC940.201.bam | 315 | 496 |

The slowest bam in total time was /seq/picard_aggregation/C1468/DEASD_0384_001/v1/DEASD_0384_001.bam, which took 1144 minutes.

ldgauthier commented 6 years ago

That's not an entirely fair comparison because DEASD_0384_001 is on fargo. @davidbenjamin can you run readlink -f on your paths and take the slowest among the on-prem bams (if there are any left)?
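
(Concretely, something like this -- assuming a hypothetical bams.list holding the 856 paths:)

# resolve each symlink to its physical location to see which storage it actually lives on
while read -r bam; do readlink -f "$bam"; done < bams.list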

droazen commented 6 years ago

I imagine that @skwalker's scripts could be adapted for the task -- I'll try to set up a meeting with her next week to discuss.

davidbenjamin commented 6 years ago

@ldgauthier Just checked; they're all on fargo.

chandrans commented 6 years ago

Hi everyone. I am wondering if using newQual in GenotypeGVCFs is okay if the user did not specify it in HaplotypeCaller?

ldgauthier commented 6 years ago

Totally fine. The QUAL score in per-sample GVCFs gets ignored and recalculated in GenotypeGVCFs.
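
(In other words, something like the following is fine even on GVCFs that were produced without -new-qual -- a sketch with hypothetical file names:)

gatk GenotypeGVCFs -R ref.fasta -V sample.g.vcf.gz -new-qual -O sample.vcf.gz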

chandrans commented 6 years ago

Thanks Laura! Good to know.

skwalker commented 6 years ago

@ldgauthier I have some results for the two sets of gvcfs you sent me.

The GC results look good between the two sets: [plot: sensitivity_specificity_plot]

The differences in annotations don't seem too bad: [plot: 13_annotation_differences]

I have 15 other plots like that if you want to take a look (I had to do separate plots for 50 samples at a time because doing them all together broke R).

And there are roughly 7 different genotypes for each sample across the two VCFs: [attachment: UniqueGTDifferences.xlsx]

ldgauthier commented 6 years ago

I think these look perfectly fine. HC newQual gets a +1 from me. By my count, there are about 20-30K variants per sample, so 7 discrepancies per sample is just noise.

I spot checked some of the specific differences in the spreadsheet to see what's going on there. In a bunch of cases, GATK3 calls a genotype including the non-ref allele, which is embarrassing, but the issue is fixed with newQual in GATK4! Also, about half of these are on chromosome 6, many in HLA regions (chr6 ~32M), which we've never claimed to do well on. A couple I looked at are in repetitive regions where the assembly varies based on the exact intervals. There was one in my nemesis gene MUC6 where the old qual model dropped an alt at a triallelic site -- GATK4 prevails. There was one site I couldn't explain, but both GATK3 and GATK4 contradicted the pileup in different ways.

droazen commented 6 years ago

@ldgauthier Are these results sufficient to convince you to use the GATK4 HaplotypeCaller in the upcoming exomes on the cloud pipeline? Is it worth setting up a meeting later this week to go over @skwalker 's results in further detail?

ldgauthier commented 6 years ago

I still want @davidbenjamin to double check the spanning deletion handling in his code (i.e. we shouldn't count it as a real allele, since it's echoing an upstream event that's already been evaluated -- otherwise we could have a lot of high quality spanning-deletion-only variants where the triggering SNP is too low quality, which would be silly).

@skwalker I would like to do one more thing to check on the allele-specific annotations. Can you genotype all David B.'s GATK4 GVCFs together using GenomicsDBImport? We can use this to test #3707 on a larger callset and then compare the AS vs. non-AS output at biallelic sites.
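
(Roughly, with hypothetical file and workspace names, and a single interval shard for illustration:)

# import all of the GVCFs into a GenomicsDB workspace
gatk GenomicsDBImport -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
    --genomicsdb-workspace-path newqual_db -L 20

# joint genotype straight out of the workspace, keeping the allele-specific annotations
gatk GenotypeGVCFs -R ref.fasta -V gendb://newqual_db \
    -G StandardAnnotation -G AS_StandardAnnotation -new-qual -O joint.vcf.gz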

skwalker commented 6 years ago

@ldgauthier sorry for the delay, I've plotted the AS vs non-AS at biallelic sites.

[plots: baseqranksum, inbreedingcoeff, mqranksum, qd, readposranksum]

ldgauthier commented 6 years ago

Thanks for the plots @skwalker ! To clarify, these are the AS vs non-AS annotations from the same callset at biallelic sites, right? I'm not surprised there are some differences, but it's hard to tell how significant they are at full alpha. Can you turn down the alpha on the geom_point or add some contours?

skwalker commented 6 years ago

Yep, these are from the same callset. I've added some contours and lowered the alpha here: [plots: baseqranksum, inbreedingcoeff, mqranksum, qd, readposranksum]

ldgauthier commented 6 years ago

Beautiful! The contours seem to imply that there aren't that many crazy rank sum outliers. I'd like to know what some of them are, though. Can you pick a |ReadPosRankSum - AS_ReadPosRankSum| > 8 position and give me the SelectVariants output from the GDB at that position? I'm hoping it's because in the AS case the data is split by allele, but then the crappy allele(s) get dropped so it looks biallelic but isn't using the same data.

Ditto for |AS_QD - QD| > 20. And can you plot AS_MQ vs. MQ, AS_SOR vs SOR and AS_FS vs. FS too? I suspect MQ will have the same behavior (high density of matching with some weird outliers), but I expect FS and SOR to be spot on.
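
(For pulling a single site out of the GDB, a sketch along these lines -- reusing the hypothetical workspace name from above; the position is the ReadPosRankSum outlier dissected below:)

gatk SelectVariants -R ref.fasta -V gendb://newqual_db -L 9:36651842 -O outlier_site.vcf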

davidbenjamin commented 6 years ago

#4801 addressed the spanning deletion issue with new qual, so I'm ready to go ahead and do this once a green light is given. @droazen you wanted to wait for your HC tests branch to go in, right?

ldgauthier commented 6 years ago

...at which point we will write RELEASE NOTES with warnings about batch effects in BIG LETTERS!

skwalker commented 6 years ago

Here are the rest of the plots... not quite what you were expecting:

[plots: fs, mq, sor]

ldgauthier commented 6 years ago

Well. That MQ plot is... special. It's a known issue that there's some hinky (non-AS) MQ behavior in regions with uninformative reads (I made an approximation that doesn't hold as often as I'd hoped). I'm working on a branch to fix it literally right now. Also, (non-AS) MQ uses all the reads, whereas AS_MQ only uses the informative ones, so many sites will have slight differences and some sites (gross indels and the like) will have larger differences.

I suspect that the sites that have drastically different AS vs non-AS annotations will probably be common across all annotation types, but we'll see after you can pull out a few examples.

droazen commented 6 years ago

@davidbenjamin Yes, that's correct -- the HC testing branch is coming soon, and I think we should be able to get both that branch and the switch to newQual in for the next release (assuming that there are no outstanding objections from anyone else on the switch).

ldgauthier commented 6 years ago

Sarah dug up variants with big annotation differences for me and they confirm my hypothesis about dropping alleles in genotyping.

Raw data for ReadPosRankSum (1.330) vs AS_ReadPosRankSum (-13.4), with some annotations removed for clarity:

9  36651842  .  C  A,T,<NON_REF>  .  .  AS_RAW_BaseQRankSum=|-10.500,1|-1.700,1|;AS_RAW_MQRankSum=|-0.600,1|0.100,1|;AS_RAW_ReadPosRankSum=|-13.400,1|1.300,1|;AS_SB_TABLE=460,384|0,86|1,1|0,0;BaseQRankSum=-1.661e+00;MQRankSum=0.150;ReadPosRankSum=1.33

Genotypes with any alternate reads are (GT:AD:PL):

./.:567,86,0,0:45,0,18349,1741,18594,20335,1741,18594,20335,20335
./.:277,0,2,0:0,829,9096,785,9053,9047,829,9096,9053,9096

The genotyped VC has an A as an alt with AC=1 (and from the second genotype you can see the T is never called; it just has some reads in one sample, so it gets output by HC). The A is pretty low quality anyway -- all 86 reads are in one direction (from the AS_SB_TABLE).

Anyway, to rationalize the rank sum discrepancies: when the GVCFs get merged, we take the median of the non-AS rank sums (here the two samples with alt reads have distinct alts, so the AS_RAW_*RankSum list is effectively the same as the combined non-AS rank sums), which in our implementation is the greater of the two middle values for an even-length list (see Utils::getMedianValue). So the non-AS rank sums end up getting the better of the two values, which aren't even derived from the data they're supposed to represent. Alleles for the win! (The values are slightly different because the AS values get floored to the nearest tenth as part of the annotation.)
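
(To make the median behavior concrete -- a throwaway shell sketch of "greater of the two middle values" applied to this site's two per-sample rank sums; it reproduces the reported ReadPosRankSum of 1.33:)

# upper-middle element of the sorted list, mimicking the getMedianValue behavior described above
printf '%s\n' -13.4 1.33 | sort -g | awk '{ a[NR]=$1 } END { print a[int(NR/2)+1] }'
# -> 1.33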

Raw data for QD (34.34) vs AS_QD (2.83):

X  13681128  .  G  A,C,<NON_REF>  .  .  AS_RAW_BaseQRankSum=|-1.900,1,-1.800,1||;AS_RAW_MQRankSum=|-5.800,1,0.300,1||;AS_RAW_ReadPosRankSum=|1.000,1,1.200,1||;AS_SB_TABLE=99,68|2,2|35,20|0,0;BaseQRankSum=-1.783e+00;MQRankSum=0.302;RAW_MQ=800193.00;ReadPosRankSum=1.29

Genotypes with alt reads here are (GT:AD:PL):

./.:53,2,0,0:0,116,1821,159,1827,1870,159,1827,1870,1870
./.:0,0,55,0:1947,1947,1947,165,165,0,1947,1947,165,1947
./.:114,2,0,0:0,299,3972,342,3978,4021,342,3978,4021,4021

In the genotyped VC the A gets dropped and C gets AC=2 (one homVar). Again, we're seeing representation of the dropped alt only in homRef samples. The non-AS rank sum behavior here sucks too, but we're after QD. Unfortunately, most of the work for QD gets done inside GenotypeGVCFs, so I had to debug. AS_QD uses the per-allele probability of AC=0 (so it allows for AC>0 for other alts), which is the same as the probability of all samples being homRef if there's only one alt. In cases like this, where there's another alt with some weight in the PLs (however small), those probabilities will not be the same. Honestly I'm surprised by how different they are -- -188.9 for log10(P(all homRef)) vs -15.55 for log10(P(high quality alt AC = 0)).

In summary, these data show that the discrepancies in a small subset of cases are expected: true biallelic sites agree, but these discrepant sites only look biallelic after alleles get dropped in genotyping. I will also look into the AS_QD calc for multi-allelics. Clearly from the cloud around 30, the AS_QD is maxing out in some cases. This is my opportunity to fix that and finally remove the jitter.

ldgauthier commented 6 years ago

The new model effectively increases QUAL scores by 20 (I wrote this up in the NewQualQC2 Palantir report), but the variants whose QUAL falls between the emission threshold and what would be the equivalent threshold under the old model aren't great: [image]

The few good values out in the tail are potentially good variants, but with 2 or 1 alt reads, which most users filter. The rest of them have crappy allele balance.

When we use new qual we should use -stand-call-conf 30 instead of the current default of 10.
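
(In command-line terms, a sketch with hypothetical file names:)

gatk GenotypeGVCFs -R ref.fasta -V gendb://newqual_db \
    -new-qual -stand-call-conf 30 -O joint.vcf.gz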

davidbenjamin commented 6 years ago

This difference of 20 makes sense theoretically. The old qual always has a heterozygosity prior of 1 in 1000, whereas new qual starts there but is willing to learn a different allele frequency. Very hand-wavingly, a difference of 20 or so means that new qual has learned an allele frequency in the ballpark of 1 in 10 (since log_10 1/10 = log_10 1/1000 + 2 and 2 --> 20 in phred scale). So basically, they only squeak by the old threshold because new qual has learned the artifact allele frequency as a true allele frequency. :thumbsup: for stand-call-conf 30
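
(A quick sanity check of that arithmetic:)

# phred-scale gap between a learned allele frequency of 1/10 and the fixed 1/1000 prior
echo '10 * (l(1/10)/l(10) - l(1/1000)/l(10))' | bc -l
# -> 20.00000000000000000000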

ldgauthier commented 6 years ago

I'm missing something here. 10% allele frequency is really high in practice. The AN varies pretty substantially because exome coverage is so uneven (and I think there are two different captures here), but the AF for most variants in this callset is less than 1 in 1000 (tail trimmed): [image]

Or maybe your prior requires a lot more evidence from the homRefs than my callset has?

Also, is it unreasonable to expect this ratio of model priors to be similar in different sized callsets?

ldgauthier commented 6 years ago

There are also some discrepancies Sarah saw that stem from hom ref calls that get annotated because they have an alt allele with a minuscule posterior probability that's still > 0 (where 0 is the emission threshold in GVCF mode). This changes the results in some cases and will also break homRef blocks wherever there are any alt reads at all, which I don't want to do.

AFCalculationResults::isPolymorphic is returning true because the log10 of this tiny probability is still less than zero. The old model appears to flush to zero. I haven't dug into the old code to try to figure out what its threshold is, but in this case the new model's log10 probability is -2e-159 -- pretty tiny.

If it's computationally beneficial to flush to zero then we can do that, but otherwise we can add a little epsilon to the isPolymorphic comparison with the threshold. @davidbenjamin thoughts?

davidbenjamin commented 6 years ago

I'm missing something here... but the AF for most variants in this callset is less than 1 in 1000.

@ldgauthier In that case I'll have to come up with a better hand-waving explanation!

otherwise we can add a little epsilon to the isPolymorphic comparison with the threshold.

That seems reasonable.

ldgauthier commented 6 years ago

As per a recent meatspace conversation @davidbenjamin said he would throw together a quick PR for the epsilon. And maybe rerun the data?

davidbenjamin commented 6 years ago

AFCalculationResults::isPolymorphic() is returning true because the log10 of this tiny probability is still less than zero... If it's computationally beneficial to flush to zero then we can do that, but otherwise we can add a little epsilon to the isPolymorphic comparison with the threshold.

@ldgauthier Actually, I'm confused about this epsilon. It seems like AFCalculationResults::isPolymorphic() gets called with STANDARD_CONFIDENCE_FOR_CALLING as the threshold. @skwalker do you have an example of where this occurs that I could debug?

ldgauthier commented 6 years ago

Since we're comparing the GVCF mode results, STANDARD_CONFIDENCE_FOR_CALLING is set to zero.

I was debugging with:

java -jar $GATKjar HaplotypeCaller -ERC GVCF \
    -I /seq/picard_aggregation/C1710/G01-GEA-96-HI/v1/G01-GEA-96-HI.bam \
    -R $b37 -O homRefDiff.vcf -L 17:79612606 -new-qual -ip 300

davidbenjamin commented 6 years ago

I'm about to give you a PR for this. It cleaned up the output for that bam as expected.