Closed lbergelson closed 5 years ago
One last thing I wanted to do was to compare the allele-specific QUAL values between the old model and the new. Not that we have any hard filters for AS_QD, but I want to know how it changes. Maybe @skwalker can take a look based on the data we already have?
@ldgauthier is this what you're looking for? (I only did sites where the values are different between the old model and the new)
That looks reasonable. Is this just for biallelics? My concern is that the values for biallelics won't have the same distribution as the (per-allele) values for multiallelics. That's going to involve some agonizing parsing in R unfortunately.
@ldgauthier: That was both biallelics and multiallelics. If you separate by color, you don't see much of a difference for QUAL and QD:
But there are definitely some differences with AS_QD (assuming my R script is right):
@davidbenjamin there are a couple of discrepancies I'd like you to look into so we understand what's going on. There's a multi-allelic SNP at 1:148004722 where the AS_QD for the T goes from 30.83 with the old model to 0.49 with your new model. The genotype with that allele called is pretty high quality: 0/2:8,0,98:106:31:2688,2712,3038,0,326,31 so something in the high-20s to low-30s range seems more reasonable to me. It's called in sample G01-GEA-30-HI with bam at /seq/picard_aggregation/C1710/G01-GEA-30-HI/v2/G01-GEA-30-HI.bam
I didn't look up the sample with the other allele called, but I can get that for you. And @skwalker is putting together a combined GVCF with that site for debugging.
There are also a handful of biallelics that look really weird. For example, a singleton biallelic site with variant genotype 0/1:194,33:227:99:383,0,5597
has QD=1.56 for old and new, old AS_QD=1.56 and new AS_QD
With very few exceptions (like one I looked into above circa May 11) I expect AS_QD == QD for biallelics and the fact that that's true at this site for the old model but not the new is disconcerting. It's mostly AS_QD.new higher than the other values, which makes me wonder if there's something strange being covered up by jitter, like Infs being returned by the model.
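The expectation that AS_QD == QD for biallelics can be turned into a quick screen. Here is a minimal sketch in Python, assuming per-site records with the hypothetical keys `pos`, `alts`, `qd` and `as_qd`. These names and the 0.05 tolerance are illustrative, not anything from GATK. It flags biallelic sites where AS_QD departs from QD:

```python
def flag_biallelic_mismatches(sites, tol=0.05):
    """Return biallelic sites whose AS_QD differs from QD by more than tol.

    Each site is a dict with hypothetical keys: 'pos', 'alts' (list of ALT
    alleles), 'qd' (float) and 'as_qd' (list of floats, one per ALT).
    """
    flagged = []
    for site in sites:
        if len(site["alts"]) != 1:
            continue  # multiallelic: AS_QD is per allele, so no equality is expected
        if abs(site["as_qd"][0] - site["qd"]) > tol:
            flagged.append(site)
    return flagged


# Values taken from the 1:104297205 table further down the thread
old = {"pos": 104297205, "alts": ["G"], "qd": 1.47, "as_qd": [1.47]}
new = {"pos": 104297205, "alts": ["G"], "qd": 1.56, "as_qd": [25.36]}
print([s["as_qd"] for s in flag_biallelic_mismatches([old, new])])
```

Only the new-model record is flagged here. A screen like this would not explain a discrepancy, but it would make values hidden by the cap and jitter easy to list.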
Hi @davidbenjamin ,
Here's the combined VCFs from running GenotypeGVCFs on just the first site (1:148004722) with and without the new qual argument:
/dsde/data/skwalker/qual_stuff/test/new_qual_1_148004722.vcf
/dsde/data/skwalker/qual_stuff/test/old_qual_1_148004722.vcf
and the results I get (slightly different from above):
CHROM | POS | ALT | REF | QUAL.old | QUAL.new | QD.old | QD.new | AS_QD.old | AS_QD.new |
---|---|---|---|---|---|---|---|---|---|
1 | 148004722 | G,T | C | 6968.13 | 6986 | 13.72 | 13.72 | 10.48,8.03 | 25.36,0.49 |
And the combined VCFs from running GenotypeGVCFs on just the second site (1:104297205) with and without the new qual argument:
/dsde/data/skwalker/qual_stuff/test/new_qual_1_104297205.vcf
/dsde/data/skwalker/qual_stuff/test/old_qual_1_104297205.vcf
and the results I get (again slightly different from above):
CHROM | POS | ALT | REF | QUAL.old | QUAL.new | QD.old | QD.new | AS_QD.old | AS_QD.new |
---|---|---|---|---|---|---|---|---|---|
1 | 104297205 | G | C | 333.16 | 354.06 | 1.47 | 1.56 | 1.47 | 25.36 |
The too-high AS_QD = 25.36 in both cases is coming from a line in new qual where, if finite numerical precision leads to a log probability greater than 0, we set the allele-specific qual to be infinite. Then in the AS_QD code, this infinity is replaced by 30 + jitter = 25.36, as Laura suspected.
That can't be too hard to fix.
I still need to figure out the second case where new qual's AS_QD seems low.
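To make the masking concrete, here is a rough Python sketch of the pathway described above. It is not the GATK code: the function names, the reading of the log probability as "allele absent", and the jitter value of -4.64 (picked so that 30 + jitter gives 25.36) are all assumptions for illustration.

```python
import math

MAX_QD = 30.0  # cap quoted in the thread ("30 + jitter")


def allele_qual(log10_p_absent):
    """Phred-scaled qual from a log10 probability (hypothetical helper).

    Mirrors the behaviour described above: a log probability above 0, which
    can only arise from rounding error, is reported as an infinite qual.
    """
    if log10_p_absent > 0.0:
        return math.inf
    return -10.0 * log10_p_absent


def as_qd(qual, depth, jitter):
    """Depth-normalised qual; infinite or over-cap values become MAX_QD + jitter."""
    qd = qual / depth
    if qd > MAX_QD:
        return MAX_QD + jitter
    return qd


# A well-behaved log probability gives a low AS_QD at depth 227...
print(round(as_qd(allele_qual(-35.406), 227, jitter=-4.64), 2))
# ...while a rounding error just above 0 is turned into the capped value.
print(round(as_qd(allele_qual(1e-15), 227, jitter=-4.64), 2))
```

The second call lands on the cap no matter what the reads say, which is why the jitter made the problem hard to spot in the plots.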
I have a branch that fixes the AS_QD for both of those sites. @ldgauthier It did turn out to be numerical stability in log space. @skwalker Could you re-run with /dsde/working/davidben/new-qual-october-2018/new-qual-10-30-2018.jar?
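For readers unfamiliar with the phrase, "numerical stability in log space" usually refers to problems like the one sketched below. This is a generic Python illustration of summing probabilities stored as log10 values, not the change made in the branch, which the thread does not show. Both function names are hypothetical.

```python
import math


def log10_sum_naive(log10_values):
    """Sum probabilities given as log10 values by leaving log space (unstable)."""
    return math.log10(sum(10.0 ** v for v in log10_values))


def log10_sum_stable(log10_values):
    """Same sum, factoring out the largest term so the leading term cannot underflow."""
    m = max(log10_values)
    if m == -math.inf:
        return -math.inf  # every probability is exactly zero
    return m + math.log10(sum(10.0 ** (v - m) for v in log10_values))


tiny = [-400.0, -400.0]  # probabilities far below the smallest double
print(log10_sum_stable(tiny))
```

With these inputs the naive version raises `ValueError`, because both terms underflow to 0.0 before the logarithm is taken, while the stable version returns -400 + log10(2).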
Good news. (Probably?) How do we know there aren't any other numerical instability issues lurking?
I have a good feeling about numerical instability from this point forward because:
@davidbenjamin I have to finish up some other work first, but then I will rerun with your new jar.
@davidbenjamin results with the new qual jar
@skwalker @ldgauthier should I submit a PR for this branch now?
@skwalker can we get some density contours on those (at least the first two)? It's hard to say how much of a discrepancy there is.
Closed by #5484.
People don't know to use new qual with GenotypeGVCFs, so they're wasting a lot of time running the less efficient old qual. There are also people encountering bugs in old qual (see https://github.com/broadinstitute/gatk/issues/4544). We should consider making new qual the default and deprecating old qual.
@ldgauthier @davidbenjamin Thoughts?