broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

make newQual the default for GenotypeGVCFs #4614

Closed: lbergelson closed this issue 5 years ago

lbergelson commented 6 years ago

People don't know to use new qual with GenotypeGVCFs, so they're wasting a lot of time running the less efficient old qual. There are also people encountering bugs in old qual (see https://github.com/broadinstitute/gatk/issues/4544). We should consider making new qual the default and deprecating old qual.

@ldgauthier @davidbenjamin Thoughts?

ldgauthier commented 6 years ago

One last thing I wanted to do was to compare the allele-specific QUAL values between the old model and the new. Not that we have any hard filters for AS_QD, but I want to know how it changes. Maybe @skwalker can take a look based on the data we already have?

skwalker commented 6 years ago

@ldgauthier is this what you're looking for? (I only did sites where the values are different between the old model and the new)

[Images: qd_differences, qual_differences]

ldgauthier commented 6 years ago

That looks reasonable. Is this just for biallelics? My concern is that the values for biallelics won't have the same distribution as the (per-allele) values for multiallelics. That's going to involve some agonizing parsing in R unfortunately.

skwalker commented 6 years ago

@ldgauthier: That was both biallelics and multiallelics. If you separate by color, you don't see much of a difference for QUAL and QD: [Images: qual_differences, qd_differences]

But there are definitely some differences with AS_QD (assuming my R script is right): [Image: as_qd_differences]

ldgauthier commented 6 years ago

@davidbenjamin there are a couple of discrepancies I'd like you to look into so we understand what's going on. There's a multi-allelic SNP at 1:148004722 where the AS_QD for the T goes from 30.83 with the old model to 0.49 with your new model. The genotype with that allele called is pretty high quality: 0/2:8,0,98:106:31:2688,2712,3038,0,326,31, so something in the high-20s to low-30s range seems more reasonable to me. It's called in sample G01-GEA-30-HI with bam at /seq/picard_aggregation/C1710/G01-GEA-30-HI/v2/G01-GEA-30-HI.bam. I didn't look up the sample with the other allele called, but I can get that for you. And @skwalker is putting together a combined GVCF with that site for debugging.

There are also a handful of biallelics that look really weird. For example, a singleton biallelic site with variant genotype 0/1:194,33:227:99:383,0,5597 has QD=1.56 for old and new, old AS_QD=1.56, and new AS_QD=[value shown in image]. With very few exceptions (like the one I looked into above circa May 11), I expect AS_QD == QD for biallelics, and the fact that that's true at this site for the old model but not the new is disconcerting. It's mostly AS_QD.new being higher than the other values, which makes me wonder if there's something strange being covered up by jitter, like Infs being returned by the model.
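The AS_QD == QD expectation for biallelics can be checked mechanically. The following is a minimal sketch, not GATK code: it assumes QD is QUAL divided by the summed allele depths (AD) of the variant sample, that the genotype string uses the field order GT:AD:DP:GQ:PL, and that the QUAL and AS_QD values reported for 1:104297205 later in this thread belong to the same site. All function names are hypothetical.

```python
def depth_from_sample(sample_field):
    """Sum the AD values of a GT:AD:DP:GQ:PL sample string (assumed field order)."""
    fields = sample_field.split(":")
    allele_depths = [int(x) for x in fields[1].split(",")]
    return sum(allele_depths)


def qd(qual, depth):
    """Quality by depth, rounded to two decimals as in the tables in this thread."""
    return round(qual / depth, 2)


def biallelic_as_qd_mismatch(qd_value, as_qd_value, tolerance=0.01):
    """Flag a biallelic site whose AS_QD differs from its QD."""
    return abs(qd_value - as_qd_value) > tolerance


depth = depth_from_sample("0/1:194,33:227:99:383,0,5597")
qd_old = qd(333.16, depth)  # QUAL.old reported for 1:104297205
qd_new = qd(354.06, depth)  # QUAL.new reported for 1:104297205

print(depth, qd_old, qd_new)                     # 227 1.47 1.56
print(biallelic_as_qd_mismatch(qd_old, 1.47))    # False: old AS_QD agrees with QD
print(biallelic_as_qd_mismatch(qd_new, 25.36))   # True: new AS_QD does not
```

Running a check like this over every biallelic record would list the sites where the new model's AS_QD departs from QD, without having to read them off a plot.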

skwalker commented 6 years ago

Hi @davidbenjamin ,

Here's the combined VCFs from running GenotypeGVCFs on just the first site (1:148004722) with and without the new qual argument:

/dsde/data/skwalker/qual_stuff/test/new_qual_1_148004722.vcf
/dsde/data/skwalker/qual_stuff/test/old_qual_1_148004722.vcf

and the results I get (slightly different from above):

CHROM  POS        ALT  REF  QUAL.old  QUAL.new  QD.old  QD.new  AS_QD.old   AS_QD.new
1      148004722  G,T  C    6968.13   6986      13.72   13.72   10.48,8.03  25.36,0.49

And the combined VCFs from running GenotypeGVCFs on just the second site (1:104297205) with and without the new qual argument:

/dsde/data/skwalker/qual_stuff/test/new_qual_1_104297205.vcf
/dsde/data/skwalker/qual_stuff/test/old_qual_1_104297205.vcf

and the results I get (again slightly different from above):

CHROM  POS        ALT  REF  QUAL.old  QUAL.new  QD.old  QD.new  AS_QD.old  AS_QD.new
1      104297205  G    C    333.16    354.06    1.47    1.56    1.47       25.36

davidbenjamin commented 6 years ago

The too-high AS_QD = 25.36 in both cases is coming from a line in new qual where, if finite numerical precision leads to a log probability greater than 0, we set the allele-specific qual to be infinite. Then in the AS_QD code, this infinity is replaced by 30 + jitter = 25.36, as Laura suspected. That can't be too hard to fix.
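To illustrate the mechanism described above, here is a hedged Python sketch rather than the actual GATK implementation. It assumes the allele-specific qual is the Phred-scaled probability that the allele is absent, computed from the log10 probability that it is present. The cap threshold of 35 and the jitter standard deviation of 3 are assumptions; the thread only states that infinity becomes "30 + jitter". Function names are hypothetical.

```python
import math
import random

LN10 = math.log(10.0)


def allele_qual(log10_p_present):
    """Phred-scaled -10*log10(P(allele absent)) from log10 P(allele present)."""
    if log10_p_present >= 0.0:
        # Rounding error pushed P(present) to 1 or above, so P(absent) <= 0
        # and the quality is unbounded.
        return math.inf
    # -expm1(x) evaluates 1 - e^x accurately when x is close to 0.
    p_absent = -math.expm1(log10_p_present * LN10)
    return -10.0 * math.log10(p_absent)


def as_qd(allele_qual_value, depth, rng,
          max_before_fixing=35.0, ideal=30.0, sigma=3.0):
    """Quality by depth with implausibly high values replaced by ideal + jitter.

    The threshold and sigma defaults are assumed for illustration.
    """
    raw = allele_qual_value / depth
    if raw <= max_before_fixing:
        return raw
    return ideal + rng.gauss(0.0, sigma)


rng = random.Random(0)
broken_qual = allele_qual(1e-15)        # log10 probability slightly above 0
masked = as_qd(broken_qual, 227, rng)   # infinity hidden behind a finite value
print(broken_qual, math.isfinite(masked))  # inf True
```

Because the jittered value is finite and looks plausible, the infinity is invisible in the output VCF, which is why it only showed up as AS_QD disagreeing with QD at a biallelic site.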

I still need to figure out the second case where new qual's AS_QD seems low.

davidbenjamin commented 5 years ago

I have a branch that fixes the AS_QD for both of those sites. @ldgauthier It did turn out to be numerical stability in log space. @skwalker Could you re-run with /dsde/working/davidben/new-qual-october-2018/new-qual-10-30-2018.jar?

ldgauthier commented 5 years ago

Good news. (Probably?) How do we know there aren't any other numerical instability issues lurking?

davidbenjamin commented 5 years ago

I have a good feeling about numerical instability from this point forward because:

skwalker commented 5 years ago

@davidbenjamin I have to finish up some other work first, but then I will rerun with your new jar.

skwalker commented 5 years ago

@davidbenjamin results with the new qual jar:

[Three plots attached as images]

davidbenjamin commented 5 years ago

@skwalker @ldgauthier should I submit a PR for this branch now?

ldgauthier commented 5 years ago

@skwalker can we get some density contours on those (at least the first two)? It's hard to say how much of a discrepancy there is.

skwalker commented 5 years ago

[Images: as_qd_differences, qd_differences, qual_differences]

davidbenjamin commented 5 years ago

Closed by #5484.