broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0
176 stars 88 forks source link

Handle when DP is missing (depth displayed as 32,000) #2355

Open jxchong opened 2 years ago

jxchong commented 2 years ago

Describe the bug Read depth for a variant is displayed in seqr as 32000 when per-sample DP is missing (e.g. "." instead of a value). To be clear, this is a bit of an edge case but I guess you might want to consider catching such errors rather than displaying a number that I assume comes from importing the null/missing DP. Might be worth throwing a warning when trying to import an affected VCF and coding something so it doesn't display read depth at 32000 without explanation?

Background: For a period of time in ~2016, there was a version of GATK that had a bug* that resulted in occasionally in the per-sample DP field being left empty even though a genotype existed and had GQ and AD values. We recently wanted to use seqr to analyze an old VCF that happened to be generated by that buggy version of GATK and it turns on read depth is reported as 32000 for such values.

The specifics of the bug aren't super relevant to seqr so I didn't spend a lot of time digging up the details. I can't find the exact issue any longer because it's in the archived GATK forums/issues. Here's a thread that references it https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2016-08-11-2016-04-07/7412-Best-strategy-to-fix-the-Haplotype-Caller-GenotypeGVCF-missing-DP-field-bug

Link to page(s) where bug is occurring N/A our own instance

Scope of the bug If the project contains an affected VCF (per-sample DP == ".")

Screenshots If applicable, add screenshots to help explain your problem.

Screen Shot 2021-12-16 at 2 52 51 PM
hanars commented 2 years ago

it turns on read depth is reported as 32000 for such values.

So if the GATK VCF has a DP of 32000, that is what seqr is going to show. I don't think it is a good use of our teams time to invest effort in catching an edge case that occurs in old, buggy VCFs. This is something you can quite easily handle on your own in hail by preprocessing the VCF to drop these values. seqr should handle true nulls for DP just fine but simply not displaying anything.

jxchong commented 2 years ago

The VCF has "." in the per-sample DP field, not 32,000.

hanars commented 2 years ago

Ah I misunderstood, I'll reopen the issue as a feature request to handle '.' values for DP, as this is still requesting new functionality as current GATK VCFs have never had these values. I am curious, then, where the value 32000 is coming from?

I still think the fastest solution for you would be to drop those values, as the person who works on the pipeline is out on paternity leave so this is unlikely to be fixed anytime soon

jxchong commented 2 years ago

Thanks, agree we don't need an immediate solution to move forward but wanted to report it to you since it's unexpected behavior.

Here's a snippet of the VCF.

tabix xxxx.vcf.gz 8:118830690-118830691 8 118830690 . A G 375.19 PASS AC=1;AF=0.25;AN=4;BaseQRankSum=-1.485;DP=70;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.25;MQ=60;MQRankSum=0;QD=11.72;ReadPosRankSum=-0.452;SOR=0.512GT:AD:DP:GQ:PL 0/0:37,0:37:99:0,108,1266 0/1:17,15:.:99:405,0,494

jxchong commented 2 years ago

Just a thought, 32000 seems awfully suspiciously like a number that might be the max supported value for some sort of integer variable.

hanars commented 2 years ago

I did fine where in the pipeline we are setting 32000 as the default: https://github.com/broadinstitute/hail-elasticsearch-pipelines/blob/master/luigi_pipeline/lib/model/seqr_mt_schema.py#L282