broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

ReblockGVCF - Homozygous reference genotypes must contain GQ or PL #7797

Closed LiviaMoura closed 2 years ago

LiviaMoura commented 2 years ago

Bug Report

Affected tool(s) or class(es)

ReblockGVCF

Affected version(s)

Description

We ran ReblockGVCF in 549 samples with the newest GATK (4.2.6.1). 8 of them returned the error similar to the message below

org.broadinstitute.hellbender.exceptions.GATKException: Exception thrown at chrM:1 [VC /tmp/scratch/prs-sabe-files/GRAR/2031812880_AJ.hard-filtered.gvcf.gz @ chrM:1 Q. of type=SYMBOLIC alleles=[G*, <NON_REF>] attr={END=1} GT=[[2031812880_AJ G*/G* DP 1691 AD 112,1579 {MIN_DP=1691, SQ=0}]] filters=weak_evidence

and right below, we could find in all of them Caused by: org.broadinstitute.hellbender.exceptions.UserException$BadInput: Bad input: Homozygous reference genotypes must contain GQ or PL. Both are missing for hom ref genotype at chrM:1

All the "failed samples" produced a broken output, in this case, missing the chrM (and the alt chr, such as HLA, chr1_alt etc)... It was weird because on WDL it returned as Success job...

We need all the samples with a proper output to run the JointGenotype pipeline with the Reblocked Dragen samples output.

Steps to reproduce

I'll share with you the chrM:1 from GVCF from a sample with no error chrM 1 . G <NON_REF> . PASS END=72 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:2441,2:2443:99:1613:0,120,1800:0,255,255:40,13

And now, the chrM:1 from a sample with the error chrM 1 . G <NON_REF> . weak_evidence END=1 GT:AD:DP:SQ:MIN_DP 0/0:112,1579:1691:0:1691

Expected behavior

No broken output

Actual behavior

Failing in a few samples, breaking the expected output.

droazen commented 2 years ago

@ldgauthier Is this expected behavior in the new 4.2.6.1 release?

LiviaMoura commented 2 years ago

@ldgauthier hey Laura... A new update regarding this "topic"... As I said before (https://github.com/broadinstitute/gatk/pull/7725), when we were working with the ReblockGVCF from the snapshot you sent to us (https://console.cloud.google.com/gcr/images/broad-dsde-methods/US/gatk_subset_dragen_allele_frac@sha256:f5e93bda2278f1c999bd9def027c6851eeb098736b47a93469c524863b46c21f/details) and the JointGenotype pipeline (no Gnarly) using GATK 4.2.5, we could complete the analysis.

As we received the instruction to use GATK lastest (4.2.6.1) we tried to run the entire pipeline using the latest one (since ReblockGVCF til JointGenotype)...Unfortunately, it seems that the latest ReblockGVCF hasn't the needed changes to work completely with the JoinGenotype Pipeline (without Gnarly), because we received the error below, once again. I think maybe we'll need to run Snapshop reblock with the newest GATK for the other steps. Please, let me know if you think it's safe to use this approach (snapshot Reblock - 4.2.3~, plus GATK 4.2.6.1 entire JG pipeline)

Maybe this info can help regarding this open issue..

18:19:49.888 INFO  GenomicsDBImport - Importing batch 1 with 50 samples
18:19:53.652 erro  NativeGenomicsDB - pid=15219 tid=15235 conflicting field description in the vid JSON and the VCF header of file: 20210421-006_stream
terminate called after throwing an instance of 'VCFAdapterException'
what():VCFAdapterException : Conflicting field length descriptors and/or field lengths in the vid JSON and VCF header for field LOD
Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar
LiviaMoura commented 2 years ago

Update

We tried Snapshop Reblock + GATK 4.2.6.1 JointGenotype pipeline, and It broke again. Same error as above at "GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.6.1" step....

=/

ldgauthier commented 2 years ago

Hi @LiviaMoura

I don't think I've ever seen mitochondrial calls from DRAGEN before. What version generated this data? Also how does the FORMAT header define SQ? I've never seen that before either.

LiviaMoura commented 2 years ago

Hi @ldgauthier

We have data that were generated from 3.6.3 and 3.8.4 (some tests on the latest 3.10)... This one is from 3.8.4 and the FORMAT from this gvcfs gives

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=ICNT,Number=2,Type=Integer,Description="Counts of INDEL informative reads based on the reference confidence model">
##FORMAT=<ID=MB,Number=4,Type=Integer,Description="Per-sample component statistics to detect mate bias">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PRI,Number=G,Type=Float,Description="Phred-scaled prior probabilities for genotypes">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias">
##FORMAT=<ID=SPL,Number=.,Type=Integer,Description="Normalized, Phred-scaled likelihoods for SNPs based on the reference confidence model">
##FORMAT=<ID=SQ,Number=A,Type=Float,Description="Somatic quality">
##DRAGENCommandLine=<ID=HashTableBuild,Version="SW: 01.003.044.3.8.4, HashTableVersion: 8",CommandLineOptions="/opt/edico/bin/dragen --build-hash-table true --enable-cnv true --ht-alt-aware-validate true
##DRAGENCommandLine=<ID=dragen,Version="SW: 05.021.609.3.9.5, HW: 05.021.609",Date="Wed Feb 09 21:30:31 UTC 2022",CommandLineOptions="--bam-input s3://cromwell-dragen-us-east-1/samples/validacaoDRAGEN/ba

About chrM, yeah... all of them have this info... like below

grep "^chrM" <name>.hard-filtered.gvcf |  head  
chrM    1   .   G   <NON_REF>   .   weak_evidence   END=1   GT:AD:DP:SQ:MIN_DP  0/0:112,1579:1691:0:1691
chrM    2   .   A   <NON_REF>   .   PASS    END=72  GT:AD:DP:SQ:MIN_DP  0/0:2504,2:2506:99:1712
chrM    73  .   A   G,<NON_REF> .   PASS    DP=2017;MQ=208.10;FractionInformativeReads=0.948    GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  1/1:98.13,0.00:0,1912,0:1.000,0.000:0,980,0:0,932,0:1912:0,0,756,1156:0,0,932,980

As I replied to you, in another section, I can share some gvcfs from NA samples (Dragen v3.6.3) with you. Just let me know and I'll e-mail the links

ldgauthier commented 2 years ago

"Somatic quality" is 98.13, huh? I wonder what that means. Is it Phred-scaled? Is that a percent accuracy? It's not in the header.

GATK can't joint call the DRAGEN mitochondrial data. I will reach out to our contacts at Illumina to see what they recommend.

LiviaMoura commented 2 years ago

It's a type of quality metrics they designed - quotes below -(https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/dragen-bio-it/Illumina-DRAGEN-Bio-IT-Platform-User-Guide-1000000141465-00.pdf).

About the chrM, thank you.

Anyway, I don't think the "VCFAdapterException" error is related with the chrM, because the JoinGenotype pipeline worked fine with Reblock(snap) + GATK 4.2.5, and they were there (of course we know it was generating wrong data, but we had results to open and check in HAIL)

You can use somatic quality (SQ) as the primary metric to describe the confidence with which the caller
made a somatic call. SQ is reported as a format field for the tumor sample. Variants with SQ score
below the SQ filter threshold are filtered out using the weak_evidence tag. To trade off sensitivity
against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and
higher thresholds produce a more conservative caller.
QUAL is not output in the somatic variant records. Instead, the confidence score is FORMAT/SQ.
##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">
The field is specific to the sample. For the tumor samples, it quantifies the evidence that a somatic
variant is present at a given locus.
If a normal sample is also available, the corresponding FORMAT/SQ value quantifies the evidence that
the normal sample is a homozygous reference at a given locus.
ldgauthier commented 2 years ago

Now that we understand the issue a little better, I opened #7840