bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 353 forks source link

Joint calling anomalies #3520

Open DrMcStrange opened 3 years ago

DrMcStrange commented 3 years ago

I've come across some strange results while filtering variants from a GATK germline small variant calling run. The full run included 57 WGS and 11 WES samples. In one region on chromosome 1, the final VCF only gives genotypes for the WES - all WGS samples show as missing. Here's an example of a variant in the region:

chr1 21873401 rs756728010 T G 36.7 PASS AC=2;AF=0.091;AN=22;BaseQRankSum=-2;ClippingRankSum=1.68;DP=826;ExcessHet=3.2222;FS=24.888;InbreedingCoeff=-0.0967;MLEAC=2;MLEAF=0.091;MQ=60;MQ0=0;MQRankSum=0;QD=0.75;ReadPosRankSum=1.04;SOR=3.588;VQSLOD=-13.46;culprit=QD;ANN=G|missense_variant|MODERATE|HSPG2|ENSG00000142798|transcript|ENST00000374695.7|protein_coding|30/97|c.3767A>C|p.Asn1256Thr|3847/14327|3767/13176|1256/4391||,G|missense_variant|MODERATE|HSPG2|ENSG00000142798|transcript|ENST00000427897.1|non_stop_decay|4/5|c.329A>C|p.Asn110Thr|330/589|329/588|110/195||WARNING_TRANSCRIPT_NO_START_CODON,G|downstream_gene_variant|MODIFIER|HSPG2|ENSG00000142798|transcript|ENST00000498495.1|retained_intron||n.1077A>C|||||1077|,G|downstream_gene_variant|MODIFIER|HSPG2|ENSG00000142798|transcript|ENST00000480900.1|retained_intron||n.4113A>C|||||4113|;ac_exac_all=86;an_exac_all=120530;ac_adj_exac_afr=0;an_adj_exac_afr=10366;ac_adj_exac_amr=5;an_adj_exac_amr=11574;ac_adj_exac_eas=11;an_adj_exac_eas=8642;ac_adj_exac_fin=45;an_adj_exac_fin=6108;ac_adj_exac_nfe=19;an_adj_exac_nfe=66436;ac_adj_exac_oth=1;an_adj_exac_oth=894;ac_adj_exac_sas=5;an_adj_exac_sas=16510;num_exac_Het=86;num_exac_Hom=0;rs_ids=rs756728010;af_exac_all=0.00071352;af_adj_exac_afr=0;af_adj_exac_amr=0.000432;af_adj_exac_eas=0.0012729;af_adj_exac_fin=0.0073674;af_adj_exac_nfe=0.00028599;af_adj_exac_oth=0.0011186;af_adj_exac_sas=0.00030285;max_aaf_all=0.0073674 GT:AD:DP:GQ:PL .:0,0:.:.:. .:0,0:.:.:. 0/0:69,0:69:99:0,120,1800 .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. 0/0:106,0:106:99:0,120,1800 .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. 0/0:80,0:80:99:0,120,1800 0/0:94,0:94:99:0,120,1800 .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. 0/0:98,0:98:99:0,120,1800 0/0:63,0:63:99:0,120,1800 .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. .:0,0:.:.:. 0/1:25,3:28:42:42,0,1017 .:0,0:.:.:. .:0,0:.:.:. 0/1:19,2:21:12:12,0,762 .:0,0:.:.:. .:0,0:.:.:. 0/0:82,0:82:99:0,120,1800 .:0,0:.:.:. 0/0:101,0:101:99:0,120,1800 .:0,0:.:.:. .:0,0:.:.:. 0/0:79,0:79:99:0,120,1800 .:0,0:.:.:. .:0,0:.:.:.

Looking at the gVCF for one of the missing WGS samples, the variant is called as 0/0 with good coverage and quality:

chr1 21873399 . G . . END=21873439 GT:DP:GQ:MIN_DP:PL 0/0:38:91:34:0,91,1365

The gVCFs for the WES samples show the variant called as 0/1, but with quite uneven allele depth:

chr1 21873401 . T G, 34.64 . BaseQRankSum=-2.360;ClippingRankSum=-1.303;DP=31;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MMQ=60,60,60;MQ0=0;MQRankSum=0.000;RAW_MQandDP=111600,31;ReadPosRankSum=1.040 GT:AD:DP:GQ:PL:SB 0/1:25,3,0:28:42:42,0,1017,117,1026,1144:6,19,3,0

Oddly, a nearby variant is called in the same WES gVCF with similar stats, but doesn't show up in the main VCF:

chr1 21873392 . T C, 0 . BaseQRankSum=-1.244;ClippingRankSum=-1.067;DP=30;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;MMQ=60,60,60;MQ0=0;MQRankSum=0.000;RAW_MQandDP=108000,30;ReadPosRankSum=0.433 GT:AD:DP:GQ:PL:SB 0/0:29,1,0:30:47:0,47,1228,87,1231,1271:11,18,1,0

Looking at bcbio-nextgen-commands.log, the region in the main VCF for which the WES samples are missing corresponds to one of the calling regions used by the GATK commands:

[2021-01-19T06:42Z] rn246: unset JAVAHOME && export PATH=/share/apps/rosalind/bcbio-nextgen/1.2.4/anaconda/bin:"$PATH" && gatk --java-options '-Xms681m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/scratch/bennetm/bcbio-cataract/cataract-all-202012/work/bcbiotx/tmplracp9u' HaplotypeCaller -R /share/apps/rosalind/bcbio-nextgen/1.2.4/genomes/Hsapiens/hg38/seq/hg38.fa --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage -I /scratch/bennetm/bcbio-cataract/cataract-all-202012/work/align/CGW73_02/CGW73_02-sort.bam -L /scratch/bennetm/bcbio-cataract/cataract-all-202012/work/gatk-haplotype/chr1/CGW73_02-chr1_16562956143289408-regions.bed --interval-set-rule INTERSECTION --annotation ClippingRankSumTest --annotation DepthPerSampleHC --native-pair-hmm-threads 1 --emit-ref-confidence GVCF -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 60 -GQB 80 -ploidy 2 --output /scratch/bennetm/bcbio-cataract/cataract-all-202012/work/bcbiotx/tmplracp9u/CGW73_02-chr1_16562956_143289408.vcf.gz

So far I've only spotted one region where this has happened. There may be more, but most of the genome appears to be fine.

So, any ideas as to what might have happened here? There seem to be two anomalies: all the WGS calls failing to make it from the gVCFs to the main VCF, and inconsistent calling of variants in the WES samples. I'm at a loss to explain either of them.

Attached are the bcbio-nextgen.log and bcbio-nextgen-commands.log. Unfortunately I'd already deleted work before spotting this, so I don't have bcbio-nextgen-debug.log. The run had a couple of false starts due to silly mistakes (wrong BED files for the exomes), but I don't see any way those could have led to these results.

bcbio-nextgen.log bcbio-nextgen-commands.log

DrMcStrange commented 3 years ago

Okay, I've now checked with colleagues who have run GATK germline small variant calling runs, and their final VCFs are missing calls for the same region on chr1. These are runs with hg38 as reference - I've also checked some hg19 runs and they're unaffected.

We still have all the intermediate files for some runs. Where would be the best place to begin tracking this down?

DrMcStrange commented 2 years ago

In case anyone's interested, I've now done some testing on 1.2.9 and this issue appears to be resolved.

Still, I'm curious to know what caused this, and whether anyone else ran into the same issue with GATK joint calling. From my earlier testing, it affected joint calling (but not population calling) on at least versions 1.2.4 and 1.2.8.

DrMcStrange commented 2 years ago

Cancel that last comment - turns out I was looking at batch calling results, not joint calling. My joint calling tests are currently crashing, so the issue may still exist.