We found a large number of PASS frameshift indels that looked suspicious in our pan-cancer germline callset (~37k samples) that are driving the top genes in some of our case-control associations. These indels all have the following properties:
They are flagged as PASS by VQSR and are PASS in ExAC; but examination in IGV shows that these variants are all present at extremely low allelic fractions. (These are all normal samples.)
These variants occur in the specific context of 7-base (and in one case, 8-base) homopolymer runs.
The variant we looked at in the meeting had a very low QD score, and Eric Banks's suggestion for the short-term is to filter our callset for low QD indels (i.e., remove all indels with QD<1). The hard filter will take care of most of these faulty variants, but at least two of the suspicious indels have QD scores > 1 (see VCF columns below). In the long run, it might be beneficial to catch these cases in the VQSR indel modeling.
1 33745932 . G GC 2175.55 PASS AC=131;AF=1.743e-03;AN=75168;BaseQRankSum=-1.540e-01;CCC=75168;ClippingRankSum=0.306;DP=886479;FS=0.000;GQ_MEAN=58.06;GQ_STDDEV=13.70;HWP=1.0000;InbreedingCoeff=-0.0041;MLEAC=86;MLEAF=1.144e-03;MQ=59.62;MQ0=0;MQRankSum=0.457;NCC=23;QD=0.55;ReadPosRankSum=-2.130e-01;VQSLOD=2.00;culprit=QD
1 40028015 . T TG 3661.09 PASS AC=143;AF=1.902e-03;AN=75202;BaseQRankSum=0.169;CCC=75202;ClippingRankSum=0.331;DP=1059106;FS=0.000;GQ_MEAN=61.43;GQ_STDDEV=19.66;HWP=1.0000;InbreedingCoeff=-0.0032;MLEAC=93;MLEAF=1.237e-03;MQ=59.79;MQ0=0;MQRankSum=0.331;NCC=6;QD=0.53;ReadPosRankSum=-1.760e-01;VQSLOD=2.03;culprit=QD
2 220504281 . T TG 2883 PASS AC=98;AF=1.303e-03;AN=75214;BaseQRankSum=0.095;CCC=75214;ClippingRankSum=0.040;DP=1139600;FS=0.000;GQ_MEAN=68.44;GQ_STDDEV=17.59;HWP=1.0000;InbreedingCoeff=-0.0022;MLEAC=57;MLEAF=7.578e-04;MQ=59.42;MQ0=0;MQRankSum=0.401;NCC=0;QD=0.44;ReadPosRankSum=-9.200e-02;VQSLOD=2.00;culprit=QD
2 207460860 . A AC 8874.57 PASS AC=204;AF=2.713e-03;AN=75206;BaseQRankSum=-2.320e-01;CCC=75206;ClippingRankSum=0.171;DP=1112871;FS=0.000;GQ_MEAN=66.72;GQ_STDDEV=20.20;HWP=1.0000;InbreedingCoeff=-0.0032;MLEAC=153;MLEAF=2.034e-03;MQ=59.71;MQ0=0;MQRankSum=0.511;NCC=4;QD=0.38;ReadPosRankSum=-3.000e-01;VQSLOD=2.06;culprit=FS
4 1843235 . C CG 6396.49 PASS AC=186;AF=2.473e-03;AN=75208;BaseQRankSum=0.036;CCC=75208;ClippingRankSum=0.395;DP=945515;FS=0.000;GQ_MEAN=61.89;GQ_STDDEV=11.82;HWP=1.0000;InbreedingCoeff=-0.0036;MLEAC=145;MLEAF=1.928e-03;MQ=59.45;MQ0=0;MQRankSum=0.316;NCC=3;QD=0.64;ReadPosRankSum=-1.110e-01;VQSLOD=2.24;culprit=QD
7 100245130 . T TG 1944.22 PASS AC=79;AF=1.050e-03;AN=75208;BaseQRankSum=-4.140e-01;CCC=75208;ClippingRankSum=-2.170e-01;DP=1163716;FS=0.000;GQ_MEAN=67.73;GQ_STDDEV=17.45;HWP=1.0000;InbreedingCoeff=-0.0025;MLEAC=52;MLEAF=6.914e-04;MQ=60.00;MQ0=0;MQRankSum=0.376;NCC=3;QD=0.84;ReadPosRankSum=-1.420e-01;VQSLOD=1.85;culprit=QD
10 103901076 . G GC 33042.2 PASS AC=500;AF=6.648e-03;AN=75214;BaseQRankSum=-1.710e-01;CCC=75214;ClippingRankSum=0.623;DP=1181015;FS=0.000;GQ_MEAN=66.17;GQ_STDDEV=17.62;HWP=0.3766;InbreedingCoeff=-0.0089;MLEAC=439;MLEAF=5.837e-03;MQ=59.45;MQ0=0;MQRankSum=0.416;NCC=0;QD=1.06;ReadPosRankSum=-4.190e-01;VQSLOD=2.35;culprit=QD
10 103907124 . T TC 2122.03 PASS AC=69;AF=9.175e-04;AN=75208;BaseQRankSum=-2.000e-01;CCC=75208;ClippingRankSum=0.276;DP=1451377;FS=0.000;GQ_MEAN=80.97;GQ_STDDEV=23.97;HWP=1.0000;InbreedingCoeff=-0.0016;MLEAC=46;MLEAF=6.116e-04;MQ=59.51;MQ0=0;MQRankSum=0.503;NCC=3;QD=0.32;ReadPosRankSum=-4.650e-01;VQSLOD=2.19;culprit=FS
10 105484097 . T TG 3059.59 PASS AC=158;AF=2.101e-03;AN=75212;BaseQRankSum=0.00;CCC=75212;ClippingRankSum=0.168;DP=928568;FS=0.000;GQ_MEAN=60.73;GQ_STDDEV=15.73;HWP=1.0000;InbreedingCoeff=-0.0036;MLEAC=109;MLEAF=1.449e-03;MQ=59.26;MQ0=0;MQRankSum=0.302;NCC=1;QD=0.52;ReadPosRankSum=-1.350e-01;VQSLOD=1.52;culprit=QD
12 96641028 . G GC 2486.52 PASS AC=122;AF=1.622e-03;AN=75210;BaseQRankSum=-2.190e-01;CCC=75210;ClippingRankSum=0.574;DP=1055608;FS=0.000;GQ_MEAN=65.01;GQ_STDDEV=15.38;HWP=1.0000;InbreedingCoeff=-0.0030;MLEAC=80;MLEAF=1.064e-03;MQ=59.45;MQ0=0;MQRankSum=0.306;NCC=2;QD=0.50;ReadPosRankSum=-1.290e-01;VQSLOD=2.18;culprit=QD
14 45369723 . T TG 12376.8 PASS AC=182;AF=2.420e-03;AN=75210;BaseQRankSum=-1.810e-01;CCC=75210;ClippingRankSum=0.00;DP=1041458;FS=0.000;GQ_MEAN=67.78;GQ_STDDEV=20.59;HWP=1.0000;InbreedingCoeff=-0.0033;MLEAC=121;MLEAF=1.609e-03;MQ=59.77;MQ0=0;MQRankSum=0.410;NCC=2;QD=1.24;ReadPosRankSum=-1.100e-01;VQSLOD=2.08;culprit=QD
19 8193948 . C CG 2846.04 PASS AC=102;AF=1.356e-03;AN=75204;BaseQRankSum=0.337;CCC=75204;ClippingRankSum=0.513;DP=1255955;FS=0.000;GQ_MEAN=76.78;GQ_STDDEV=24.97;HWP=1.0000;InbreedingCoeff=-0.0017;MLEAC=65;MLEAF=8.643e-04;MQ=59.81;MQ0=0;MQRankSum=0.514;NCC=5;QD=0.69;ReadPosRankSum=-1.050e-01;VQSLOD=2.07;culprit=QD
A lot of the training data for ExAC INDELs are low QD
About 35% of these training INDELs are multiallelic, compared with 47% of training INDELs with QD <= 2.
QD for SNPs has much smaller variance (and a lot more data)
For SNPs 14% overall are multiallelic with 13% of QD <= 2.0 SNPs being multiallelic.
So it's not a huge surprise that low QD indels pass because there's a fair amount of low QD training data. Given that a lot of the low QD indel training sites are multiallelic (with drastically different alt allele lengths in many cases), I wonder what the QD at those sites would be if we just used the alleles matching the truth data. This goes back to my dream of allele-based filtering in VQSR.
Yeah, this is more about how VQSR development, specifically how it determines the negative model. It's hard to force a negative model around low QD because we don't have much crappy data there, but the crappy data that is there manages to squeak by. I suspect an SVM approach with negative training data derived from hard filters might work better here.
MIGRATED FROM GATK3
@ldgauthier commented on Thu Mar 12 2015
From Grace Tiao:
We found a large number of PASS frameshift indels that looked suspicious in our pan-cancer germline callset (~37k samples) that are driving the top genes in some of our case-control associations. These indels all have the following properties:
The variant we looked at in the meeting had a very low QD score, and Eric Banks's suggestion for the short-term is to filter our callset for low QD indels (i.e., remove all indels with QD<1). The hard filter will take care of most of these faulty variants, but at least two of the suspicious indels have QD scores > 1 (see VCF columns below). In the long run, it might be beneficial to catch these cases in the VQSR indel modeling.
1 33745932 . G GC 2175.55 PASS AC=131;AF=1.743e-03;AN=75168;BaseQRankSum=-1.540e-01;CCC=75168;ClippingRankSum=0.306;DP=886479;FS=0.000;GQ_MEAN=58.06;GQ_STDDEV=13.70;HWP=1.0000;InbreedingCoeff=-0.0041;MLEAC=86;MLEAF=1.144e-03;MQ=59.62;MQ0=0;MQRankSum=0.457;NCC=23;QD=0.55;ReadPosRankSum=-2.130e-01;VQSLOD=2.00;culprit=QD 1 40028015 . T TG 3661.09 PASS AC=143;AF=1.902e-03;AN=75202;BaseQRankSum=0.169;CCC=75202;ClippingRankSum=0.331;DP=1059106;FS=0.000;GQ_MEAN=61.43;GQ_STDDEV=19.66;HWP=1.0000;InbreedingCoeff=-0.0032;MLEAC=93;MLEAF=1.237e-03;MQ=59.79;MQ0=0;MQRankSum=0.331;NCC=6;QD=0.53;ReadPosRankSum=-1.760e-01;VQSLOD=2.03;culprit=QD 2 220504281 . T TG 2883 PASS AC=98;AF=1.303e-03;AN=75214;BaseQRankSum=0.095;CCC=75214;ClippingRankSum=0.040;DP=1139600;FS=0.000;GQ_MEAN=68.44;GQ_STDDEV=17.59;HWP=1.0000;InbreedingCoeff=-0.0022;MLEAC=57;MLEAF=7.578e-04;MQ=59.42;MQ0=0;MQRankSum=0.401;NCC=0;QD=0.44;ReadPosRankSum=-9.200e-02;VQSLOD=2.00;culprit=QD 2 207460860 . A AC 8874.57 PASS AC=204;AF=2.713e-03;AN=75206;BaseQRankSum=-2.320e-01;CCC=75206;ClippingRankSum=0.171;DP=1112871;FS=0.000;GQ_MEAN=66.72;GQ_STDDEV=20.20;HWP=1.0000;InbreedingCoeff=-0.0032;MLEAC=153;MLEAF=2.034e-03;MQ=59.71;MQ0=0;MQRankSum=0.511;NCC=4;QD=0.38;ReadPosRankSum=-3.000e-01;VQSLOD=2.06;culprit=FS 4 1843235 . C CG 6396.49 PASS AC=186;AF=2.473e-03;AN=75208;BaseQRankSum=0.036;CCC=75208;ClippingRankSum=0.395;DP=945515;FS=0.000;GQ_MEAN=61.89;GQ_STDDEV=11.82;HWP=1.0000;InbreedingCoeff=-0.0036;MLEAC=145;MLEAF=1.928e-03;MQ=59.45;MQ0=0;MQRankSum=0.316;NCC=3;QD=0.64;ReadPosRankSum=-1.110e-01;VQSLOD=2.24;culprit=QD 7 100245130 . T TG 1944.22 PASS AC=79;AF=1.050e-03;AN=75208;BaseQRankSum=-4.140e-01;CCC=75208;ClippingRankSum=-2.170e-01;DP=1163716;FS=0.000;GQ_MEAN=67.73;GQ_STDDEV=17.45;HWP=1.0000;InbreedingCoeff=-0.0025;MLEAC=52;MLEAF=6.914e-04;MQ=60.00;MQ0=0;MQRankSum=0.376;NCC=3;QD=0.84;ReadPosRankSum=-1.420e-01;VQSLOD=1.85;culprit=QD 10 103901076 . G GC 33042.2 PASS AC=500;AF=6.648e-03;AN=75214;BaseQRankSum=-1.710e-01;CCC=75214;ClippingRankSum=0.623;DP=1181015;FS=0.000;GQ_MEAN=66.17;GQ_STDDEV=17.62;HWP=0.3766;InbreedingCoeff=-0.0089;MLEAC=439;MLEAF=5.837e-03;MQ=59.45;MQ0=0;MQRankSum=0.416;NCC=0;QD=1.06;ReadPosRankSum=-4.190e-01;VQSLOD=2.35;culprit=QD 10 103907124 . T TC 2122.03 PASS AC=69;AF=9.175e-04;AN=75208;BaseQRankSum=-2.000e-01;CCC=75208;ClippingRankSum=0.276;DP=1451377;FS=0.000;GQ_MEAN=80.97;GQ_STDDEV=23.97;HWP=1.0000;InbreedingCoeff=-0.0016;MLEAC=46;MLEAF=6.116e-04;MQ=59.51;MQ0=0;MQRankSum=0.503;NCC=3;QD=0.32;ReadPosRankSum=-4.650e-01;VQSLOD=2.19;culprit=FS 10 105484097 . T TG 3059.59 PASS AC=158;AF=2.101e-03;AN=75212;BaseQRankSum=0.00;CCC=75212;ClippingRankSum=0.168;DP=928568;FS=0.000;GQ_MEAN=60.73;GQ_STDDEV=15.73;HWP=1.0000;InbreedingCoeff=-0.0036;MLEAC=109;MLEAF=1.449e-03;MQ=59.26;MQ0=0;MQRankSum=0.302;NCC=1;QD=0.52;ReadPosRankSum=-1.350e-01;VQSLOD=1.52;culprit=QD 12 96641028 . G GC 2486.52 PASS AC=122;AF=1.622e-03;AN=75210;BaseQRankSum=-2.190e-01;CCC=75210;ClippingRankSum=0.574;DP=1055608;FS=0.000;GQ_MEAN=65.01;GQ_STDDEV=15.38;HWP=1.0000;InbreedingCoeff=-0.0030;MLEAC=80;MLEAF=1.064e-03;MQ=59.45;MQ0=0;MQRankSum=0.306;NCC=2;QD=0.50;ReadPosRankSum=-1.290e-01;VQSLOD=2.18;culprit=QD 14 45369723 . T TG 12376.8 PASS AC=182;AF=2.420e-03;AN=75210;BaseQRankSum=-1.810e-01;CCC=75210;ClippingRankSum=0.00;DP=1041458;FS=0.000;GQ_MEAN=67.78;GQ_STDDEV=20.59;HWP=1.0000;InbreedingCoeff=-0.0033;MLEAC=121;MLEAF=1.609e-03;MQ=59.77;MQ0=0;MQRankSum=0.410;NCC=2;QD=1.24;ReadPosRankSum=-1.100e-01;VQSLOD=2.08;culprit=QD 19 8193948 . C CG 2846.04 PASS AC=102;AF=1.356e-03;AN=75204;BaseQRankSum=0.337;CCC=75204;ClippingRankSum=0.513;DP=1255955;FS=0.000;GQ_MEAN=76.78;GQ_STDDEV=24.97;HWP=1.0000;InbreedingCoeff=-0.0017;MLEAC=65;MLEAF=8.643e-04;MQ=59.81;MQ0=0;MQRankSum=0.514;NCC=5;QD=0.69;ReadPosRankSum=-1.050e-01;VQSLOD=2.07;culprit=QD
@ldgauthier commented on Thu Mar 12 2015
Are the VQSR output graphs for the ExAC dataset available anywhere? I really want to know what the QD distribution and fit for the indels look like.
@eitanbanks commented on Fri Mar 13 2015
Thanks for looking into this. It's definitely a problem.
@ldgauthier commented on Tue Mar 17 2015
A lot of the training data for ExAC INDELs are low QD About 35% of these training INDELs are multiallelic, compared with 47% of training INDELs with QD <= 2.
QD for SNPs has much smaller variance (and a lot more data) For SNPs 14% overall are multiallelic with 13% of QD <= 2.0 SNPs being multiallelic.
So it's not a huge surprise that low QD indels pass because there's a fair amount of low QD training data. Given that a lot of the low QD indel training sites are multiallelic (with drastically different alt allele lengths in many cases), I wonder what the QD at those sites would be if we just used the alleles matching the truth data. This goes back to my dream of allele-based filtering in VQSR.
@vdauwera commented on Fri Dec 18 2015
@ldgauthier is this issue still useful to keep open, considering you're actively working on allele-specific methods?
@ldgauthier commented on Fri Dec 18 2015
Yeah, this is more about how VQSR development, specifically how it determines the negative model. It's hard to force a negative model around low QD because we don't have much crappy data there, but the crappy data that is there manages to squeak by. I suspect an SVM approach with negative training data derived from hard filters might work better here.
@vdauwera commented on Mon Nov 14 2016
@ldgauthier Now that we have allele-specific VQSR, do we still need this ticket?
@ldgauthier commented on Mon Nov 14 2016
Yes we do. Sad but true. It's something we've been focusing on with the new VQSR model work, though.
@vdauwera commented on Mon Nov 14 2016
Ah, ok. So keep it in the GATK3 repo then, for now?
@ldgauthier commented on Tue Nov 15 2016
Yes. I believe @lucidtronix is doing VQSR development in GATK3 and we'll port later.
@vdauwera commented on Mon Mar 20 2017
@ldgauthier @lucidtronix Any update on this since I heard VQSR got ported to GATK4?
@ldgauthier commented on Mon Mar 20 2017
It's unlikely the behavior has changed. For gnomad we used hard filters to address the problem, which is probably a good global recommendation.
On Mar 20, 2017 1:35 PM, "Geraldine Van der Auwera" < notifications@github.com> wrote:
@vdauwera commented on Mon Mar 20 2017
OK, that makes sense, thanks. Do you want me to migrate the issue to GATK4? Otherwise I'll just close it out here as WONTFIX.
@ldgauthier commented on Tue Mar 21 2017
Somewhere we need a record of updates to filtering best practices until we publish a new thing, so yeah, please migrate.
On Mar 20, 2017 6:36 PM, "Geraldine Van der Auwera" < notifications@github.com> wrote: