broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 591 forks source link

Indels with QD < 1.0 still pass VQSR #2508

Open vdauwera opened 7 years ago

vdauwera commented 7 years ago

MIGRATED FROM GATK3

@ldgauthier commented on Thu Mar 12 2015

From Grace Tiao:

We found a large number of PASS frameshift indels that looked suspicious in our pan-cancer germline callset (~37k samples) that are driving the top genes in some of our case-control associations. These indels all have the following properties:

  1. They are flagged as PASS by VQSR and are PASS in ExAC; but examination in IGV shows that these variants are all present at extremely low allelic fractions. (These are all normal samples.)
  2. These variants occur in the specific context of 7-base (and in one case, 8-base) homopolymer runs.

The variant we looked at in the meeting had a very low QD score, and Eric Banks's suggestion for the short-term is to filter our callset for low QD indels (i.e., remove all indels with QD<1). The hard filter will take care of most of these faulty variants, but at least two of the suspicious indels have QD scores > 1 (see VCF columns below). In the long run, it might be beneficial to catch these cases in the VQSR indel modeling.

1 33745932 . G GC 2175.55 PASS AC=131;AF=1.743e-03;AN=75168;BaseQRankSum=-1.540e-01;CCC=75168;ClippingRankSum=0.306;DP=886479;FS=0.000;GQ_MEAN=58.06;GQ_STDDEV=13.70;HWP=1.0000;InbreedingCoeff=-0.0041;MLEAC=86;MLEAF=1.144e-03;MQ=59.62;MQ0=0;MQRankSum=0.457;NCC=23;QD=0.55;ReadPosRankSum=-2.130e-01;VQSLOD=2.00;culprit=QD 1 40028015 . T TG 3661.09 PASS AC=143;AF=1.902e-03;AN=75202;BaseQRankSum=0.169;CCC=75202;ClippingRankSum=0.331;DP=1059106;FS=0.000;GQ_MEAN=61.43;GQ_STDDEV=19.66;HWP=1.0000;InbreedingCoeff=-0.0032;MLEAC=93;MLEAF=1.237e-03;MQ=59.79;MQ0=0;MQRankSum=0.331;NCC=6;QD=0.53;ReadPosRankSum=-1.760e-01;VQSLOD=2.03;culprit=QD 2 220504281 . T TG 2883 PASS AC=98;AF=1.303e-03;AN=75214;BaseQRankSum=0.095;CCC=75214;ClippingRankSum=0.040;DP=1139600;FS=0.000;GQ_MEAN=68.44;GQ_STDDEV=17.59;HWP=1.0000;InbreedingCoeff=-0.0022;MLEAC=57;MLEAF=7.578e-04;MQ=59.42;MQ0=0;MQRankSum=0.401;NCC=0;QD=0.44;ReadPosRankSum=-9.200e-02;VQSLOD=2.00;culprit=QD 2 207460860 . A AC 8874.57 PASS AC=204;AF=2.713e-03;AN=75206;BaseQRankSum=-2.320e-01;CCC=75206;ClippingRankSum=0.171;DP=1112871;FS=0.000;GQ_MEAN=66.72;GQ_STDDEV=20.20;HWP=1.0000;InbreedingCoeff=-0.0032;MLEAC=153;MLEAF=2.034e-03;MQ=59.71;MQ0=0;MQRankSum=0.511;NCC=4;QD=0.38;ReadPosRankSum=-3.000e-01;VQSLOD=2.06;culprit=FS 4 1843235 . C CG 6396.49 PASS AC=186;AF=2.473e-03;AN=75208;BaseQRankSum=0.036;CCC=75208;ClippingRankSum=0.395;DP=945515;FS=0.000;GQ_MEAN=61.89;GQ_STDDEV=11.82;HWP=1.0000;InbreedingCoeff=-0.0036;MLEAC=145;MLEAF=1.928e-03;MQ=59.45;MQ0=0;MQRankSum=0.316;NCC=3;QD=0.64;ReadPosRankSum=-1.110e-01;VQSLOD=2.24;culprit=QD 7 100245130 . T TG 1944.22 PASS AC=79;AF=1.050e-03;AN=75208;BaseQRankSum=-4.140e-01;CCC=75208;ClippingRankSum=-2.170e-01;DP=1163716;FS=0.000;GQ_MEAN=67.73;GQ_STDDEV=17.45;HWP=1.0000;InbreedingCoeff=-0.0025;MLEAC=52;MLEAF=6.914e-04;MQ=60.00;MQ0=0;MQRankSum=0.376;NCC=3;QD=0.84;ReadPosRankSum=-1.420e-01;VQSLOD=1.85;culprit=QD 10 103901076 . G GC 33042.2 PASS AC=500;AF=6.648e-03;AN=75214;BaseQRankSum=-1.710e-01;CCC=75214;ClippingRankSum=0.623;DP=1181015;FS=0.000;GQ_MEAN=66.17;GQ_STDDEV=17.62;HWP=0.3766;InbreedingCoeff=-0.0089;MLEAC=439;MLEAF=5.837e-03;MQ=59.45;MQ0=0;MQRankSum=0.416;NCC=0;QD=1.06;ReadPosRankSum=-4.190e-01;VQSLOD=2.35;culprit=QD 10 103907124 . T TC 2122.03 PASS AC=69;AF=9.175e-04;AN=75208;BaseQRankSum=-2.000e-01;CCC=75208;ClippingRankSum=0.276;DP=1451377;FS=0.000;GQ_MEAN=80.97;GQ_STDDEV=23.97;HWP=1.0000;InbreedingCoeff=-0.0016;MLEAC=46;MLEAF=6.116e-04;MQ=59.51;MQ0=0;MQRankSum=0.503;NCC=3;QD=0.32;ReadPosRankSum=-4.650e-01;VQSLOD=2.19;culprit=FS 10 105484097 . T TG 3059.59 PASS AC=158;AF=2.101e-03;AN=75212;BaseQRankSum=0.00;CCC=75212;ClippingRankSum=0.168;DP=928568;FS=0.000;GQ_MEAN=60.73;GQ_STDDEV=15.73;HWP=1.0000;InbreedingCoeff=-0.0036;MLEAC=109;MLEAF=1.449e-03;MQ=59.26;MQ0=0;MQRankSum=0.302;NCC=1;QD=0.52;ReadPosRankSum=-1.350e-01;VQSLOD=1.52;culprit=QD 12 96641028 . G GC 2486.52 PASS AC=122;AF=1.622e-03;AN=75210;BaseQRankSum=-2.190e-01;CCC=75210;ClippingRankSum=0.574;DP=1055608;FS=0.000;GQ_MEAN=65.01;GQ_STDDEV=15.38;HWP=1.0000;InbreedingCoeff=-0.0030;MLEAC=80;MLEAF=1.064e-03;MQ=59.45;MQ0=0;MQRankSum=0.306;NCC=2;QD=0.50;ReadPosRankSum=-1.290e-01;VQSLOD=2.18;culprit=QD 14 45369723 . T TG 12376.8 PASS AC=182;AF=2.420e-03;AN=75210;BaseQRankSum=-1.810e-01;CCC=75210;ClippingRankSum=0.00;DP=1041458;FS=0.000;GQ_MEAN=67.78;GQ_STDDEV=20.59;HWP=1.0000;InbreedingCoeff=-0.0033;MLEAC=121;MLEAF=1.609e-03;MQ=59.77;MQ0=0;MQRankSum=0.410;NCC=2;QD=1.24;ReadPosRankSum=-1.100e-01;VQSLOD=2.08;culprit=QD 19 8193948 . C CG 2846.04 PASS AC=102;AF=1.356e-03;AN=75204;BaseQRankSum=0.337;CCC=75204;ClippingRankSum=0.513;DP=1255955;FS=0.000;GQ_MEAN=76.78;GQ_STDDEV=24.97;HWP=1.0000;InbreedingCoeff=-0.0017;MLEAC=65;MLEAF=8.643e-04;MQ=59.81;MQ0=0;MQRankSum=0.514;NCC=5;QD=0.69;ReadPosRankSum=-1.050e-01;VQSLOD=2.07;culprit=QD


@ldgauthier commented on Thu Mar 12 2015

Are the VQSR output graphs for the ExAC dataset available anywhere? I really want to know what the QD distribution and fit for the indels look like.


@eitanbanks commented on Fri Mar 13 2015

Thanks for looking into this. It's definitely a problem.


@ldgauthier commented on Tue Mar 17 2015

A lot of the training data for ExAC INDELs are low QD image About 35% of these training INDELs are multiallelic, compared with 47% of training INDELs with QD <= 2.

QD for SNPs has much smaller variance (and a lot more data) image For SNPs 14% overall are multiallelic with 13% of QD <= 2.0 SNPs being multiallelic.

So it's not a huge surprise that low QD indels pass because there's a fair amount of low QD training data. Given that a lot of the low QD indel training sites are multiallelic (with drastically different alt allele lengths in many cases), I wonder what the QD at those sites would be if we just used the alleles matching the truth data. This goes back to my dream of allele-based filtering in VQSR.


@vdauwera commented on Fri Dec 18 2015

@ldgauthier is this issue still useful to keep open, considering you're actively working on allele-specific methods?


@ldgauthier commented on Fri Dec 18 2015

Yeah, this is more about how VQSR development, specifically how it determines the negative model. It's hard to force a negative model around low QD because we don't have much crappy data there, but the crappy data that is there manages to squeak by. I suspect an SVM approach with negative training data derived from hard filters might work better here.


@vdauwera commented on Mon Nov 14 2016

@ldgauthier Now that we have allele-specific VQSR, do we still need this ticket?


@ldgauthier commented on Mon Nov 14 2016

Yes we do. Sad but true. It's something we've been focusing on with the new VQSR model work, though.


@vdauwera commented on Mon Nov 14 2016

Ah, ok. So keep it in the GATK3 repo then, for now?


@ldgauthier commented on Tue Nov 15 2016

Yes. I believe @lucidtronix is doing VQSR development in GATK3 and we'll port later.


@vdauwera commented on Mon Mar 20 2017

@ldgauthier @lucidtronix Any update on this since I heard VQSR got ported to GATK4?


@ldgauthier commented on Mon Mar 20 2017

It's unlikely the behavior has changed. For gnomad we used hard filters to address the problem, which is probably a good global recommendation.

On Mar 20, 2017 1:35 PM, "Geraldine Van der Auwera" < notifications@github.com> wrote:

@ldgauthier https://github.com/ldgauthier @lucidtronix https://github.com/lucidtronix Any update on this since I heard VQSR got ported to GATK4?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gsa-unstable/issues/868#issuecomment-287837505, or mute the thread https://github.com/notifications/unsubscribe-auth/AGRhdLRwdezIkmt3uPqIABWLggVjRN3yks5rnrjegaJpZM4Dt4t7 .


@vdauwera commented on Mon Mar 20 2017

OK, that makes sense, thanks. Do you want me to migrate the issue to GATK4? Otherwise I'll just close it out here as WONTFIX.


@ldgauthier commented on Tue Mar 21 2017

Somewhere we need a record of updates to filtering best practices until we publish a new thing, so yeah, please migrate.

On Mar 20, 2017 6:36 PM, "Geraldine Van der Auwera" < notifications@github.com> wrote:

OK, that makes sense, thanks. Do you want me to migrate the issue to GATK4? Otherwise I'll just close it out here as WONTFIX.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gsa-unstable/issues/868#issuecomment-287919609, or mute the thread https://github.com/notifications/unsubscribe-auth/AGRhdIeoOhkvNSNkC_XpxSZTt-kPdFDMks5rnv9RgaJpZM4Dt4t7 .

ldgauthier commented 6 years ago

As I set about updating the exome joint calling workflow over the next few days this is probably worth revisiting.