jjfarrell commented 6 years ago

Bug Report

Affected tool(s) or class(es)

CNNScoreVariant

Affected version(s)

4.0.7.0

Description

For 1,297,033 variant sites, the CNN_1D=-16.118.

Steps to reproduce

/run_cnn.sh adsp-5k.hg38.GATK.aws-batch_SNP_INDEL.chr22.4794samples.g.vcf.gz Using GATK jar /share/pkg/gatk/4.0.7.0/install/bin/gatk-package-4.0.7.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /share/pkf/adsp-5k.hg38.GATK.aws-batch_SNP_INDEL.chr22.4794samples.g.vcf.gz -R /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -O cnn/adsp-5k.hg38.GATK.aws-batch_SNP_INDEL.chr22.4794samples.g.vcf.gz 11:29:43.339 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/pkg/gatk/4.0.7.0/install/bin/gatk-package-4.0.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so 11:29:43.467 INFO CNNScoreVariants - ------------------------------------------------------------ 11:29:43.467 INFO CNNScoreVariants - The Genome Analysis Toolkit (GATK) v4.0.7.0 11:29:43.467 INFO CNNScoreVariants - For support and documentation go to https://software.broadinstitute.org/gatk/ 11:29:43.468 INFO CNNScoreVariants - Executing as farrell@scc-hadoop.bu.edu on Linux v2.6.32-696.28.1.el6.x86_64 amd64 11:29:43.468 INFO CNNScoreVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_151-b12 11:29:43.469 INFO CNNScoreVariants - Start Date/Time: August 13, 2018 11:29:43 AM UTC 11:29:43.469 INFO CNNScoreVariants - ------------------------------------------------------------ 11:29:43.469 INFO CNNScoreVariants - ------------------------------------------------------------ 11:29:43.469 INFO CNNScoreVariants - HTSJDK Version: 2.16.0 11:29:43.469 INFO CNNScoreVariants - Picard Version: 2.18.7 11:29:43.469 INFO CNNScoreVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2 11:29:43.470 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false 11:29:43.470 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true 11:29:43.470 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false 11:29:43.470 INFO CNNScoreVariants - Deflater: IntelDeflater 11:29:43.470 INFO CNNScoreVariants - Inflater: IntelInflater 11:29:43.470 INFO CNNScoreVariants - GCS max retries/reopens: 20 11:29:43.470 INFO CNNScoreVariants - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes 11:29:43.470 WARN CNNScoreVariants -

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: CNNScoreVariants is an EXPERIMENTAL tool and should not be used for production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

11:29:43.470 INFO CNNScoreVariants - Initializing engine 11:29:44.086 INFO FeatureManager - Using codec VCFCodec to read file file:///restricted/projectnb/casa/wgs.hg38/sv/gatk.cnn/vcf/adsp-5k.hg38.GATK.aws-batch_SNP_INDEL.chr22.4794samples.g.vcf.gz 11:29:44.281 INFO CNNScoreVariants - Done initializing engine 11:29:52.451 INFO CNNScoreVariants - Using key:CNN_1D for CNN architecture:/tmp/farrell/1d_cnn_mix_train_full_bn.3573521081782697200.json and weights:/tmp/farrell/1d_cnn_mix_train_full_bn.1075881893743930029.hd5 11:29:54.959 INFO ProgressMeter - Starting traversal 11:29:54.960 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute 11:30:05.581 INFO ProgressMeter - chr22:10585117 0.2 5000 28251.2 11:30:16.545 INFO ProgressMeter - chr22:10652425 0.4 9000 25018.5 11:30:27.526 INFO ProgressMeter - chr22:10695391 0.5 14000 25793.8

Expected behavior

CNN_1D should have a range of values to indicate the log odds of the site being a true variant. When run on another VCF with just 1 subject, a range of values occurred. The VCF with just one CNN_1D value has 4794 subjects.

Actual behavior

The same CNN_1D value (-16.118) was assigned to both PASS and other tranches sites.

lucidtronix commented 6 years ago

@jjfarrell The CNNScoreVariants tool is designed for single-sample filtering and the model has only been trained on annotations from single sample runs. Does the VCF contain these info-level annotations: 'MQ', 'DP', 'SOR', 'FS', 'QD', 'MQRankSum', 'ReadPosRankSum'? It is possible that for a large cohort those annotation values will be very different from the single sample values the model was trained with, leading to the large negative LOD. If you can share the VCF or a subset of the VCF we can take a closer look.

jjfarrell commented 6 years ago

@lucidtronix

Thanks for the explanation. Here are a few lines with the FILTER and INFO fields. Let me know if you need a larger subset. I would guess that since DP is the total depth for 5000 subjects instead of just 1 that would likely be the culprit. Maybe average depth across samples could be used for a replacement variable when there are multiple samples.

VQSRTrancheSNP99.70to99.80      AC=8;AF=0.0008391;AN=9532;BaseQRankSum=-0.915;CNN_1D=-16.118;ClippingRankSum=0;DP=177452;ExcessHet=3.0267;FS=6.543;InbreedingCoeff=-0.002;MLEAC=8;MLEAF=0.0008391;MQ=71.41;MQRankSum=0.005;NEGATIVE_TRAIN_SITE;QD=10.23;ReadPosRankSum=0.483;SOR=1.039;VQSLOD=-3.08;culprit=MQRankSum
VQSRTrancheSNP99.70to99.80      AC=62,81;AF=0.006477,0.008462;AN=9570;BaseQRankSum=0.398;CNN_1D=-16.118;ClippingRankSum=0;DP=196764;ExcessHet=2.802;FS=0;InbreedingCoeff=-0.0026;MLEAC=61,67;MLEAF=0.006373,0.007;MQ=68.23;MQRankSum=-0.961;NEGATIVE_TRAIN_SITE;QD=3.96;ReadPosRankSum=-0.318;SOR=0.666;VQSLOD=-5.206;culprit=MQRankSum
VQSRTrancheSNP99.70to99.80      AC=117;AF=0.012;AN=9574;BaseQRankSum=0;CNN_1D=-16.118;ClippingRankSum=0;DP=203481;ExcessHet=1.8042;FS=0.593;InbreedingCoeff=0.0034;MLEAC=117;MLEAF=0.012;MQ=55.51;MQRankSum=1.32;NEGATIVE_TRAIN_SITE;QD=6.25;ReadPosRankSum=0.087;SOR=0.642;VQSLOD=-5.629;culprit=MQRankSum
VQSRTrancheSNP99.70to99.80      AC=21;AF=0.002193;AN=9574;BaseQRankSum=0;CNN_1D=-16.118;ClippingRankSum=0;DP=209533;ExcessHet=3.1253;FS=0;InbreedingCoeff=-0.0027;MLEAC=21;MLEAF=0.002193;MQ=57.8;MQRankSum=1.19;NEGATIVE_TRAIN_SITE;QD=4.35;ReadPosRankSum=-0.075;SOR=0.695;VQSLOD=-5.57;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=1;AF=0.0001044;AN=9572;BaseQRankSum=-2.379;CNN_1D=-16.118;ClippingRankSum=0;DP=212253;ExcessHet=3.0108;FS=0;InbreedingCoeff=-0.0003;MLEAC=1;MLEAF=0.0001044;MQ=186.25;MQRankSum=0.084;QD=0.28;ReadPosRankSum=-0.743;SOR=0.798;VQSLOD=-17.7;culprit=MQ
VQSRTrancheSNP99.80to99.90      AC=5214;AF=0.545;AN=9570;BaseQRankSum=-1.071;CNN_1D=-16.118;ClippingRankSum=0;DP=276543;ExcessHet=160;FS=10.66;InbreedingCoeff=-0.6888;MLEAC=1001;MLEAF=0.105;MQ=52.07;MQRankSum=-4.544;QD=4.38;ReadPosRankSum=0.877;SOR=0.905;VQSLOD=-16.46;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=99;AF=0.01;AN=9572;BaseQRankSum=0.561;CNN_1D=-16.118;ClippingRankSum=0;DP=268187;ExcessHet=5.3244;FS=3.001;InbreedingCoeff=-0.0109;MLEAC=99;MLEAF=0.01;MQ=58;MQRankSum=1.54;QD=5.65;ReadPosRankSum=-0.153;SOR=0.813;VQSLOD=-7.923;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=2;AF=0.000209;AN=9568;BaseQRankSum=0.382;CNN_1D=-16.118;ClippingRankSum=0;DP=268252;ExcessHet=3.0108;FS=4.32;InbreedingCoeff=-0.0002;MLEAC=2;MLEAF=0.000209;MQ=72.77;MQRankSum=-0.222;QD=7.08;ReadPosRankSum=-0.035;SOR=1.107;VQSLOD=-6.785;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=2;AF=0.000209;AN=9568;BaseQRankSum=-1.51;CNN_1D=-16.118;ClippingRankSum=0;DP=268301;ExcessHet=3.0108;FS=0;InbreedingCoeff=-0.0002;MLEAC=2;MLEAF=0.000209;MQ=73.41;MQRankSum=0.319;QD=6.02;ReadPosRankSum=-0.711;SOR=0.677;VQSLOD=-6.758;culprit=DP
VQSRTrancheSNP99.90to100.00     AC=1;AF=0.0001045;AN=9568;BaseQRankSum=-3.283;CNN_1D=-16.118;ClippingRankSum=0;DP=268087;ExcessHet=3.0103;FS=4.28;InbreedingCoeff=-0.0002;MLEAC=1;MLEAF=0.0001045;MQ=214.38;MQRankSum=0.337;QD=1.46;ReadPosRankSum=-0.141;SOR=2.07;VQSLOD=-26.94;culprit=MQ
VQSRTrancheINDEL99.80to99.90    AC=580,39;AF=0.061,0.004073;AN=9574;BaseQRankSum=0.073;CNN_1D=-16.118;ClippingRankSum=0;DP=269117;ExcessHet=61.5551;FS=6.865;InbreedingCoeff=-0.0577;MLEAC=558,39;MLEAF=0.058,0.004073;MQ=62.73;MQRankSum=-1.87;NEGATIVE_TRAIN_SITE;QD=2.98;ReadPosRankSum=0.659;SOR=1.554;VQSLOD=-4.621;culprit=FS
VQSRTrancheINDEL99.90to100.00   AC=4952,198,26;AF=0.517,0.021,0.002716;AN=9570;BaseQRankSum=-0.254;CNN_1D=-16.118;ClippingRankSum=0;DP=274973;ExcessHet=160;FS=20.217;InbreedingCoeff=-0.7273;MLEAC=1001,190,26;MLEAF=0.105,0.02,0.002716;MQ=51.29;MQRankSum=-4.267;NEGATIVE_TRAIN_SITE;QD=5.85;ReadPosRankSum=1.54;SOR=1.178;VQSLOD=-12.25;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=5026,2;AF=0.525,0.0002089;AN=9570;BaseQRankSum=-0.817;CNN_1D=-16.118;ClippingRankSum=0;DP=274387;ExcessHet=160;FS=17.112;InbreedingCoeff=-0.7525;MLEAC=1001,2;MLEAF=0.105,0.0002089;MQ=51.41;MQRankSum=-3.995;QD=5.73;ReadPosRankSum=1.45;SOR=1.185;VQSLOD=-14.57;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=1;AF=0.0001044;AN=9574;BaseQRankSum=-3.147;CNN_1D=-16.118;ClippingRankSum=0;DP=263720;ExcessHet=3.0103;FS=18.676;InbreedingCoeff=0.0011;MLEAC=1;MLEAF=0.0001044;MQ=181.04;MQRankSum=-0.233;QD=4.12;ReadPosRankSum=-0.697;SOR=2.655;VQSLOD=-19.93;culprit=MQ
VQSRTrancheSNP99.80to99.90      AC=5155;AF=0.539;AN=9570;BaseQRankSum=0;CNN_1D=-16.118;ClippingRankSum=0;DP=264943;ExcessHet=160;FS=24.422;InbreedingCoeff=-0.7117;MLEAC=1001;MLEAF=0.105;MQ=51.13;MQRankSum=-4.031;QD=4.05;ReadPosRankSum=1.02;SOR=1.32;VQSLOD=-15.3;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=2;AF=0.0002089;AN=9574;BaseQRankSum=-0.848;CNN_1D=-16.118;ClippingRankSum=0;DP=243627;ExcessHet=3.0108;FS=29.846;InbreedingCoeff=0.0007;MLEAC=1;MLEAF=0.0001044;MQ=97.17;MQRankSum=-1.05;QD=1.97;ReadPosRankSum=0.137;SOR=2.822;VQSLOD=-9.466;culprit=SOR
PASS    AC=2;AF=0.0002089;AN=9574;BaseQRankSum=-0.028;CNN_1D=-16.118;ClippingRankSum=0;DP=240126;ExcessHet=3.0108;FS=0;InbreedingCoeff=0.0001;MLEAC=2;MLEAF=0.0002089;MQ=113.85;MQRankSum=1.06;NEGATIVE_TRAIN_SITE;QD=12.16;ReadPosRankSum=0.02;SOR=0.657;VQSLOD=-0.7624;culprit=FS
VQSRTrancheSNP99.80to99.90      AC=751;AF=0.078;AN=9570;BaseQRankSum=0.851;CNN_1D=-16.118;ClippingRankSum=0;DP=239230;ExcessHet=43.4908;FS=1.046;InbreedingCoeff=-0.053;MLEAC=683;MLEAF=0.071;MQ=94.78;MQRankSum=-2.016;QD=1.74;ReadPosRankSum=-0.351;SOR=0.876;VQSLOD=-7.022;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=2;AF=0.0002089;AN=9572;BaseQRankSum=-2.445;CNN_1D=-16.118;ClippingRankSum=0;DP=232218;ExcessHet=3.0117;FS=1.387;InbreedingCoeff=-0.0009;MLEAC=2;MLEAF=0.0002089;MQ=130.25;MQRankSum=0.796;QD=0.74;ReadPosRankSum=-0.412;SOR=1.098;VQSLOD=-7.822;culprit=DP
VQSRTrancheSNP99.80to99.90      AC=5027;AF=0.525;AN=9568;BaseQRankSum=0.523;CNN_1D=-16.118;ClippingRankSum=0;DP=232117;ExcessHet=160;FS=27.939;InbreedingCoeff=-0.7236;MLEAC=1001;MLEAF=0.105;MQ=50.81;MQRankSum=-4.061;QD=4.78;ReadPosRankSum=-0.59;SOR=1.837;VQSLOD=-14.73;culprit=DP

jjfarrell commented 5 years ago

Are there any plans to develop models to run CNNScoreVariants for multiple sample vcfs?

ldgauthier commented 5 years ago

Not at the moment. The model as it exists now doesn't guarantee good results on multi-sample VCFs. Sam ran the CNN on the very large gnomAD cohort (~20,000 samples) and got results that were comparable with VQSR and their random forest, but had some strange biases. It would probably do a good job for small cohorts, but as the number of samples increases, the common variants take on tighter annotation distributions, which tends to penalize singletons in VQSR and in the CNN. We haven't put in any work to determine what a reasonable sample-size cutoff would be, so for now the recommendation is single-sample only.

cmnbroad commented 5 years ago

Closing via https://github.com/broadinstitute/gatk/pull/5548 where we added a warning when the input VCF has multiple samples. @lucidtronix feel free to reopen if more is planned.

broadinstitute / gatk

CNNScoreVariant CNN_1D=-16.118 for all sites #5101