Closed sooheelee closed 6 years ago
I just heard that production is moving to GatherVCFs, which is getting some update that might be pertinent. I know that the workflows say that they are using MergeVCFs because of other issues:
# using MergeVcfs instead of GatherVcfs so we can create indices
# WARNING 2015-10-28 15:01:48 GatherVcfs Index creation not currently supported when gathering block compressed VCFs.
This concern is germane to any WGS analyses and perhaps not concerning for WES analyses. So perhaps our current WDL workflows that use SplitIntervals could expressly state that they are not safe for WGS variant calling analyses.
Here are my tests of GatherVcfs that show something interesting.
input1
WMCF9-CB5:precomputed_results shlee$ gzcat split3_8.vcf.gz | grep -v '##'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr6 33414233 . GT G . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=28.24;RPA=5,4;RU=T;STR;TLOD=154.53 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:112,0:0.00:0:0:3730,0:62:50 0/1:66,70:0.534:25:41:2209,2350:26:40
chr6 33442919 . A C . alt_allele_in_normal ECNT=1;HCNT=32;MAX_ED=.;MIN_ED=.;NLOD=2.94;TLOD=6.35 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:88,17:0.193:4:13:0.765:2231,128:38:50 0/1:123,29:0.156:13:16:0.552:3124,283:60:60
chr6 71886972 . TC T . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8
input2
WMCF9-CB5:precomputed_results shlee$ gzcat split2_8.vcf.gz | grep -v '##'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr6 71886972 . TC T . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8
chr6 118314029 . TTTCAGGA T . PASS ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=80.46 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:69,0:0.00:0:0:2115,0:35:34 0/1:68,26:0.261:13:10:2100,793:37:29
output GatherVcfs to .vcf.gz allows for duplicate records
WMCF9-CB5:precomputed_results shlee$ java -jar $PICARD GatherVcfs I=split3_8.vcf.gz I=split2_8.vcf.gz O=../test_gathervcf_split8_overlap.vcf.gz
[Wed Jun 07 14:51:32 EDT 2017] picard.vcf.GatherVcfs INPUT=[split3_8.vcf.gz, split2_8.vcf.gz] OUTPUT=../test_gathervcf_split8_overlap.vcf.gz VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jun 07 14:51:32 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
INFO 2017-06-07 14:51:32 GatherVcfs Checking inputs.
INFO 2017-06-07 14:51:32 GatherVcfs Checking file headers and first records to ensure compatibility.
INFO 2017-06-07 14:51:32 GatherVcfs Gathering by copying gzip blocks. Will not be able to validate position non-overlap of files.
WARNING 2017-06-07 14:51:32 GatherVcfs Index creation not currently supported when gathering block compressed VCFs.
INFO 2017-06-07 14:51:32 GatherVcfs Gathering /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split3_8.vcf.gz
INFO 2017-06-07 14:51:32 GatherVcfs Gathering /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split2_8.vcf.gz
[Wed Jun 07 14:51:32 EDT 2017] picard.vcf.GatherVcfs done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
WMCF9-CB5:precomputed_results shlee$ gzcat ../test_gathervcf_split8_overlap.vcf.gz | grep -v '##'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr6 33414233 . GT G . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=28.24;RPA=5,4;RU=T;STR;TLOD=154.53 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:112,0:0.00:0:0:3730,0:62:50 0/1:66,70:0.534:25:41:2209,2350:26:40
chr6 33442919 . A C . alt_allele_in_normal ECNT=1;HCNT=32;MAX_ED=.;MIN_ED=.;NLOD=2.94;TLOD=6.35 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:88,17:0.193:4:13:0.765:2231,128:38:50 0/1:123,29:0.156:13:16:0.552:3124,283:60:60
chr6 71886972 . TC T . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8
chr6 71886972 . TC T . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8
chr6 118314029 . TTTCAGGA T . PASS ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=80.46 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:69,0:0.00:0:0:2115,0:35:34 0/1:68,26:0.261:13:10:2100,793:37:29
output to non-block-compressed gives desired error
WMCF9-CB5:precomputed_results shlee$ java -jar $PICARD GatherVcfs I=split3_8.vcf.gz I=split2_8.vcf.gz O=../test_gathervcf_split8_overlap2.vcf
[Wed Jun 07 14:56:26 EDT 2017] picard.vcf.GatherVcfs INPUT=[split3_8.vcf.gz, split2_8.vcf.gz] OUTPUT=../test_gathervcf_split8_overlap2.vcf VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jun 07 14:56:26 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
INFO 2017-06-07 14:56:26 GatherVcfs Checking inputs.
INFO 2017-06-07 14:56:26 GatherVcfs Checking file headers and first records to ensure compatibility.
INFO 2017-06-07 14:56:27 GatherVcfs Gathering by conventional means.
[Wed Jun 07 14:56:27 EDT 2017] picard.vcf.GatherVcfs done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: First variant in file /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split2_8.vcf.gz is at Unknown but last variant in earlier file /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split3_8.vcf.gz is at Unknown
at picard.vcf.GatherVcfs.gatherConventionally(GatherVcfs.java:177)
at picard.vcf.GatherVcfs.doWork(GatherVcfs.java:82)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
@LeeTL1220 may be interested to see this as he has been alluding to differences in handling based on +/- block compression.
@sooheelee It looks like split2_8.vcf.gz and split3_8.vcf.gz were generated with GATK 3 M2. In GATK 4 we only emit calls that are within the unpadded read shards, which I believe do not extend past the input intervals. Thus, as long as SplitIntervals
returns disjoint subintervals (which it does), different scatters of M2 should produce disjoint calls. Could you try to reproduce the issue with GATK 4 M2?
Alright. I will assume this is not urgent unless you say otherwise.
I still feel strongly about not using outputs of SplitIntervals with any reassembly caller as these will expand active regions beyond the intervals given. It is thus possible to end up with calls that depend on identical reads. That is, read evidence is being counted twice for different calls and/or we end up with duplicate calls.
@davidbenjamin
@sooheelee Do we have a known case in GATK4? I thought we had some code to avoid it but when I recently looked I couldn't find it. So perhaps it was a figment of my imagination, in which case it's certainly an issue.
I'm pretty sure we've done this in the exome production pipeline forever, though not with SplitIntervals. HaplotypeCaller (and presumably MuTect) will only output calls in the specified region, no matter how much territory goes into the active region. If a small interval gets split, it will likely have the same active region for both calls.
That's how I remember it as well. If anyone comes with a specific case we can reexamine, but pending that I think it's best to close this. Reopen if you disagree.
For discussion with @davidbenjamin. Conversation will benefit Mutect2 workflow documentation.
The Mutect2 WDL pipeline at https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect2.wdl uses the following:
Assuming Mutect2 handles active regions in the same manner as HaplotypeCaller, which will expand/pad active regions, then a consequence of the current workflow is potential duplicate calls (that can also differ slightly from each other e.g. due to rounding) for the same genomic locus that result from expansion of the active region into the edges of the intervals being split.
MergeVCFs as well as GatherVCFs, allows duplicate calls in the final VCF without checks. GatherVCFs does not allow for out-of-genomic-order-inputs and will give an error. However, I'm noticing something interesting (see below).
I would recommend creating intervals without splitting, i.e. with the BALANCING_WITHOUT_INTERVAL_SUBDIVISION option and setting the default of the tool to such to ward against accidental misuse.