broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 592 forks source link

Some concerns about the Mutect2 WDL pipeline #3061

Closed sooheelee closed 6 years ago

sooheelee commented 7 years ago

For discussion with @davidbenjamin. Conversation will benefit Mutect2 workflow documentation.

The Mutect2 WDL pipeline at https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect2.wdl uses the following:

  1. SplitIntervals in default mode (clarified in #3032 that default INTERVAL_SUBDIVISION mode can cut into an interval in the intervals list)
     java -jar $GATK_JAR SplitIntervals -R ${ref_fasta} ${"-L " + intervals} -scatter ${scatter_count} -O interval-files
    cp interval-files/*.intervals .
  2. Scatter Mutect2 over files of intervals
    scatter (subintervals in SplitIntervals.interval_files ) {
    call M2 {
  3. MergeVCFs to collate the resulting VCF callsets
    java -Xmx2g -jar $GATK_JAR MergeVcfs -I ${sep=' -I ' input_vcfs} -O ${output_vcf_name}.vcf

Assuming Mutect2 handles active regions in the same manner as HaplotypeCaller, which will expand/pad active regions, then a consequence of the current workflow is potential duplicate calls (that can also differ slightly from each other e.g. due to rounding) for the same genomic locus that result from expansion of the active region into the edges of the intervals being split.

MergeVCFs as well as GatherVCFs, allows duplicate calls in the final VCF without checks. GatherVCFs does not allow for out-of-genomic-order-inputs and will give an error. However, I'm noticing something interesting (see below).

I would recommend creating intervals without splitting, i.e. with the BALANCING_WITHOUT_INTERVAL_SUBDIVISION option and setting the default of the tool to such to ward against accidental misuse.

sooheelee commented 7 years ago

I just heard that production is moving to GatherVCFs, which is getting some update that might be pertinent. I know that the workflows say that they are using MergeVCFs because of other issues:

  # using MergeVcfs instead of GatherVcfs so we can create indices
  # WARNING 2015-10-28 15:01:48 GatherVcfs  Index creation not currently supported when gathering block compressed VCFs.

This concern is germane to any WGS analyses and perhaps not concerning for WES analyses. So perhaps our current WDL workflows that use SplitIntervals could expressly state that they are not safe for WGS variant calling analyses.

sooheelee commented 7 years ago

Here are my tests of GatherVcfs that show something interesting.

input1

WMCF9-CB5:precomputed_results shlee$ gzcat split3_8.vcf.gz | grep -v '##' 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
chr6    33414233    .   GT  G   .   PASS    ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=28.24;RPA=5,4;RU=T;STR;TLOD=154.53 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:112,0:0.00:0:0:3730,0:62:50 0/1:66,70:0.534:25:41:2209,2350:26:40
chr6    33442919    .   A   C   .   alt_allele_in_normal    ECNT=1;HCNT=32;MAX_ED=.;MIN_ED=.;NLOD=2.94;TLOD=6.35    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/0:88,17:0.193:4:13:0.765:2231,128:38:50   0/1:123,29:0.156:13:16:0.552:3124,283:60:60
chr6    71886972    .   TC  T   .   PASS    ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12   GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8

input2

WMCF9-CB5:precomputed_results shlee$ gzcat split2_8.vcf.gz | grep -v '##' 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
chr6    71886972    .   TC  T   .   PASS    ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12   GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8
chr6    118314029   .   TTTCAGGA    T   .   PASS    ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=80.46  GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:69,0:0.00:0:0:2115,0:35:34  0/1:68,26:0.261:13:10:2100,793:37:29

output GatherVcfs to .vcf.gz allows for duplicate records

WMCF9-CB5:precomputed_results shlee$ java -jar $PICARD GatherVcfs I=split3_8.vcf.gz I=split2_8.vcf.gz O=../test_gathervcf_split8_overlap.vcf.gz
[Wed Jun 07 14:51:32 EDT 2017] picard.vcf.GatherVcfs INPUT=[split3_8.vcf.gz, split2_8.vcf.gz] OUTPUT=../test_gathervcf_split8_overlap.vcf.gz    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jun 07 14:51:32 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
INFO    2017-06-07 14:51:32 GatherVcfs  Checking inputs.
INFO    2017-06-07 14:51:32 GatherVcfs  Checking file headers and first records to ensure compatibility.
INFO    2017-06-07 14:51:32 GatherVcfs  Gathering by copying gzip blocks. Will not be able to validate position non-overlap of files.
WARNING 2017-06-07 14:51:32 GatherVcfs  Index creation not currently supported when gathering block compressed VCFs.
INFO    2017-06-07 14:51:32 GatherVcfs  Gathering /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split3_8.vcf.gz
INFO    2017-06-07 14:51:32 GatherVcfs  Gathering /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split2_8.vcf.gz
[Wed Jun 07 14:51:32 EDT 2017] picard.vcf.GatherVcfs done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
WMCF9-CB5:precomputed_results shlee$ gzcat ../test_gathervcf_split8_overlap.vcf.gz | grep -v '##' 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
chr6    33414233    .   GT  G   .   PASS    ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=28.24;RPA=5,4;RU=T;STR;TLOD=154.53 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:112,0:0.00:0:0:3730,0:62:50 0/1:66,70:0.534:25:41:2209,2350:26:40
chr6    33442919    .   A   C   .   alt_allele_in_normal    ECNT=1;HCNT=32;MAX_ED=.;MIN_ED=.;NLOD=2.94;TLOD=6.35    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1  0/0:88,17:0.193:4:13:0.765:2231,128:38:50   0/1:123,29:0.156:13:16:0.552:3124,283:60:60
chr6    71886972    .   TC  T   .   PASS    ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12   GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8
chr6    71886972    .   TC  T   .   PASS    ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=2.41;RPA=2,1;RU=C;STR;TLOD=14.12   GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:10,0:0.00:0:0:282,0:3:7 0/1:14,5:0.278:0:4:384,148:6:8
chr6    118314029   .   TTTCAGGA    T   .   PASS    ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=80.46  GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1    0/0:69,0:0.00:0:0:2115,0:35:34  0/1:68,26:0.261:13:10:2100,793:37:29

output to non-block-compressed gives desired error

WMCF9-CB5:precomputed_results shlee$ java -jar $PICARD GatherVcfs I=split3_8.vcf.gz I=split2_8.vcf.gz O=../test_gathervcf_split8_overlap2.vcf
[Wed Jun 07 14:56:26 EDT 2017] picard.vcf.GatherVcfs INPUT=[split3_8.vcf.gz, split2_8.vcf.gz] OUTPUT=../test_gathervcf_split8_overlap2.vcf    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jun 07 14:56:26 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
INFO    2017-06-07 14:56:26 GatherVcfs  Checking inputs.
INFO    2017-06-07 14:56:26 GatherVcfs  Checking file headers and first records to ensure compatibility.
INFO    2017-06-07 14:56:27 GatherVcfs  Gathering by conventional means.
[Wed Jun 07 14:56:27 EDT 2017] picard.vcf.GatherVcfs done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: First variant in file /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split2_8.vcf.gz is at Unknown but last variant in earlier file /Users/shlee/Documents/workshop_materials/mutect2_tutorial/mutect2_handson/precomputed_results/split3_8.vcf.gz is at Unknown
    at picard.vcf.GatherVcfs.gatherConventionally(GatherVcfs.java:177)
    at picard.vcf.GatherVcfs.doWork(GatherVcfs.java:82)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

@LeeTL1220 may be interested to see this as he has been alluding to differences in handling based on +/- block compression.

davidbenjamin commented 7 years ago

@sooheelee It looks like split2_8.vcf.gz and split3_8.vcf.gz were generated with GATK 3 M2. In GATK 4 we only emit calls that are within the unpadded read shards, which I believe do not extend past the input intervals. Thus, as long as SplitIntervals returns disjoint subintervals (which it does), different scatters of M2 should produce disjoint calls. Could you try to reproduce the issue with GATK 4 M2?

sooheelee commented 7 years ago

Alright. I will assume this is not urgent unless you say otherwise.

sooheelee commented 6 years ago

I still feel strongly about not using outputs of SplitIntervals with any reassembly caller as these will expand active regions beyond the intervals given. It is thus possible to end up with calls that depend on identical reads. That is, read evidence is being counted twice for different calls and/or we end up with duplicate calls.

sooheelee commented 6 years ago

@davidbenjamin

davidbenjamin commented 6 years ago

@sooheelee Do we have a known case in GATK4? I thought we had some code to avoid it but when I recently looked I couldn't find it. So perhaps it was a figment of my imagination, in which case it's certainly an issue.

ldgauthier commented 6 years ago

I'm pretty sure we've done this in the exome production pipeline forever, though not with SplitIntervals. HaplotypeCaller (and presumably MuTect) will only output calls in the specified region, no matter how much territory goes into the active region. If a small interval gets split, it will likely have the same active region for both calls.

vdauwera commented 6 years ago

That's how I remember it as well. If anyone comes with a specific case we can reexamine, but pending that I think it's best to close this. Reopen if you disagree.