broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 590 forks source link

Mutect2 should disable the MateOnSameContigOrNoMappedMateReadFilter by default #3514

Closed sooheelee closed 6 years ago

sooheelee commented 7 years ago

@vdauwera we should modify the M2 WDLs. @davidbenjamin this will improve your sensitivity.

Currently Mutect2 uses the MateOnSameContigOrNoMappedMateReadFilter filter that filters out any paired read whose mate maps to a different contig. This filter, if I recall correctly, used to be the hidden filter in HaplotypeCaller code that could not be turned off. It necessitated that I remove 0x1 flags in the GRCh38 tutorial (see section 6.1 of https://gatkforums.broadinstitute.org/gatk/discussion/8017/) so as to be able to call variants associated with a sample with an alternative haplotype.

This filter is now exposed so that users can disable it. In addition to disabling this filter for ALT-aware data, I recommend we turn it off by default for somatic analyses, for any reference. This allows us (i) to call on ALT-aware mappings if data is such and (ii) call on SNPs and indels generated by putative structural variants that go across contigs.

I know that this filter is active in the GATK4.beta.3-Mutect2 (see last line):

WMCF9-CB5:align shlee$ gatk-launch Mutect2 -R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta -I hcc1143_N_subset500.bam -tumor HCC1143_normal -O 1_normalforpon.vcf.gz
Using GATK jar /Applications/genomicstools/gatk/gatk-4.latest/gatk-package-4.beta.3-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /Applications/genomicstools/gatk/gatk-4.latest/gatk-package-4.beta.3-local.jar Mutect2 -R /Users/shlee/Documents/ref/hg38/Homo_sapiens_assembly38.fasta -I hcc1143_N_subset500.bam -tumor HCC1143_normal -O 1_normalforpon.vcf.gz
19:26:43.105 INFO  NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Applications/genomicstools/gatk/gatk-4.latest/gatk-package-4.beta.3-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
[August 24, 2017 7:26:43 PM EDT] Mutect2  --tumorSampleName HCC1143_normal --output 1_normalforpon.vcf.gz --input hcc1143_N_subset500.bam --reference /Users/shlee/Documents/ref/hg38/Homo_sapiens_assembly38.fasta  --genotypePonSites false --af_of_alleles_not_in_resource 0.001 --log_somatic_prior -6.0 --tumor_lod_to_emit 3.0 --initial_tumor_lod 2.0 --max_population_af 0.01 --normal_lod 2.2 --annotation Coverage --annotation DepthPerAlleleBySample --annotation TandemRepeat --annotation OxoGReadCounts --annotation ClippedBases --annotation ReadPosition --annotation BaseQuality --annotation MappingQuality --annotation FragmentLength --annotation StrandArtifact --dontTrimActiveRegions false --maxDiscARExtension 25 --maxGGAARExtension 300 --paddingAroundIndels 150 --paddingAroundSNPs 20 --kmerSize 10 --kmerSize 25 --dontIncreaseKmerSizesForCycles false --allowNonUniqueKmersInRef false --numPruningSamples 1 --recoverDanglingHeads false --doNotRecoverDanglingBranches false --minDanglingBranchLength 4 --consensus false --maxNumHaplotypesInPopulation 128 --errorCorrectKmers false --minPruning 2 --debugGraphTransformations false --kmerLengthForReadErrorCorrection 25 --minObservationsForKmerToBeSolid 20 --likelihoodCalculationEngine PairHMM --base_quality_score_threshold 18 --gcpHMM 10 --pair_hmm_implementation FASTEST_AVAILABLE --pcr_indel_model CONSERVATIVE --phredScaledGlobalReadMismappingRate 45 --nativePairHmmThreads 4 --useDoublePrecision false --debug false --useFilteredReadsForAnnotations false --emitRefConfidence NONE --bamWriterType CALLED_HAPLOTYPES --disableOptimizations false --justDetermineActiveRegions false --dontGenotype false --dontUseSoftClippedBases false --captureAssemblyFailureBAM false --errorCorrectReads false --doNotRunPhysicalPhasing false --min_base_quality_score 10 --useNewAFCalculator false --annotateNDA false --heterozygosity 0.001 --indel_heterozygosity 1.25E-4 --heterozygosity_stdev 0.01 --standard_min_confidence_threshold_for_calling 10.0 --max_alternate_alleles 6 --max_genotype_count 1024 --sample_ploidy 2 --genotyping_mode DISCOVERY --contamination_fraction_to_filter 0.0 --output_mode EMIT_VARIANTS_ONLY --allSitePLs false --readShardSize 5000 --readShardPadding 100 --minAssemblyRegionSize 50 --maxAssemblyRegionSize 300 --assemblyRegionPadding 100 --maxReadsPerAlignmentStart 50 --activeProbabilityThreshold 0.002 --maxProbPropagationDistance 50 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false --minimumMappingQuality 20
[August 24, 2017 7:26:43 PM EDT] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Version: 4.beta.3
19:26:43.243 INFO  Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 1
19:26:43.243 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:26:43.243 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:26:43.243 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:26:43.243 INFO  Mutect2 - Deflater: IntelDeflater
19:26:43.243 INFO  Mutect2 - Inflater: IntelInflater
19:26:43.243 INFO  Mutect2 - GCS max retries/reopens: 20
19:26:43.243 INFO  Mutect2 - Using google-cloud-java patch 317951be3c2e898e3916a4b1abf5a9c220d84df8
19:26:43.243 INFO  Mutect2 - Initializing engine
19:26:43.871 INFO  Mutect2 - Done initializing engine
19:26:44.130 WARN  PossibleDeNovo - Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.
19:26:44.346 WARN  PossibleDeNovo - Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.
19:26:44.624 WARN  NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib
19:26:44.624 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
19:26:44.643 INFO  NativeLibraryLoader - Loading libgkl_pairhmm.dylib from jar:file:/Applications/genomicstools/gatk/gatk-4.latest/gatk-package-4.beta.3-local.jar!/com/intel/gkl/native/libgkl_pairhmm.dylib
[WARNING] Ignoring request for 4 threads; not using OpenMP implementation
19:26:44.664 INFO  PairHMM - Using the AVX-accelerated native PairHMM implementation
19:26:44.723 INFO  ProgressMeter - Starting traversal
19:26:44.724 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
19:26:54.942 INFO  ProgressMeter -        chr6:29944953              0.2                   200           1174.5
19:27:04.954 INFO  ProgressMeter -       chr19:15244950              0.3                   460           1364.3
19:27:07.668 INFO  Mutect2 - 18876 read(s) filtered by: (((((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND ) AND MateOnSameContigOrNoMappedMateReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
  18876 read(s) filtered by: ((((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND ) AND MateOnSameContigOrNoMappedMateReadFilter) AND GoodCigarReadFilter)
      18876 read(s) filtered by: (((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND ) AND MateOnSameContigOrNoMappedMateReadFilter)
          18599 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND )
              18599 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
                  18599 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
                      18599 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter) AND NotDuplicateReadFilter)
                          9376 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryAlignmentReadFilter)
                              9376 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter) AND MappedReadFilter)
                                  9376 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappingQualityNotZeroReadFilter)
                                      9376 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
                                          9376 read(s) filtered by: MappingQualityReadFilter 
                          9223 read(s) filtered by: NotDuplicateReadFilter 
          277 read(s) filtered by: MateOnSameContigOrNoMappedMateReadFilter 

Also, can we make the standout that counts filtered reads more human readable please.

sooheelee commented 7 years ago

In addition, we should be able to supply the GRCh38.fasta.alt file to our tools so GATK can intelligently handle the ALT contigs. That is, the program will know that reads mapped to chr1 or chr1_xyz_ALT are in fact on the same molecular chromosome (contig).

vdauwera commented 7 years ago

Making the callers ALT aware is a valid feature request -- but it's a big one. You'll want to open a separate issue ticket for that.

sooheelee commented 7 years ago

Ok.

davidbenjamin commented 6 years ago

@sooheelee Done in PR #4466 after verifying that it doesn't hurt sensitivity on hg19.