broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License

Output active regions in HaplotypeCaller #862

Closed (virtuabhi closed 7 years ago)

virtuabhi commented 7 years ago

Hi

What would be the best way to output active regions from GATK4 HaplotypeCaller?

In GATK 3.x, the option --activeRegionOut <file name> could be used to write active regions to a TSV file. However, I cannot find a similar option in GATK4 (September 2016 version, commit id b24fb75bf474d87f5e2259617fef268a145b28c6). The most relevant option seems to be -justDetermineActiveRegions <boolean>, but it does not write (or generate) active regions to a file.

Thanks, Abhishek

Output of GATK4 HaplotypeCaller help message:

Optional Arguments:

--annotation,-A:String        One or more specific annotations to apply to variant calls  This argument may be 
                              specified 0 or more times. Default value: null. 

--sample_name,-sn:String      Name of single sample to use from a multi-sample bam  Default value: null. 

--GVCFGQBands,-GQB:Integer    GQ thresholds for reference confidence bands  This argument may be specified 0 or more 
                              times. Default value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
                              20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 
                              42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 
                              99]. 

--indelSizeToEliminateInRefModel,-ERCIS:Integer
                              The size of an indel to check for in the reference model  Default value: 10. 

--useAllelesTrigger,-allelesTrigger:Boolean
                              Use additional trigger on variants found in an external alleles file  Default value: 
                              false. Possible values: {true, false} 

--dontTrimActiveRegions,-dontTrimActiveRegions:Boolean
                              If specified, we will not trim down the active region from the full region (active + 
                              extension) to just the active interval for genotyping  Default value: false. Possible 
                              values: {true, false} 

--maxDiscARExtension,-maxDiscARExtension:Integer
                              the maximum extent into the full active region extension that we're willing to go in 
                              genotyping our events for discovery  Default value: 25. 

--maxGGAARExtension,-maxGGAARExtension:Integer
                              the maximum extent into the full active region extension that we're willing to go in 
                              genotyping our events for GGA mode  Default value: 300. 

--paddingAroundIndels,-paddingAroundIndels:Integer
                              Include at least this many bases around an event for calling indels  Default value: 150. 

--paddingAroundSNPs,-paddingAroundSNPs:Integer
                              Include at least this many bases around an event for calling snps  Default value: 20. 

--kmerSize,-kmerSize:Integer  Kmer size to use in the read threading assembler  This argument may be specified 0 or 
                              more times. Default value: [10, 25]. 

--dontIncreaseKmerSizesForCycles,-dontIncreaseKmerSizesForCycles:Boolean
                              Disable iterating over kmer sizes when graph cycles are detected  Default value: false. 
                              Possible values: {true, false} 

--allowNonUniqueKmersInRef,-allowNonUniqueKmersInRef:Boolean
                              Allow graphs that have non-unique kmers in the reference  Default value: false. Possible 
                              values: {true, false} 

--numPruningSamples,-numPruningSamples:Integer
                              Number of samples that must pass the minPruning threshold  Default value: 1. 

--recoverDanglingHeads,-recoverDanglingHeads:Boolean
                              This argument is deprecated since version 3.3  Default value: false. Possible values: 
                              {true, false} 

--doNotRecoverDanglingBranches,-doNotRecoverDanglingBranches:Boolean
                              Disable dangling head and tail recovery  Default value: false. Possible values: {true, 
                              false} 

--minDanglingBranchLength,-minDanglingBranchLength:Integer
                              Minimum length of a dangling branch to attempt recovery  Default value: 4. 

--consensus,-consensus:Boolean
                              1000G consensus mode  Default value: false. Possible values: {true, false} 

--maxNumHaplotypesInPopulation,-maxNumHaplotypesInPopulation:Integer
                              Maximum number of haplotypes to consider for your population  Default value: 128. 

--errorCorrectKmers,-errorCorrectKmers:Boolean
                              Use an exploratory algorithm to error correct the kmers used during assembly  Default 
                              value: false. Possible values: {true, false} 

--minPruning,-minPruning:Integer
                              Minimum support to not prune paths in the graph  Default value: 2. 

--debugGraphTransformations,-debugGraphTransformations:Boolean
                              Write DOT formatted graph files out of the assembler for only this graph size  Default 
                              value: false. Possible values: {true, false} 

--graphOutput,-graph:String   Write debug assembly graph information to this file  Default value: null. 

--kmerLengthForReadErrorCorrection,-kmerLengthForReadErrorCorrection:Integer
                              Use an exploratory algorithm to error correct the kmers used during assembly  Default 
                              value: 25. 

--minObservationsForKmerToBeSolid,-minObservationsForKmerToBeSolid:Integer
                              A k-mer must be seen at least these times for it considered to be solid  Default value: 
                              20. 

--likelihoodCalculationEngine,-likelihoodEngine:Implementation
                              What likelihood calculation engine to use to calculate the relative likelihood of reads 
                              vs haplotypes  Default value: PairHMM. Possible values: {PairHMM, Random} 

--base_quality_score_threshold,-bqst:Byte
                              Base qualities below this threshold will be reduced to the minimum (6)  Default value: 
                              18. 

--gcpHMM,-gcpHMM:Integer      Flat gap continuation penalty for use in the Pair HMM  Default value: 10. 

--pair_hmm_implementation,-pairHMM:Implementation
                              The PairHMM implementation to use for genotype likelihood calculations  Default value: 
                              FASTEST_AVAILABLE. Possible values: {EXACT, ORIGINAL, LOGLESS_CACHING, 
                              AVX_LOGLESS_CACHING, FASTEST_AVAILABLE} 

--pcr_indel_model,-pcrModel:PCRErrorModel
                              The PCR indel model to use  Default value: CONSERVATIVE. Possible values: {NONE, HOSTILE, 
                              AGGRESSIVE, CONSERVATIVE} 

--phredScaledGlobalReadMismappingRate,-globalMAPQ:Integer
                              The global assumed mismapping rate for reads  Default value: 45. 

--dbsnp,-D:FeatureInput       dbSNP file  Default value: null. 

--comp,-comp:FeatureInput     Comparison VCF file(s)  This argument may be specified 0 or more times. Default value: 
                              null. 

--debug,-debug:Boolean        Print out very verbose debug information about each triggering active region  Default 
                              value: false. Possible values: {true, false} 

--useFilteredReadsForAnnotations,-useFilteredReadsForAnnotations:Boolean
                              Use the contamination-filtered read maps for the purposes of annotating variants  Default 
                              value: false. Possible values: {true, false} 

--emitRefConfidence,-ERC:ReferenceConfidenceMode
                              Mode for emitting reference confidence scores  Default value: NONE. Possible values: 
                              {NONE, BP_RESOLUTION, GVCF} 

--bamOutput,-bamout:String    File to which assembled haplotypes should be written  Default value: null. 

--bamWriterType,-bamWriterType:WriterType
                              Which haplotypes should be written to the BAM  Default value: CALLED_HAPLOTYPES. Possible 
                              values: {ALL_POSSIBLE_HAPLOTYPES, CALLED_HAPLOTYPES} 

--disableOptimizations,-disableOptimizations:Boolean
                              Don't skip calculations in ActiveRegions with no variants  Default value: false. Possible 
                              values: {true, false} 

--keepRG,-keepRG:String       Only use reads from this read group when making calls (but use all reads to build the 
                              assembly)  Default value: null. 

--justDetermineActiveRegions,-justDetermineActiveRegions:Boolean
                              Just determine ActiveRegions, don't perform assembly or calling  Default value: false. 
                              Possible values: {true, false} 

--dontGenotype,-dontGenotype:Boolean
                              Perform assembly but do not genotype variants  Default value: false. Possible values: 
                              {true, false} 

--dontUseSoftClippedBases,-dontUseSoftClippedBases:Boolean
                              Do not analyze soft clipped bases in the reads  Default value: false. Possible values: 
                              {true, false} 

--captureAssemblyFailureBAM,-captureAssemblyFailureBAM:Boolean
                              Write a BAM called assemblyFailure.bam capturing all of the reads that were in the active 
                              region when the assembler failed for any reason  Default value: false. Possible values: 
                              {true, false} 

--errorCorrectReads,-errorCorrectReads:Boolean
                              Use an exploratory algorithm to error correct the kmers used during assembly  Default 
                              value: false. Possible values: {true, false} 

--doNotRunPhysicalPhasing,-doNotRunPhysicalPhasing:Boolean
                              Disable physical phasing  Default value: false. Possible values: {true, false} 

--min_base_quality_score,-mbq:Byte
                              Minimum base quality required to consider a base for calling  Default value: 10. 

--min_mapping_quality_score,-mmq:Integer
                              Minimum read mapping quality required to consider a read for analysis with the 
                              HaplotypeCaller  Default value: 20. 

--group,-G:String             One or more classes/groups of annotations to apply to variant calls  This argument may be 
                              specified 0 or more times. Default value: [StandardAnnotation, StandardHCAnnotation]. 

--excludeAnnotation,-XA:String
                              One or more specific annotations to exclude  This argument may be specified 0 or more 
                              times. Default value: null. 

--annotateNDA,-nda:Boolean    If provided, we will annotate records with the number of alternate alleles that were 
                              discovered (but not necessarily genotyped) at a given site  Default value: false. 
                              Possible values: {true, false} 

--heterozygosity,-hets:Double Heterozygosity value used to compute prior likelihoods for any locus.  See the GATKDocs 
                              for full details on the meaning of this population genetics concept  Default value: 
                              0.001. 

--indel_heterozygosity,-indelHeterozygosity:Double
                              Heterozygosity for indel calling.  See the GATKDocs for heterozygosity for full details 
                              on the meaning of this population genetics concept  Default value: 1.25E-4. 

--standard_min_confidence_threshold_for_calling,-stand_call_conf:Double
                              The minimum phred-scaled confidence threshold at which variants should be called  Default 
                              value: 30.0. 

--standard_min_confidence_threshold_for_emitting,-stand_emit_conf:Double
                              The minimum phred-scaled confidence threshold at which variants should be emitted (and 
                              filtered with LowQual if less than the calling threshold)  Default value: 30.0. 

--max_alternate_alleles,-maxAltAlleles:Integer
                              Maximum number of alternate alleles to genotype  Default value: 6. 

--max_genotype_count,-maxGT:Integer
                              Maximum number of genotypes to consider at any site  Default value: 1024. 

--input_prior,-inputPrior:Double
                              Input prior for calls  This argument may be specified 0 or more times. Default value: 
                              null. 

--sample_ploidy,-ploidy:Integer
                              Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in 
                              each pool * Sample Ploidy).  Default value: 2. 

--genotyping_mode,-gt_mode:GenotypingOutputMode
                              Specifies how to determine the alternate alleles to use for genotyping  Default value: 
                              DISCOVERY. Possible values: {DISCOVERY, GENOTYPE_GIVEN_ALLELES} 

--alleles,-alleles:FeatureInput
                              The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES  
                              Default value: null. 

--contamination_fraction_to_filter,-contamination:Double
                              Fraction of contamination in sequencing data (for all samples) to aggressively remove  
                              Default value: 0.0. 

--contamination_fraction_per_sample_file,-contaminationFile:File
                              Tab-separated File containing fraction of contamination in sequencing data (per sample) 
                              to aggressively remove. Format should be "<SampleID><TAB><Contamination>" (Contamination 
                              is double) per line; No header.  Default value: null. 

--p_nonref_model,-pnrm:AFCalculatorImplementation
                              Non-reference probability calculation model to employ  Default value: null. Possible 
                              values: {EXACT_INDEPENDENT, EXACT_REFERENCE, EXACT_ORIGINAL, EXACT_GENERAL_PLOIDY} 

--exactCallsLog,-logExactCalls:File
                              x  Default value: null. 

--output_mode,-out_mode:OutputMode
                              Specifies which type of calls we should output  Default value: EMIT_VARIANTS_ONLY. 
                              Possible values: {EMIT_VARIANTS_ONLY, EMIT_ALL_CONFIDENT_SITES, EMIT_ALL_SITES} 

--allSitePLs,-allSitePLs:Boolean
                              Annotate all sites with PLs  Default value: false. Possible values: {true, false} 

--readShardSize,-readShardSize:Integer
                              Maximum size of each read shard, in bases. For good performance, this should be much 
                              larger than the maximum assembly region size.  Default value: 5000. 

--readShardPadding,-readShardPadding:Integer
                              Each read shard has this many bases of extra context on each side. Read shards must have 
                              as much or more padding than assembly regions.  Default value: 100. 

--minAssemblyRegionSize,-minAssemblyRegionSize:Integer
                              Minimum size of an assembly region  Default value: 50. 

--maxAssemblyRegionSize,-maxAssemblyRegionSize:Integer
                              Maximum size of an assembly region  Default value: 300. 

--assemblyRegionPadding,-assemblyRegionPadding:Integer
                              Amount of additional bases of context to include around each assembly region  Default 
                              value: 100. 

--maxReadsPerAlignmentStart,-maxReadsPerAlignmentStart:Integer
                              Maximum number of reads to retain per alignment start position. Reads above this 
                              threshold will be downsampled. Set to 0 to disable.  Default value: 50. 

--activeProbabilityThreshold,-activeProbabilityThreshold:Double
                              Minimum probability for a locus to be considered active.  Default value: 0.002. 

--maxProbPropagationDistance,-maxProbPropagationDistance:Integer
                              Upper limit on how many bases away probability mass can be moved around when calculating 
                              the boundaries between active and inactive assembly regions  Default value: 50. 

--disable_all_read_filters,-f:Boolean
                              Disable all read filters  Default value: false. Possible values: {true, false} 

--intervals,-L:String         One or more genomic intervals over which to operate  This argument may be specified 0 or 
                              more times. Default value: null. 

--excludeIntervals,-XL:String One or more genomic intervals to exclude from processing  This argument may be specified 
                              0 or more times. Default value: null. 

--interval_set_rule,-isr:IntervalSetRule
                              Set merging approach to use for combining interval inputs  Default value: UNION. Possible 
                              values: {UNION, INTERSECTION} 

--interval_padding,-ip:Integer
                              Amount of padding (in bp) to add to each interval  Default value: 0. 

--readValidationStringency,-VS:ValidationStringency
                              Validation stringency for all SAM/BAM/CRAM/SRA files read by this program.  The default 
                              stringency value SILENT can improve performance when processing a BAM file in which 
                              variable-length data (read, qualities, tags) do not otherwise need to be decoded.  
                              Default value: SILENT. Possible values: {STRICT, LENIENT, SILENT} 

--secondsBetweenProgressUpdates,-secondsBetweenProgressUpdates:Double
                              Output traversal statistics every time this many seconds elapse  Default value: 10.0. 

--disableSequenceDictionaryValidation,-disableSequenceDictionaryValidation:Boolean
                              If specified, do not check the sequence dictionaries from our inputs for compatibility. 
                              Use at your own risk!  Default value: false. Possible values: {true, false} 

--createOutputBamIndex,-createOutputBamIndex:Boolean
                              If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file  Default 
                              value: true. Possible values: {true, false} 

--createOutputBamMD5,-createOutputBamMD5:Boolean
                              If true, create a MD5 digest for any BAM/SAM/CRAM file created  Default value: false. 
                              Possible values: {true, false} 

--addOutputSAMProgramRecord,-addOutputSAMProgramRecord:Boolean
                              If true, adds a PG tag to created SAM/BAM/CRAM files.  Default value: true. Possible 
                              values: {true, false} 

--TMP_DIR:File                Undocumented option  This argument may be specified 0 or more times. Default value: null. 

--help,-h:Boolean             display the help message  Default value: false. Possible values: {true, false} 

--version:Boolean             display the version number for this tool  Default value: false. Possible values: {true, 
                              false} 

--arguments_file:File         read one or more arguments files and add them to the command line  This argument may be 
                              specified 0 or more times. Default value: null. 

--verbosity,-verbosity:LogLevel
                              Control verbosity of logging.  Default value: INFO. Possible values: {ERROR, WARNING, 
                              INFO, DEBUG} 

--QUIET:Boolean               Whether to suppress job-summary info on System.err.  Default value: false. Possible 
                              values: {true, false} 

--use_jdk_deflater,-jdk_deflater:Boolean
                              Whether to use the JdkDeflater (as opposed to IntelDeflater)  Default value: false. 
                              Possible values: {true, false} 

virtuabhi commented 7 years ago

Would the callRegion method in HaplotypeCallerEngine (https://github.com/broadinstitute/gatk-protected/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java#L442) be the correct place to add this feature?
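
To make the request concrete, here is a minimal sketch of the kind of tab-separated output in question. It is plain Python rather than GATK code: the Region tuple, the column layout, the sample coordinates, and the function names are all assumptions made for illustration. Nothing here reflects how GATK 3.x formatted its file or how callRegion works.

```python
import csv
import io
from typing import Iterable, NamedTuple


class Region(NamedTuple):
    """Hypothetical stand-in for an assembly region; not a GATK class."""
    contig: str
    start: int       # assumed 1-based, inclusive
    end: int         # assumed 1-based, inclusive
    is_active: bool


def write_regions_tsv(regions: Iterable[Region], out) -> int:
    """Write a header plus one tab-separated row per region; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["contig", "start", "end", "is_active"])
    count = 0
    for region in regions:
        writer.writerow([
            region.contig,
            region.start,
            region.end,
            "1" if region.is_active else "0",
        ])
        count += 1
    return count


def demo() -> str:
    """Render two made-up regions to a TSV string."""
    buffer = io.StringIO()
    write_regions_tsv(
        [Region("20", 10000, 10299, False), Region("20", 10300, 10450, True)],
        buffer,
    )
    return buffer.getvalue()
```

Calling demo() returns a header line followed by one row per region. In an actual implementation the rows would come from the region traversal rather than a hard-coded list.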

vdauwera commented 7 years ago

Hi @virtuabhi, we're currently running evaluations on the HaplotypeCaller port to GATK4, so we are not making or accepting any changes to it for now. Once that is done, we'll determine which features are missing or needed and how they should be implemented. Among other things, parts of the active region traversal machinery might change.

virtuabhi commented 7 years ago

Thanks