broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 592 forks source link

SelectVariants appears to produce some incongruities in header lines #4252

Open jamesemery opened 6 years ago

jamesemery commented 6 years ago

I have noticed that when we run a vcf through select variants and compare it with the original, that there can be a number of differences in the header, namely the following problems: -It appears to be throwing away all but one of the ##GATKCommandLine fields=... -It drops the ##reference=... field. -For the sequence dictionary it is dropping the assembly attribute for contigs.

cmnbroad commented 6 years ago

There is an existing ticket thats a more general version of this same issue. We really should have a canonical solution for what is retained from the input header - I'm fairly confident every VCF-writing tool in GATK does something slightly different. The specific behaviors listed above were all deliberate, though they should probably be revisited. My vague recollections about each:

ldgauthier commented 6 years ago

I need GATKCommandLine headers from multiple runs of the same tool. The production joint calling workflow calls ApplyVQSR twice (once for SNPs, once for INDELs) and I need those lines for complete provenance of the data.

droazen commented 6 years ago

We definitely need to ensure that we're retaining all GATKCommandLine lines, even if there are multiple runs of the same tool.

droazen commented 6 years ago

Note that GATK3 uniquified the GATKCommandLine lines using a scheme like this:

##GATKCommandLine.SelectVariants.2=<ID=SelectVariants,Version=3.4-228-g2497091,Date="Tue Jan 05 13:48:45 EST 2016",Epoch=1452019725506,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=[3:113005755-195507036] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/humgen/1kg/reference/human_g1k_v37.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=true never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=/humgen/gsa-hpprojects/dev/gauthier/AS_GTEx/GTEx_AS/GTEx_AS.recal.multialleleics.AS.recalibrated.mixed.vcf) discordance=(RodBinding name= source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/humgen/gsa-hpprojects/dev/gauthier/AS_GTEx/VQSR.AStest.input.vcf sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=false excludeFiltered=false preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.SelectVariants=<ID=SelectVariants,Version=3.4-228-g2497091,Date="Tue Jan 05 13:29:45 EST 2016",Epoch=1452018585661,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/humgen/1kg/reference/human_g1k_v37.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=/humgen/gsa-hpprojects/dev/gauthier/AS_GTEx/GTEx_AS/GTEx_AS.recal.multialleleics.AS.recalibrated.vcf) discordance=(RodBinding name= source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/humgen/gsa-hpprojects/dev/gauthier/AS_GTEx/GTEx_AS/GTEx_AS.recal.multialleleics.AS.recalibrated.mixed.vcf sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=false excludeFiltered=false preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[MIXED] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
dariober commented 6 years ago

Not sure if this should be a separate tickets... It appears that SelectVariants duplicates the INFO/AF tag in the header, this results in an invalid vcf. At least according to some tools like rtg. This is with gatk 4.0.1.0.

Example:

zcat gnomad.genomes.r2.0.2.sites.hg38.chr18.vcf.gz | grep 'ID=AF,'
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency among genotypes, for each ALT allele, in the same order as listed">
gatk SelectVariants -V gnomad.genomes.r2.0.2.sites.hg38.chr18.vcf.gz -O gnomad.r2.0.2.biallelic.hg38.chr18.vcf.gz --restrict-alleles-to BIALLELIC

zcat gnomad.r2.0.2.biallelic.hg38.chr18.vcf.gz | grep 'ID=AF,'
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency among genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
rtg vcfsubset -i gnomad.r2.0.2.biallelic.hg38.chr18.vcf.gz --keep-info AF -o - 
Error: Invalid VCF header. VCF header contains multiple field declarations with the same ID=AF
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency among genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> on line:##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
cmnbroad commented 6 years ago

The duplicate line issue is one of a cluster of problems with htsjdk's header handling that will be fixed by https://github.com/samtools/htsjdk/pull/835. IIRC, this particular one happens when the input file has an AF line that differs from the standard one used by htsjdk. In this case, the description differs, so htsjdk incorrectly roundtrips both. I believe if the input file contained the exact line used by htsjdk, it would not be duplicated in the output:

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">

dariober commented 6 years ago

@cmnbroad thanks for the additional info. Some more detail from my side in case others stumble upon the same problem...

bcftools annotate -O z -i 'INFO/AF > 0' -x ^INFO/AF gnomad.r2.0.2.biallelic.hg38.chr18.vcf.gz > gnomad.r2.0.2.simple.hg38.chr18.vcf.gz
Redmar-van-den-Berg commented 4 years ago

Has there been any progress on this issue? I'm still running into problems using SelectVariant where it duplicates the DP field when the description differs, leading to invalid VCF files. The lines from the VCF file

##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">

The gatk version

The Genome Analysis Toolkit (GATK) v4.1.8.1
HTSJDK Version: 2.23.0
Picard Version: 2.22.8
cmnbroad commented 3 years ago

In reviewing https://github.com/broadinstitute/gatk/pull/7069, I noticed another issue, which is that AC, AF, AN, DP header lines are all added to the output header unconditionally, whether they exist in the input or not. (Prior to #7068, there was code that added them conditionally based on the value of setFilteredGenotypesToNocall by calling GATKVariantContextUtils.addChromosomeCountsToHeader, but there was also subsequent code later on that added them unconditionally and directly with inline code.