broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

Malformed VCF for "AS_RAW_*" with "Number=1" after HaplotypeCaller #4162

Closed landesfeind closed 6 years ago

landesfeind commented 6 years ago

Hi, I ran into the following error when combining gVCF files generated by the HaplotypeCaller:

htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at chr1:13417. Field had 2 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias"

Similar number info was found for several allele-specific annotations:

##INFO=<ID=AS_InbreedingCoeff,Number=A,Type=Float,Description="allele specific heterozygosity as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation; relate to inbreeding coefficient">
##INFO=<ID=AS_QD,Number=A,Type=Float,Description="Allele-specific Variant Confidence/Quality by Depth">
##INFO=<ID=AS_RAW_BaseQRankSum,Number=1,Type=String,Description="raw data for allele specific rank sum test of base qualities">
##INFO=<ID=AS_RAW_MQ,Number=1,Type=String,Description="Allele-specfic raw data for RMS Mapping Quality">
##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specfic raw data for Mapping Quality Rank Sum">
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests">

I assume, the correct annotation should be "Number=A".

The gVCF files were generated with HaplotypeCaller using

   --emit-ref-confidence GVCF \
   --annotation-group StandardAnnotation \
   --annotation-group AS_StandardAnnotation \
   --annotation-group StandardHCAnnotation \

GATK version 4.0.0.0 (downloaded from GATK website)

ldgauthier commented 6 years ago

I take it you're using GenomicsDBImport to combine the GVCFs? GenomicsDB doesn't support the allele-specific annotations just yet, but look for an update very soon: #3707 In the meantime, you can use CombineGVCFs in place of GenomicsDBImport if you need those allele-specific annotations.

landesfeind commented 6 years ago

Thank you for the quick response! Yes, I used GenomicsDBImport and I have seen the issue you mentioned but did not relate to mine.

First, the description stated "allele-specific" so I expected "Number=A" (or R). Second (and regardless of the support by GenomicsDB), I don't think the generated VCF follows specification if "Number=1" and the data gives multiple values. If you are aware of this and this is by design, I can close this issue.

ldgauthier commented 6 years ago

I am aware and it is by design. The raw data for the allele-specific annotations can be lists of lists (e.g. AS_SB_table), but the VCF spec doesn't support that. As such, I chose to declare all the raw data to be strings and let the GATK code have the responsibility of parsing.

landesfeind commented 6 years ago

I see! Thanks so much for explaining it thoroughly!