broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Sanity-check version in GVCFs before joint genotyping? #2129

Open vdauwera opened 7 years ago

vdauwera commented 7 years ago

A user rightly points out that different versions of HaplotypeCaller may produce GVCFs that are not directly compatible, causing weirdness when you joint-genotype them with GenotypeGVCFs.

Obviously this is primarily a data management problem (user should control what's in their pipeline) -- but it would be good to provide an additional safety layer by having GenotypeGVCFs, CombineGVCFs or whatever demon is used to invoke TileDB at least emit a WARN message if they see GVCFs produced by different versions of HC within the same input cohort.

Note that the VCF version number is not directly usable for this purpose, since the contents of GVCFs can change within the same version of the VCF spec.

Also, one could argue that the GVCFs really should all be produced using exactly the same command line arguments -- but validating the entire command line would probably be overkill...
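For illustration, the version check described above could be sketched roughly as follows. This is a minimal Python sketch, not GATK code. It assumes the HaplotypeCaller version is recorded in a `##GATKCommandLine` header line with `ID=HaplotypeCaller` and a `Version=` field; the exact header layout may differ between GATK releases. The function names `hc_version` and `check_hc_versions` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: HaplotypeCaller records itself in a ##GATKCommandLine header line
# containing ID=HaplotypeCaller and a Version= field (quoted or unquoted).
_HC_LINE = re.compile(r'^##GATKCommandLine[^=]*=<.*\bID=HaplotypeCaller\b')
_VERSION = re.compile(r'\bVersion="?([^",>]+)"?')


def hc_version(header_lines):
    """Return the HaplotypeCaller version found in a VCF header, or None."""
    for line in header_lines:
        if _HC_LINE.search(line):
            match = _VERSION.search(line)
            if match:
                return match.group(1)
    return None


def check_hc_versions(headers):
    """Compare HaplotypeCaller versions across a cohort.

    headers maps an input name to its list of header lines.
    Returns a list of warning strings, empty if every input agrees.
    """
    by_version = defaultdict(list)
    for name, lines in headers.items():
        by_version[hc_version(lines)].append(name)
    if len(by_version) <= 1:
        return []
    summary = "; ".join(
        f"{version or 'unknown'}: {', '.join(sorted(names))}"
        for version, names in sorted(by_version.items(), key=lambda kv: str(kv[0]))
    )
    return [
        "WARN: input GVCFs were produced by different "
        f"HaplotypeCaller versions ({summary})"
    ]
```

Inputs with no recognizable HaplotypeCaller line are grouped under "unknown" rather than rejected, which matches the intent of warning instead of failing.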

vdauwera commented 7 years ago

@yfarjoun I believe you always have opinions about validation, wdyt?

yfarjoun commented 7 years ago

Nice idea. I think the GATK version of HC should be the same across all inputs, or else: warning.

In fact, the headers could be compared further, for example to check that the bands are equal.

yfarjoun commented 7 years ago

The annotations should all be the same, too.

vdauwera commented 7 years ago

Yeah I was thinking of same annotations too.
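The additional header comparisons suggested in this thread (equal bands, same annotations) could be sketched along these lines. This is a Python sketch under stated assumptions, not GATK code: it assumes bands are declared in header lines starting with `##GVCFBlock` and treats the set of `##INFO` and `##FORMAT` IDs as a stand-in for "the annotations". The names `header_fingerprint` and `compare_headers` are hypothetical.

```python
import re

# Matches ##INFO=<ID=...> and ##FORMAT=<ID=...> header lines.
_ANNOTATION = re.compile(r'^##(INFO|FORMAT)=<ID=([^,>]+)')


def header_fingerprint(header_lines):
    """Return (bands, annotations) extracted from a VCF header.

    Assumption: GVCF bands appear as header lines starting with ##GVCFBlock.
    """
    bands = frozenset(
        line.strip() for line in header_lines if line.startswith("##GVCFBlock")
    )
    annotations = frozenset(
        (m.group(1), m.group(2))
        for m in (_ANNOTATION.match(line) for line in header_lines)
        if m
    )
    return bands, annotations


def compare_headers(headers):
    """Compare every input header against the first (by sorted name).

    headers maps an input name to its list of header lines.
    Returns a list of warning strings, empty if all inputs agree.
    """
    names = sorted(headers)
    warnings = []
    if not names:
        return warnings
    ref_name = names[0]
    ref_bands, ref_annotations = header_fingerprint(headers[ref_name])
    for name in names[1:]:
        bands, annotations = header_fingerprint(headers[name])
        if bands != ref_bands:
            warnings.append(
                f"WARN: {name} and {ref_name} declare different GVCF bands"
            )
        # Symmetric difference: IDs present in one header but not the other.
        differing = annotations ^ ref_annotations
        if differing:
            ids = ", ".join(sorted(f"{kind}/{ident}" for kind, ident in differing))
            warnings.append(
                f"WARN: {name} and {ref_name} differ in annotations: {ids}"
            )
    return warnings
```

Comparing only annotation IDs is deliberately loose; a stricter check could also compare the Number and Type fields of each declaration.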

droazen commented 7 years ago

Assigning to @lbergelson as part of his GenotypeGVCFs work. This is a check that could be added after we tie-out that tool.