broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.72k stars 594 forks source link

Annotation feature request: ability to annotate "set" wrt resource files #2489

Open vdauwera opened 7 years ago

vdauwera commented 7 years ago

Feature request

Since CombineVariants will not be ported, we need equivalent functionality to its ability to annotate "set", ie which callset(s) a site is present in. Here is an excerpt from a tutorial that describes this functionality in action:


To find out which set each variant belongs to, we can use CombineVariants. CombineVariants has a way to annotate each site with which set the site belongs to. For example, if a site is in GIAB and failed hard filtering but passed VQSR, CombineVariants will annotate the site with set=G-filterInH-V. The "filterIn" flag before the filtering method tells us the site failed the filtering method, hence it was "filtered" in the set.

java -jar GenomeAnalysisTK.jar \
-T CombineVariants \
-R ref/human_g1k_b37_20.fasta \
-V:G truth_dataset/NA12878.GIAB.vcf \
-V:H vcfs/NA12878.hard.filtered.vcf \
-V:V vcfs/NA12878.VQSR.filtered.vcf \
-o sandbox/NA12878.Combined.vcf 

The set-annotated VCF looks like this:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT INTEGRATION NA12878
20 61795 rs4814683 G T 2034.16 PASS AC=2;AF=0.500;AN=4;(...);set=Intersection​ ​ GT:AD:ADALL:DP :GQ:PL 0/1:218,205:172,169:769:99 0/1:30,30:.:60:99:1003,0,1027

In this record, "set=Intersection​" indicates this record was present and unfiltered in all callsets considered.

Here is a key of all the possible combinations for this 3-way venn:

Meaning Annotation
In GIAB only G
In GIAB and failed VQSR only G-H-filterInV
In GIAB and failed both hard filtering and VQSR G-filterInH-filterInV
In GIAB and failed hard filtering only G-filterInH-V
In GIAB and passed both hard filtering and VQSR Intersection
Not in GIAB and failed VQSR only H-filterInV
Not in GIAB and failed both hard filtering and VQSR FilteredInAll
Not in GIAB and failed hard filtering only filterInH-V
Not in GIAB and passed both hard filtering and VQSR H-V
vdauwera commented 7 years ago

Would be awesome to implement this in a way that allows running the annotator again with a new resource callset to update the set annotation.

The one limitation of this approach I can think of, compared to the combineVariants functionality, is that it is callset-centric, ie it won't tell us what is present in the resources (and potentially common to multiple resources) but not in our input callset. But I can live with that as long as it is well documented.

vdauwera commented 7 years ago

A user notes rightly that it would be more convenient to have the sets annotated using a comma-separated list instead of a single string with dashes, so they can be parsed more readily.

droazen commented 7 years ago

Adding to annotation epic #3274