broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Design a scheme for storing (potentially per-transcript) functional annotations in the VCF INFO field #3282

Closed droazen closed 7 years ago

droazen commented 7 years ago

We can look at what similar tools have done:

SnpEff in particular already has a scheme for annotating the VCF INFO field with info from all transcripts.

jonn-smith commented 7 years ago

SnpEff has a document with a reasonable spec for annotations:

http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf

jonn-smith commented 7 years ago

Based on the comments so far, it looks like the format we're going to use adds the annotations to the INFO field, grouped by allele and then by transcript. The specifics of annotation names and delimiters can be debated later (and easily changed).

For example (newlines added for readability):

22 21807795 . G C,A . . DP=1000;ECNT=40;IN_PON;NLOD=59.66,68.06;N_ART_LOD=8.88,1.93;TLOD=16.45,36.72;TLOD_FWD=-1.392e+00;TLOD_REV=17.84;TUMOR_SB_POWER_FWD=0.558;TUMOR_SB_POWER_REV=0.724;
VC=
C|missense_variant|MODERATE|MAPK1|ENSG00000100030|Transcript|ENST00000215832|protein_coding|2/9||||360|171|57|S/R|agC/agG|||-1||HGNC|HGNC:6871||Ensembl
C|missense_variant|MODERATE|MAPK1|ENSG00000100030|Transcript|ENST00000398822|protein_coding|2/8||||411|171|57|S/R|agC/agG|||-1||HGNC|HGNC:6871||Ensembl
C|missense_variant|MODERATE|MAPK1|ENSG00000100030|Transcript|ENST00000544786|protein_coding|2/7||||171|171|57|S/R|agC/agG|||-1||HGNC|HGNC:6871||Ensembl
C|missense_variant|MODERATE|MAPK1|5594|Transcript|NM_002745.4|protein_coding|2/9||||411|171|57|S/R|agC/agG|||-1||EntrezGene|HGNC:6871|rseq_mrna_match|RefSeq
C|missense_variant|MODERATE|MAPK1|5594|Transcript|NM_138957.3|protein_coding|2/8||||411|171|57|S/R|agC/agG|||-1||EntrezGene|HGNC:6871|rseq_mrna_match|RefSeq
A|synonymous_variant|LOW|MAPK1|ENSG00000100030|Transcript|ENST00000215832|protein_coding|2/9||||360|171|57|S|agC/agT|||-1||HGNC|HGNC:6871||Ensembl
A|synonymous_variant|LOW|MAPK1|ENSG00000100030|Transcript|ENST00000398822|protein_coding|2/8||||411|171|57|S|agC/agT|||-1||HGNC|HGNC:6871||Ensembl
A|synonymous_variant|LOW|MAPK1|ENSG00000100030|Transcript|ENST00000544786|protein_coding|2/7||||171|171|57|S|agC/agT|||-1||HGNC|HGNC:6871||Ensembl
A|synonymous_variant|LOW|MAPK1|5594|Transcript|NM_002745.4|protein_coding|2/9||||411|171|57|S|agC/agT|||-1||EntrezGene|HGNC:6871|rseq_mrna_match|RefSeq
A|synonymous_variant|LOW|MAPK1|5594|Transcript|NM_138957.3|protein_coding|2/8||||411|171|57|S|agC/agT|||-1||EntrezGene|HGNC:6871|rseq_mrna_match|RefSeq
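
To make the "group by allele, then transcript" idea concrete, here is a rough Python sketch of how a consumer might read the value of the `VC` key back into a nested structure. It is only an illustration, not GATK code: the function name `group_annotations`, the field positions (allele in field 0, transcript ID in field 6, read off the example above), and the entry separator (commas or whitespace) are all assumptions, since the delimiters are still up for debate. The entries in the sample string are shortened copies of two entries from the record above.

```python
import re
from collections import defaultdict

# Hypothetical field positions, read off the example record above.
ALLELE_FIELD = 0
TRANSCRIPT_FIELD = 6


def group_annotations(vc_value):
    """Group pipe-delimited annotation entries by allele, then transcript ID."""
    grouped = defaultdict(dict)
    # Assumption: entries are separated by commas and/or whitespace.
    for entry in re.split(r"[,\s]+", vc_value.strip()):
        if not entry:
            continue
        fields = entry.split("|")
        grouped[fields[ALLELE_FIELD]][fields[TRANSCRIPT_FIELD]] = fields
    return dict(grouped)


# Two entries from the example, truncated after the biotype field for brevity.
example = (
    "C|missense_variant|MODERATE|MAPK1|ENSG00000100030|Transcript|"
    "ENST00000215832|protein_coding "
    "A|synonymous_variant|LOW|MAPK1|ENSG00000100030|Transcript|"
    "ENST00000215832|protein_coding"
)

by_allele = group_annotations(example)
print(by_allele["C"]["ENST00000215832"][1])  # missense_variant
print(by_allele["A"]["ENST00000215832"][2])  # LOW
```

Keying on the allele first means a caller can pull every transcript-level consequence for one ALT allele in a single lookup, which matches the ordering of the entries in the example.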