Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

vcf_info_field ANN output does not adhere to VCF ANN Specification #899

Open heuermh opened 3 years ago

heuermh commented 3 years ago

Hello,

Is it the intention of --vcf_info_field ANN to adhere to the VCF ANN specification? While some fields do match, several do not and, more troublesome for consumers, some that do match are formatted differently (e.g. cDNA position should be start, optionally followed by /length, but instead appears to be start-end).

http://grch37.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_vcf_info_field

http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf

https://pcingola.github.io/SnpEff/se_inputoutput/#ann-field-vcf-output-files

I've attempted to draft a summary table at https://github.com/heuermh/bdg-formats/blob/docs/docs/source/transcript-effects.md

aparton commented 3 years ago

Hi @heuermh,

By default, VEP uses the CSQ key in the INFO field to write consequence and other annotation data. Because some downstream tools require INFO fields with a specific key, we introduced the --vcf_info_field flag to allow users to change this key for ease of integration with those tools.

Setting this value to ANN does not change the formatting of our standard VCF output. However, we do try to match the VCF specification as much as we can, so I would like to investigate any discrepancies that we have here. Thank you for providing the summary table, I'll have a chat with the team about these differences and whether there's anything that we could make clearer.

Could you please give me an example of when the cDNA position provided by VEP doesn't match this specification? I'm unable to reproduce VEP results that provide a start-end result rather than just a start result.

In the meantime, the header lines within VEP output are the canonical source of VEP INFO field descriptions within a particular output file, and more information on any of these output fields can be found here: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output
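Since the header is the canonical description of the fields, one way to stay independent of the INFO key name is to read the field order from the `Format:` declaration instead of hard-coding it. The following is a minimal Python sketch under stated assumptions, not part of VEP: the function names `fields_from_header` and `parse_annotations` are hypothetical, the header is assumed to be a single unwrapped line, and the shortened header and annotation values are made up for illustration.

```python
import re


def fields_from_header(header_line, key="CSQ"):
    """Return the field names declared after 'Format:' in a VEP ##INFO header line."""
    pattern = r'##INFO=<ID=' + re.escape(key) + r',.*Format: ([^"]+)">'
    match = re.search(pattern, header_line)
    if match is None:
        raise ValueError(f"no VEP Format declaration found for INFO key {key!r}")
    return match.group(1).strip().split("|")


def parse_annotations(value, fields):
    """Split an INFO value into one dict per comma-separated annotation."""
    records = []
    for entry in value.split(","):
        if not entry:
            continue  # tolerate a trailing comma
        values = entry.split("|")
        if len(values) != len(fields):
            raise ValueError(
                f"expected {len(fields)} fields but found {len(values)}: {entry!r}"
            )
        records.append(dict(zip(fields, values)))
    return records


# Illustrative, shortened header line and INFO value (not real VEP output).
example_header = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">'
)
example_value = "A|missense_variant|MODERATE|APP,A|intron_variant|MODIFIER|APP"

example_fields = fields_from_header(example_header, key="ANN")
example_records = parse_annotations(example_value, example_fields)
```

Because the field order comes from the file itself, the same code would handle output written under either CSQ or ANN, as long as the key passed in matches the one in the header.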

Kind Regards, Andrew

heuermh commented 3 years ago

Hello @aparton, thank you for the clarification!

I thought that the VEP team had a part in drafting the VCF ANN specification, so I was confused that switching to the ANN key didn't change the fields to match. We can simply use CSQ instead of trying to conditionally parse an ANN field value that doesn't match up with the specification or with what snpeff produces.

As an example of what I meant around cDNA position, note below 2334-2337 instead of 2334 or 2334/3583 as would be expected by the VCF ANN specification.

Header

##VEP="v101" time="2020-10-28 16:20:09" cache="/data/vep/homo_sapiens/101_GRCh38" ensembl=
101.856c8e8 ensembl-funcgen=101.b918a49 ensembl-io=101.943b6c2 ensembl-variation=101.50e7372
 1000genomes="phase3" COSMIC="90" ClinVar="202003" ESP="V2-SSA137" HGMD-PUBLIC="20194"
assembly="GRCh38.p13" dbSNP="153" gencode="GENCODE 35" genebuild="2014-07" gnomAD="r2.1"
polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=ANN,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP.
Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON
|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation
|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">

Example row

21  25891745    .   TCTCT   TCACACA 60.1096 .   AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;
CIGAR=1M2I1M1X1M1X;DP=2;DPB=2.8;DPRA=0;EPP=7.35324;EPPR=0;GTI=0;LEN=7;MEANALT=1;
MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=7.37776;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;
PRO=0;QA=76;QR=0;RO=0;RPL=1;RPP=3.0103;RPPR=0;RPR=1;RUN=1;SAF=1;SAP=3.0103;SAR=1;
SRF=0;SRP=0;SRR=0;TYPE=complex;technology.ILLUMINA=1;

ANN=
CACACA|frameshift_variant&stop_lost|HIGH|APP|ENSG00000142192|Transcript|
ENST00000346798|protein_coding|17/18||||2334-2337|2184-2187|728-729|
*E/YVX|taAGAG/taTGTGTG|||-1||HGNC|HGNC:620,

CACACA|frameshift_variant&stop_lost|HIGH|APP|ENSG00000142192|Transcript|
ENST00000348990|protein_coding|15/16||||2106-2109|1959-1962|653-654|
*E/YVX|taAGAG/taTGTGTG|||-1||HGNC|HGNC:620,

...
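To make the formatting difference concrete for a consumer, here is a minimal Python sketch of reading the two position styles: the `start-end` form seen in the example row above and the `position` or `position/length` form described by the VCF ANN specification. The function names are hypothetical, and the sketch only illustrates the differing syntax; it does not claim that the two tools report equivalent coordinates for the same variant. Treating a non-numeric part as unknown is an assumption.

```python
def split_vep_position(value):
    """Split a VEP-style position such as '2334-2337' or '2334' into (start, end)."""
    if not value:
        return None  # empty field, e.g. for non-coding features

    def to_int(text):
        # Assumption: anything that is not a plain number is treated as unknown.
        return int(text) if text.isdigit() else None

    start, _, end = value.partition("-")
    start_pos = to_int(start)
    end_pos = to_int(end) if end else start_pos
    return (start_pos, end_pos)


def split_ann_position(value):
    """Split an ANN-specification position such as '2334' or '2334/3583' into (position, length)."""
    if not value:
        return None
    position, _, length = value.partition("/")
    return (int(position), int(length) if length else None)
```

A parser that expects the ANN form would fail on `2334-2337`, since `int("2334-2337")` raises a `ValueError`, which is why the key name alone is not enough to decide how to read the value.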