Open heuermh opened 3 years ago
Hi @heuermh,
By default, VEP uses the CSQ key in the INFO field to write consequence and other annotation data. Because some consumers had issues with downstream tools that require INFO fields with a specific key, we introduced the --vcf_info_field flag to allow users to change this key for ease of integration with these downstream tools.
Setting this value to ANN does not change the formatting of our standard VCF output. However, we do try to match the VCF specification as much as we can, so I would like to investigate any discrepancies that we have here. Thank you for providing the summary table, I'll have a chat with the team about these differences and whether there's anything that we could make clearer.
Could you please give me an example of when the cDNA position provided by VEP doesn't match this specification? I'm unable to reproduce VEP results that provide a start-end result rather than just a start result.
In the meantime, the header lines within VEP output are the canonical source of VEP INFO field descriptions within a particular output file, and more information on any of these output fields can be found here: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output
Kind Regards, Andrew
Hello @aparton, thank you for the clarification!
I thought that the VEP team had a part in drafting the VCF ANN specification, so was confused that switching to use the ANN flag didn't change the fields to match. We can simply use CSQ instead of trying to conditionally parse an ANN field value that doesn't match up with the specification or what snpeff produces.
As an example of what I meant around cDNA position, note below 2334-2337 instead of 2334 or 2334/3583 as would be expected by the VCF ANN specification.
Header
##VEP="v101" time="2020-10-28 16:20:09" cache="/data/vep/homo_sapiens/101_GRCh38" ensembl=
101.856c8e8 ensembl-funcgen=101.b918a49 ensembl-io=101.943b6c2 ensembl-variation=101.50e7372
1000genomes="phase3" COSMIC="90" ClinVar="202003" ESP="V2-SSA137" HGMD-PUBLIC="20194"
assembly="GRCh38.p13" dbSNP="153" gencode="GENCODE 35" genebuild="2014-07" gnomAD="r2.1"
polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=ANN,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP.
Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON
|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation
|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
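Since the header line is the canonical description of the fields, one way to consume the annotation regardless of whether the key is CSQ or ANN is to read the field names from the header and label each pipe-delimited value with them. The following is only a minimal sketch: the function names csq_field_names and parse_annotations are hypothetical, and it assumes that annotations are separated by commas and that values contain no unescaped commas or pipes.

```python
import re

def csq_field_names(header_line):
    """Return the field names listed after 'Format:' in a VEP ##INFO header line."""
    match = re.search(r'Format:\s*([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in header line")
    # strip() removes whitespace left over from a wrapped header
    return [name.strip() for name in match.group(1).split("|")]

def parse_annotations(value, field_names):
    """Split an INFO annotation value into one dict per annotation.

    Assumption: annotations are comma-separated and fields pipe-separated.
    """
    records = []
    for entry in value.split(","):
        values = entry.split("|")
        if len(values) != len(field_names):
            raise ValueError(
                f"expected {len(field_names)} fields, found {len(values)}"
            )
        records.append(dict(zip(field_names, values)))
    return records

# Header and first annotation copied from this comment, joined onto single lines.
HEADER = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Consequence annotations '
    "from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type"
    "|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position"
    "|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND"
    '|FLAGS|SYMBOL_SOURCE|HGNC_ID">'
)
ENTRY = (
    "CACACA|frameshift_variant&stop_lost|HIGH|APP|ENSG00000142192|Transcript|"
    "ENST00000346798|protein_coding|17/18||||2334-2337|2184-2187|728-729|"
    "*E/YVX|taAGAG/taTGTGTG|||-1||HGNC|HGNC:620"
)

names = csq_field_names(HEADER)
record = parse_annotations(ENTRY, names)[0]
print(record["cDNA_position"])  # prints 2334-2337
```

Reading the names from the header rather than hard-coding them means the same code keeps working if the set or order of output fields changes between VEP runs.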
Example row
21 25891745 . TCTCT TCACACA 60.1096 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;
CIGAR=1M2I1M1X1M1X;DP=2;DPB=2.8;DPRA=0;EPP=7.35324;EPPR=0;GTI=0;LEN=7;MEANALT=1;
MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=7.37776;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;
PRO=0;QA=76;QR=0;RO=0;RPL=1;RPP=3.0103;RPPR=0;RPR=1;RUN=1;SAF=1;SAP=3.0103;SAR=1;
SRF=0;SRP=0;SRR=0;TYPE=complex;technology.ILLUMINA=1;
ANN=
CACACA|frameshift_variant&stop_lost|HIGH|APP|ENSG00000142192|Transcript|
ENST00000346798|protein_coding|17/18||||2334-2337|2184-2187|728-729|
*E/YVX|taAGAG/taTGTGTG|||-1||HGNC|HGNC:620,
CACACA|frameshift_variant&stop_lost|HIGH|APP|ENSG00000142192|Transcript|
ENST00000348990|protein_coding|15/16||||2106-2109|1959-1962|653-654|
*E/YVX|taAGAG/taTGTGTG|||-1||HGNC|HGNC:620,
...
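For a consumer that has to accept both styles, the difference in the cDNA position field could be handled with a small parser such as the sketch below. It is not taken from VEP or snpEff; Position and parse_position are hypothetical names, and the handling of "?" as an unknown boundary is an assumption about VEP output that should be checked against real data.

```python
from typing import NamedTuple, Optional

class Position(NamedTuple):
    start: Optional[int]
    end: Optional[int]
    length: Optional[int]  # total length, only present in position/length style

def _to_int(text: str) -> Optional[int]:
    # Assumption: "?" marks an unknown boundary.
    return None if text == "?" else int(text)

def parse_position(value: str) -> Optional[Position]:
    """Parse 'start', 'start-end' or 'position/length' into a Position."""
    if not value:
        return None  # empty field, e.g. an intergenic annotation
    if "/" in value:
        # ANN-specification style: position/length
        pos_text, _, length_text = value.partition("/")
        pos = int(pos_text)
        return Position(pos, pos, int(length_text))
    # VEP style: a single position or a start-end range
    start_text, sep, end_text = value.partition("-")
    start = _to_int(start_text)
    end = _to_int(end_text) if sep else start
    return Position(start, end, None)

print(parse_position("2334-2337"))
print(parse_position("2334/3583"))
```

Both styles map onto the same Position tuple, so downstream code can compare start values without knowing which tool produced the annotation.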
Hello,
Is it the intention of --vcf_info_field ANN to adhere to the VCF ANN specification? While some fields do match, there are several that do not, and, more troublesome for consumers, some that do match are formatted differently (e.g. cDNA position should be start (optionally /length) and instead appears to be start-end).
http://grch37.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_vcf_info_field
http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf
https://pcingola.github.io/SnpEff/se_inputoutput/#ann-field-vcf-output-files
I've attempted to draft a summary table at https://github.com/heuermh/bdg-formats/blob/docs/docs/source/transcript-effects.md