Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

upstream_gene_variant with distance 0 #1147

Closed 0xaf1f closed 1 month ago

0xaf1f commented 2 years ago

Describe the issue

We noticed a variant, a single base insertion after the first position of a gene, getting annotated as an upstream_gene_variant with respect to that gene. The distance is then 0 since it's not actually upstream, but why isn't it instead called a frameshift mutation?

This is the problematic annotation:

A|upstream_gene_variant|MODIFIER|dxs2|Rv3379c|Transcript|CCP46200|protein_coding|||||||||||0|-1||1|insertion|ENA_GENE|

Additional information

System

Full VEP command line

vep --force_overwrite --dir /path/to/.vep --synonyms /path/to/.vep/synonyms.txt --offline --cache --cache_version 30 --species mycobacterium_tuberculosis_h37rv --symbol --variant_class --flag_pick --vcf -i test-case.vcf -o test-case.annotated.vcf

Full error message

N/A

Data files (if applicable)

vep-distance-issue.tar.gz

diegomscoelho commented 2 years ago

Hi @0xaf1f,

We are currently investigating your issue. I will post an answer here shortly.

Thanks for your question.

Regards, @diegomscoelho

nakib103 commented 1 month ago

Hello @0xaf1f,

Sorry for the delay in revisiting this issue. I cannot check the exact example you provided, but I can reproduce the case in human GRCh38 with this input:

1   230714122   .   C   CT

This results in upstream_gene_variant with DISTANCE=0:

#Uploaded_variation Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  Extra
1_230714123_-/T 1:230714122-230714123   T   ENSG00000135744 ENST00000366667 Transcript  upstream_gene_variant   -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=0;STRAND=-1

For insertions this makes sense. The variant position is taken to be the two bases flanking the point where the insertion actually happens (see here), so the start (or end) of the variant position is the same as the transcript start (or end). The inserted sequence itself is outside the transcript, so it is correct to say the effect is upstream.
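As a rough illustration of that convention, here is a minimal Python sketch. The helper name `insertion_flanks` is hypothetical and this is not VEP code; it only assumes a simple VCF insertion where ALT begins with REF, and it reproduces the Location and Allele columns shown in the output above.

```python
from typing import Tuple


def insertion_flanks(pos: int, ref: str, alt: str) -> Tuple[int, int, str]:
    """Return (left_flank, right_flank, inserted_sequence) for a simple VCF insertion.

    Assumption: ALT starts with REF (shared anchor base), as in `C -> CT`.
    """
    if not (alt.startswith(ref) and len(alt) > len(ref)):
        raise ValueError("not a simple anchored insertion")
    left = pos + len(ref) - 1  # last reference base before the inserted sequence
    right = left + 1           # first reference base after the inserted sequence
    return left, right, alt[len(ref):]


# The GRCh38 record from this comment: 1  230714122  .  C  CT
left, right, inserted = insertion_flanks(230714122, "C", "CT")
print(f"1:{left}-{right} {inserted}")  # 1:230714122-230714123 T
```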

I have opened a PR to make this clear in the documentation here: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#defaultout

As the issue has been stale for a long time, I will close it.

Best regards, Nakib

0xaf1f commented 1 month ago

Thanks -- in my example data, the gene for which this was annotated as distance 0 has coordinates 3793257-3794867 on the minus strand.

Because the variant is

1       3794867 .       C       CA 

you're right that it is in fact upstream, just before the start. The distance=0 is still odd, though. I'd have expected a distance of 1 perhaps, but I suppose that's not as big a deal.
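For what it's worth, the 0 does fall out if the distance is measured from the flanking bases rather than from the inserted sequence. A minimal Python sketch of that reading follows. It assumes the distance is the smallest absolute gap between either flanking base and either transcript boundary. That rule and the function name `distance_to_transcript` are my assumptions for illustration, not VEP's actual implementation.

```python
def distance_to_transcript(flank_left: int, flank_right: int,
                           tr_start: int, tr_end: int) -> int:
    """Smallest absolute gap between either flanking base and either
    transcript boundary (assumed rule, for illustration only)."""
    return min(
        abs(v - t)
        for v in (flank_left, flank_right)
        for t in (tr_start, tr_end)
    )


# Gene coordinates from my example: 3793257-3794867 on the minus strand.
# The record `1 3794867 . C CA` inserts A between 3794867 and 3794868.
print(distance_to_transcript(3794867, 3794868, 3793257, 3794867))  # 0
```

Under this assumption the left flanking base coincides with the gene's last genomic position, so the minimum gap is 0 even though the inserted base sits outside the gene.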