Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

HGNC id missing in some variant contexts #1225

Open apregier opened 2 years ago

apregier commented 2 years ago

Describe the issue

With the --symbol option, the hgnc_id is sometimes filled in and sometimes left empty for certain transcripts of certain variants, depending on which variants come before them in the input. I am able to replicate this on the web version as well.

Additional information


System

Full VEP command line

Command line equivalent (copied from website):
./vep --af --appris --biotype --buffer_size 500 --check_existing --distance 5000 --mane --polyphen b --pubmed --regulatory --sift b --species homo_sapiens --symbol --transcript_version --tsl --cache --input_file [input_data] --output_file [output_file] --port 3337

If you copy in the following variants:

chr11 111711449 . C CATTCTTTTTTACTTATTAAA
chr11 112025769 . G GATAAATCTATT
chr11 112064276 . T TAAATAAATA

you will see that the last two variants have consequences on SDHD with the HGNC ID listed as 10683. (Although the symbol source is given as Uniprot_gn, that is the correct HGNC ID for SDHD.)

However, if you submit only the third variant, or only the second and third variants, the HGNC ID field is blank. One of these results has to be incorrect. I assume it is the one where the ID is missing, though there may be some reason why it should be blank; but in that case it should always be blank for the same variant and allele, right?

Full error message

No error message

Data files (if applicable)

Output for all three variants (has HGNC ids): 32RN7DMTzMOH9fxl.vcf.txt

Output for just two variants (missing HGNC ids): ZtMlnYeUXxVdM4mY.vcf.txt
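For anyone wanting to check two such outputs against each other, here is a minimal Python sketch that reports every variant/transcript pair whose HGNC_ID differs between two VEP VCF outputs. It assumes the standard VEP VCF layout (a `##INFO=<ID=CSQ ... Format: ...">` header line and pipe-delimited CSQ entries) and that the CSQ format contains `Feature` and `HGNC_ID` fields. The function names `parse_hgnc_ids` and `hgnc_mismatches` are made up for this illustration and are not part of VEP.

```python
import re


def parse_hgnc_ids(vcf_text):
    """Map (chrom, pos, ref, alt, feature) -> HGNC_ID for one VEP VCF output."""
    fields = None
    result = {}
    for line in vcf_text.splitlines():
        if line.startswith("##INFO=<ID=CSQ"):
            # The CSQ field order is listed after "Format: " in the header.
            match = re.search(r'Format: ([^"]+)"', line)
            if match:
                fields = match.group(1).split("|")
        elif line and not line.startswith("#"):
            if fields is None:
                raise ValueError("CSQ header line not found before data lines")
            cols = line.split("\t")
            chrom, pos, _id, ref, alt = cols[:5]
            for entry in cols[7].split(";"):
                if not entry.startswith("CSQ="):
                    continue
                # One comma-separated CSQ record per transcript/feature.
                for csq in entry[len("CSQ="):].split(","):
                    record = dict(zip(fields, csq.split("|")))
                    key = (chrom, pos, ref, alt, record.get("Feature", ""))
                    result[key] = record.get("HGNC_ID", "")
    return result


def hgnc_mismatches(vcf_a, vcf_b):
    """Return shared variant/feature keys whose HGNC_ID differs between runs."""
    ids_a, ids_b = parse_hgnc_ids(vcf_a), parse_hgnc_ids(vcf_b)
    return {
        key: (ids_a[key], ids_b[key])
        for key in ids_a.keys() & ids_b.keys()
        if ids_a[key] != ids_b[key]
    }
```

Reading the two attached files into strings and passing them to `hgnc_mismatches` should list the SDHD transcripts described above, provided the files follow the assumed layout.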

nuno-agostinho commented 2 years ago

Hey @apregier, thanks for reporting this issue!

I am not really sure why this is giving inconsistent results, but I am going to look into this.

Kind regards, Nuno

nuno-agostinho commented 2 years ago

Hey @apregier! Sorry for the delay.

After checking your results, it seems that SDHD is a gene symbol that matches two different gene identifiers in GRCh37:

In your run, the variants only have consequences for the second gene (ENSG00000255292). As such, the HGNC ID should be empty in your results, which is why the symbol source is displayed as Uniprot_gn instead of HGNC. It seems that VEP is incorrectly filling in the HGNC ID based on the gene symbol instead of the gene identifier.
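To illustrate the general kind of bug described here, the toy Python sketch below shows how a lookup keyed by gene symbol can give order-dependent results when two genes share a symbol, while a lookup keyed by gene identifier does not. This is not VEP's actual code, and the real mechanism may differ. The identifier `GENE_WITH_HGNC` is a placeholder for the other SDHD gene, and both function names are hypothetical.

```python
# Toy data: two genes sharing the symbol SDHD. "GENE_WITH_HGNC" is a
# placeholder identifier, not a real Ensembl ID.
GENES = {
    "GENE_WITH_HGNC": {"symbol": "SDHD", "hgnc_id": "10683"},
    "ENSG00000255292": {"symbol": "SDHD", "hgnc_id": ""},
}


def annotate_by_symbol(gene_ids):
    """Order-dependent: remembers HGNC IDs by symbol across genes."""
    hgnc_by_symbol = {}
    annotated = []
    for gene_id in gene_ids:
        gene = GENES[gene_id]
        if gene["hgnc_id"]:
            hgnc_by_symbol[gene["symbol"]] = gene["hgnc_id"]
        # A gene without its own HGNC ID inherits one from an earlier
        # gene that happens to share its symbol.
        annotated.append((gene_id, hgnc_by_symbol.get(gene["symbol"], "")))
    return annotated


def annotate_by_gene_id(gene_ids):
    """Order-independent: each gene reports only its own HGNC ID."""
    return [(gene_id, GENES[gene_id]["hgnc_id"]) for gene_id in gene_ids]
```

In this toy model, `annotate_by_symbol(["ENSG00000255292"])` leaves the HGNC ID empty, whereas `annotate_by_symbol(["GENE_WITH_HGNC", "ENSG00000255292"])` fills it in for the second gene. `annotate_by_gene_id` leaves it empty in both cases.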

We will try to fix this issue in a future version. Thank you so much for pointing out this issue!

Hope that makes it clear; otherwise, please feel free to reply back. Have a great day!

Cheers, Nuno