Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Stop_gained not reported #1668

Closed KevinMer closed 1 month ago

KevinMer commented 2 months ago

Describe the issue

Hello,

I have two questions about the consequences VEP reports for two variants. Stop_gained is not reported for either of them, although each mutation leads to a shortened transcript. I'm only looking at NM_004448.4 and NM_002524.5.

I extracted the CDS from RefSeq for both variants and then used ORFfinder from NCBI to get the protein from the wild-type sequence and from the modified one. The attached FASTA file contains the corresponding amino acid sequences.

As you can see, both are shortened transcripts. Why are they not reported as stop_gained? Besides, for the ERBB2 variant (chr17 37880991), stop_lost is reported, which I do not understand. Is there a reason for that?

Thanks a lot for your help.

Best regards,

Kevin

Additional information

Full VEP command line

vep \
--fork 8 \
--offline \
--refseq \
--cache \
--dir_cache //mnt/beegfs/EH/VEP/ \
--cache_version 111 \
--fasta /mnt/beegfs/common/annotations/pipelines_EH/Human/hg19/hg19.fa \
--format vcf \
--vcf \
--hgvsp_use_prediction \
--hgvs \
-i test.vcf \
-o test_vep.vcf

Data files (if applicable)

They include:

nakib103 commented 2 months ago

Hello @KevinMer,

Thanks for your query!

For the first variant, judging from the FASTA file you provided, the transcript does not appear to be shortened (1777 aa > 1255 aa). So it is correct to report stop_lost.

For the second variant, it is a frameshift_variant, which in this case shortens the transcript. A frameshift already implies shortening or elongation of the transcript. We do not currently report further consequences such as stop_gained or stop_lost on top of that. We will discuss within the team whether to change this behaviour in the future.
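
As a rough illustration of why a frameshift call already covers a change in protein length, the sketch below classifies a coding indel by whether the number of inserted or deleted bases is a multiple of three. `classify_indel` is a hypothetical helper written for this thread, not part of VEP, and it assumes VCF-style REF/ALT alleles that lie entirely within the coding sequence.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding indel by whether it preserves the reading frame.

    Hypothetical helper for illustration only; this is not VEP's logic.
    Assumes both alleles lie entirely within the coding sequence.
    """
    delta = abs(len(ref) - len(alt))
    if delta == 0:
        return "not_an_indel"
    # A length change that is not a multiple of 3 shifts the reading frame.
    return "inframe" if delta % 3 == 0 else "frameshift"


print(classify_indel("A", "AT"))    # frameshift (1 base inserted)
print(classify_indel("ACGT", "A"))  # inframe (3 bases deleted)
```

Once the frame is shifted, every downstream codon changes, so the position of the next stop codon, and with it the protein length, is a consequence of the frameshift itself.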

Hope that answers your question!

Best regards, Nakib

KevinMer commented 2 months ago

Hi @nakib103,

thanks for your reply.

For the first variant, I made a mistake in the FASTA file. The protein has a length of 1177 aa, not 1777. So it is shorter than the wild-type one.

For the second variant, ok I see thanks for this answer.

Could you check the first variant again now that I have corrected my mistake?

Thanks a lot for your help.

Best regards, Kevin

nakib103 commented 1 month ago

Hi @KevinMer,

Thanks for correcting the length 👍. I have re-visited the issue and VEP logic for assigning stop_lost.

In VEP, we do not try to work out the resulting sequence length for indel variants. The reason is that it can be tricky to figure out what will happen in vivo. For example, how far should we look for a stop codon in the altered sequence? Does the ribosome use the next naturally occurring stop codon? Does the transcript get picked up by the NMD pathway, so that the variant potentially has no impact?

So, currently, VEP reports stop_lost if an indel removes the current stop codon, regardless of whether another stop codon is gained further in or further out. As you pointed out, there are tools to work out the mutated transcript and its length.

In the future, we might look into improving this behaviour as more is understood about the biology. Thanks for reporting it to us.
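
For readers who want to check the mutated sequence themselves, here is a minimal sketch of that kind of comparison. It is not VEP's implementation: the toy sequence, the coordinates, and the helpers `translate` and `overlaps` are all made up for illustration. It shows a deletion that overlaps the original stop codon while the altered sequence still terminates at a later in-frame stop.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def translate(seq: str) -> tuple[str, bool]:
    """Translate from position 0 up to the first in-frame stop codon.

    Returns the protein and whether a stop codon was reached.
    """
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            return "".join(protein), True
        protein.append(aa)
    return "".join(protein), False


def overlaps(start_a: int, len_a: int, start_b: int, len_b: int) -> bool:
    """True if two half-open intervals [start, start + len) intersect."""
    return start_a < start_b + len_b and start_b < start_a + len_a


# Toy transcript (hypothetical): CDS "ATG GCT TGT TAA" followed by 3' UTR "GGTGACC".
wild_type = "ATGGCTTGTTAAGGTGACC"
stop_start = 9                 # 0-based index of the original TAA
del_start, del_len = 9, 2      # delete "TA" from the stop codon

mutant = wild_type[:del_start] + wild_type[del_start + del_len:]

print(translate(wild_type))                      # ('MAC', True)
print(translate(mutant))                         # ('MACR', True)
print(overlaps(del_start, del_len, stop_start, 3))  # True
```

In this toy case the original stop codon is disrupted, yet translation ends one residue later at a TGA in the former UTR. Whether the resulting protein ends up longer or shorter than the wild type depends entirely on where the next in-frame stop falls in the altered sequence.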

Best regards, Nakib

KevinMer commented 1 month ago

Hi @nakib103,

thanks a lot for your answer. Everything is clear now.

Have a nice day !

Best regards

Kevin