Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

How to annotate my variants to uniprot-defined canonical proteins? #1449

Open Heredia-Maria opened 1 year ago

Heredia-Maria commented 1 year ago

Hi dear Ensembl team,

I'm having some issues because I would like to annotate my genetic variants based on the canonical isoforms from UniProt, but I can't find a way to do it. Sometimes the MANE SELECT transcript or Ensembl Canonical does not match the preferred isoform entry in UniProt. This is a problem because I want to annotate UniProt features afterward, so I need the sequence being annotated by VEP to match that of UniProt. Can you help me to do that please? I am trying different ways... But I want to be sure the process is being done correctly.

I'm using VEP release 109

My command is as follows:

./vep --offline --cache --fa Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --format vcf --tab -i ./../Escritorio/DataBasesData/clinvar_20230213.vcf --show_ref_allele --total_length --mane --verbose --variant_class --force_overwrite --hgvs --symbol --uniprot --gencode_basic --canonical --biotype --exclude_predicted --no_intergenic --protein --shift_3prime 1 --pick --pick_order ccds,mane_plus_clinical,mane_select,canonical,appris,tsl,biotype,rank,length --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNDN -o NMD_20230213_output_prioritization_by_ccds.txt -plugin Downstream -plugin ProteinSeqs,references.fa,mutated.fa -plugin NMD

My input is a VCF file from ClinVar.

Thank you in advance.

likhitha-surapaneni commented 1 year ago

Hi @Heredia-Maria, a UniProt isoform is listed only when there is an exact match to the transcript, whether it is canonical or not. This may be the reason why some of the canonical transcripts do not have an entry for UniProt isoform.

Kind regards, Likhitha

Heredia-Maria commented 1 year ago

Hi @likhitha-surapaneni,

First, thank you very much for your response. I don't understand exactly whether the problem is that there is no exact correspondence between Ensembl and UniProt isoforms, or that I am not using the command line correctly.

I think the issue is better illustrated with an example, such as variants mapping to SCN5A. Here, when I modify the --pick_order option, I get my variants annotated either within ENST00000413689.6 (MANE Plus Clinical, UniProt ID: H9KVD2) or ENST00000423572.7 (MANE Select, UniProt ID: Q14524-2). However, I would like to annotate them within the canonical isoform selected by UniProt, which is ENST00000333535.9 (UniProt ID: Q14524-1). Is there a way to annotate my variants following this criterion of UniProt-defined canonical?

Kind regards, María.

likhitha-surapaneni commented 1 year ago

Hi @Heredia-Maria , thank you for providing us with an example. I would suggest the following if you would like annotations only for UniProt-defined canonical:

This should provide you annotations only for UniProt-defined canonical ids.
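As an illustration only, and not necessarily the steps suggested here, the following is a minimal Python sketch of one way to keep only annotations on UniProt-defined canonical isoforms. It assumes VEP was run with --tab and --uniprot and without --pick, so that every transcript consequence is reported together with a UNIPROT_ISOFORM column. The function name `filter_vep_tab` and the `canonical_isoforms` set are hypothetical; the set would have to be built separately from UniProt. The example rows are toy data using the SCN5A identifiers from this thread.

```python
def filter_vep_tab(lines, canonical_isoforms, column="UNIPROT_ISOFORM"):
    """Return VEP tab rows (as dicts) whose isoform column matches the given set.

    lines: iterable of lines from a VEP --tab output file.
    canonical_isoforms: set of UniProt isoform IDs to keep (assumed input).
    """
    header = None
    idx = None
    kept = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and VEP metadata lines
        if line.startswith("#"):
            # The column header line starts with a single '#'
            header = line.lstrip("#").split("\t")
            if column not in header:
                raise ValueError(f"column {column!r} not found; was --uniprot used?")
            idx = header.index(column)
            continue
        if header is None:
            raise ValueError("data row found before the header line")
        fields = line.split("\t")
        # Assumption: several isoform IDs, if present, are comma-separated
        isoforms = set(fields[idx].split(","))
        if isoforms & canonical_isoforms:
            kept.append(dict(zip(header, fields)))
    return kept


# Toy input: one variant reported on three SCN5A transcripts
EXAMPLE_LINES = [
    "## toy VEP-style output for illustration",
    "#Uploaded_variation\tFeature\tUNIPROT_ISOFORM",
    "var1\tENST00000413689\t-",
    "var1\tENST00000423572\tQ14524-2",
    "var1\tENST00000333535\tQ14524-1",
]

if __name__ == "__main__":
    for row in filter_vep_tab(EXAMPLE_LINES, {"Q14524-1"}):
        print(row["Uploaded_variation"], row["Feature"], row["UNIPROT_ISOFORM"])
```

For a real file, the lines could come from `open("output.txt")`. Variants whose gene has no transcript matching a canonical isoform ID are dropped entirely by this kind of filter.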

Kind regards, Likhitha

Heredia-Maria commented 1 year ago

Hi @likhitha-surapaneni ,

That's a great idea! I will try it and give you my feedback. Thank you very much.

María.

Heredia-Maria commented 1 year ago

Hi @likhitha-surapaneni

I have tried the approach you proposed, implemented in Perl, and it works. The problem now is that I am losing some proteins because Ensembl does not have annotation for the UniProt-defined canonical isoforms. A good example of this is NLRP3 (ENSG00000162711, UniProt ID: Q96P20). In this case, Ensembl only has TrEMBL isoforms. I think this would be more complicated, but is there any way to retrieve these proteins?

Thank you very much, Kind regards.

María.