Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Outdated Gene name from HGNC for GRCh37 #751

Closed seunghun23 closed 4 years ago

seunghun23 commented 4 years ago

Hi,

I am using the latest version and cache for VEP (v.100) on GRCh37 and noticed that, with the --symbol option, some of the gene names in the SYMBOL column are outdated. I learned from another thread here that this is because the HGNC gene symbols haven't been updated since release 76, and that you recommend using HGNC_ID as a gene identifier instead. However, when I searched for a gene name using the HGNC_ID, the ID was still linked to the outdated gene name for that location.

For example, in the table below, the "Gencode_all_genes" column holds the most up-to-date gene names from Ensembl/Gencode19, which I annotated using the genomic location of the variants. For the variants annotated as MMP23B according to HGNC, looking up their assigned HGNC_ID, 7171, gives me MMP23B, not CDK11B. Given these examples, I don't understand how I am supposed to use HGNC_ID as a gene identifier when it still gives me an outdated gene name. I would really appreciate it if you could help me with this issue.

Best, Seunghun Han

image

helensch commented 4 years ago

Hi

Information on gene naming is available at:
https://grch37.ensembl.org/info/genome/genebuild/gene_names.html

As noted: "..previous symbols will be maintained as ‘synonyms’, however we recommend using the HGNC ID to ensure stability in your pipelines and analyses."

The approved symbol for HGNC_ID 7171 is MMP23B.
The approved symbol for HGNC_ID 1729 is CDK11B.

MMP23B: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/7171
CDK11B: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/1729

So HGNC_ID 7171 does link to the currently approved symbol, MMP23B; it was never an identifier for CDK11B.

VEP reports consequences for overlapping transcripts. For the variant chr1:1571464-1571465 in your example above, VEP reports consequences for transcripts from 2 genes:

Consequence              SYMBOL  Gene             Existing_variation  SYMBOL_SOURCE  HGNC_ID
downstream_gene_variant  MMP23B  ENSG00000189409  rs755205179         HGNC           7171
synonymous_variant       CDK11B  ENSG00000248333  rs755205179         HGNC           1729

Please let us know if you have any further queries.

Regards Helen

seunghun23 commented 4 years ago

Hi Helen,

Thank you for your response. However, my VEP output only contains the first feature, the downstream_gene_variant in MMP23B, and not the correct one in CDK11B. Below is how I run VEP. Is there an option I need so that VEP reports both transcripts instead of just the one I have?

vep -v \
 -i /cromwell_root/fc-secure-ab48ab79-898a-4ab9-91de-f4a1f119d67c/reference_files/test.vcf.gz \
 -o RCC_test.vep.txt \
 --tab \
 --offline --cache --merged --dir opt/vep/.vep \
 --fasta /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/reference/hg19/fasta/Homo_sapiens_assembly19.fasta \
 --use_given_ref /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/reference/hg19/fasta/Homo_sapiens_assembly19.fasta \
 --force_overwrite --stats_text --symbol --canonical --everything \
 --regulatory \
 --total_length --numbers --domains --pick --variant_class --hgvs --hgvsg --ccds \
 --plugin MPC,opt/vep/.vep/Plugins/data/fordist_constraint_official_mpc_values_v2.txt.gz \
 --plugin dbNSFP,opt/vep/.vep/Plugins/data/dbNSFP_hg19.gz,ExAC_Adj_AC,ExAC_nonTCGA_Adj_AC,ExAC_nonTCGA_Adj_AF,ExAC_Adj_AF,gnomAD_genomes_AC,gnomAD_genomes_AN,gnomAD_genomes_AF,REVEL_score,M-CAP_score,MetaSVM_score,MetaLR_score,GenoCanyon_score,integrated_fitCons_score,clinvar_rs,Interpro_domain,GTEx_V6p_gene,GTEx_V6p_tissue \
 --fork 8 \
 --custom /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/vep_92_germline_annotation/clinvar.vcf.gz,ClinVar_updated_2019Dec,vcf,exact,0,ID,ALLELEID,CLNDN,CLNDISDB,CLNHGVS,CLNREVSTAT,CLNSIG,CLNSIGCONF,CLNVI,DBVARID \
 --custom /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/vep_92_germline_annotation/scap_COMBINED_v1.0_vepedit.vcf.gz,SCAP,vcf,exact,0,id,region,rawscore,sensscore,rawscore_dom,sensscore_dom,rawscore_rec,senscore_rec

Best, Seunghun

helensch commented 4 years ago

Hi

Your VEP options include the --pick option.

This option picks one line or block of consequence data per variant, including transcript-specific columns.

If you run without the --pick option, VEP will provide annotation on every genomic feature that each input variant overlaps.

Information and examples on using this option are available at:
https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick
https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_pick
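One way to check whether a variant hits more than one gene once --pick is removed is to group the tab-delimited output by variant. The following is a minimal sketch, not part of VEP itself: it assumes output written with --tab and --symbol, where metadata lines start with "##", the column header line starts with "#Uploaded_variation", and empty fields are shown as "-". The function name `genes_per_variant` and the variant ID `var1` are hypothetical; the inline sample stands in for a real output file.

```python
import csv
import io
from collections import defaultdict


def genes_per_variant(handle):
    """Map each input variant to the set of gene symbols reported for it."""
    header = None
    symbols = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("##"):
            continue  # skip blank lines and "##" metadata lines
        if row[0].startswith("#"):
            # Column header line, e.g. "#Uploaded_variation<TAB>Location<TAB>..."
            header = [name.lstrip("#") for name in row]
            continue
        if header is None:
            raise ValueError("data row found before the column header line")
        record = dict(zip(header, row))
        symbol = record.get("SYMBOL", "-")
        if symbol != "-":  # "-" marks an empty field
            symbols[record["Uploaded_variation"]].add(symbol)
    return dict(symbols)


# Inline sample in place of a real file; "var1" is a made-up variant ID.
sample = (
    "## ENSEMBL VARIANT EFFECT PREDICTOR\n"
    "#Uploaded_variation\tConsequence\tSYMBOL\tHGNC_ID\n"
    "var1\tdownstream_gene_variant\tMMP23B\t7171\n"
    "var1\tsynonymous_variant\tCDK11B\t1729\n"
)
result = genes_per_variant(io.StringIO(sample))
print(result)
```

With a real file, the same function can be called as `genes_per_variant(open("RCC_test.vep.txt", newline=""))`. A variant whose set holds more than one symbol is one where --pick would have hidden at least one gene.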

Regards Helen

seunghun23 commented 4 years ago

Hi Helen,

Thank you for your kind help. I turned off the --pick option and found more features for the same variant, including one in CDK11B. After reading more about the option, I learned that by default, without it, VEP tries to annotate every genomic feature.

A related question: how exactly does VEP assign a gene name to the SYMBOL column? For example, for each genomic feature, does it take the genomic location of the variant, check whether that location falls between the start and end of a transcript or feature, and then put the corresponding gene name in SYMBOL? Also, I am using the merged cache, and I wonder how VEP chooses one SOURCE over another (Ensembl vs RefSeq) when both sources have a gene name corresponding to a variant.

Best, Seunghun

helensch commented 4 years ago

Hi

For each line of VEP output we find the gene associated with the relevant transcript and give the corresponding gene name.

For merged cache files, the gene symbol is still linked in the same way, so a variant that overlaps both an Ensembl and a RefSeq transcript will take symbol info from the gene linked to each transcript.
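As a rough illustration of this per-transcript logic (a toy model, not VEP's actual code), the sketch below emits one row per transcript that a position overlaps or lies near, and takes the symbol from the gene linked to that transcript. Everything in it is hypothetical: the `Transcript` class, the `annotate` function, the transcript IDs, gene names and coordinates. The 5000 bp flank is an assumption meant to echo VEP's documented default upstream/downstream distance, and strand is ignored, so rows are labelled "flanking" rather than upstream or downstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    gene_symbol: str  # symbol of the gene this transcript belongs to
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive


def annotate(position, transcripts, flank=5000):
    """Return one (transcript_id, gene_symbol, relation) row per nearby transcript."""
    rows = []
    for tx in transcripts:
        if tx.start <= position <= tx.end:
            relation = "overlaps"
        elif tx.start - flank <= position <= tx.end + flank:
            relation = "flanking"
        else:
            continue  # too far away: no output row for this transcript
        rows.append((tx.transcript_id, tx.gene_symbol, relation))
    return rows


# Made-up transcripts on a single chromosome.
transcripts = [
    Transcript("TX_A1", "GENE_A", 1_000, 2_000),
    Transcript("TX_B1", "GENE_B", 2_500, 9_000),
    Transcript("TX_C1", "GENE_C", 50_000, 60_000),
]
rows = annotate(3_000, transcripts)
for row in rows:
    print(row)
```

The point of the sketch is that the symbol is never chosen per variant: each output row carries the symbol of its own transcript's gene, so one variant can legitimately appear with several symbols.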

For your example variant (chr1:1571464-1571465), using the merged Ensembl and RefSeq cache, VEP reports consequences for both Ensembl and RefSeq transcripts.

The VEP results (with a subset of fields) for some of the transcripts are shown below:

Consequence              SYMBOL  Gene             Feature_type  Feature            SYMBOL_SOURCE  HGNC_ID
downstream_gene_variant  MMP23B  ENSG00000189409  Transcript    ENST00000512731.1  HGNC           7171
synonymous_variant       CDK11B  ENSG00000248333  Transcript    ENST00000513088.2  HGNC           1729
synonymous_variant       CDK11B  984              Transcript    NM_001787.3        EntrezGene     1729
downstream_gene_variant  MMP23B  8510             Transcript    NM_006983.2        EntrezGene     7171
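The rows above also show why HGNC_ID is a convenient key with the merged cache: the Gene column differs between sources (an Ensembl gene ID versus an Entrez gene ID), while HGNC_ID is the same for both. A minimal sketch of grouping by HGNC_ID, using only the values from the table above; the variable names are illustrative:

```python
from collections import defaultdict

# Rows copied from the table above: (SYMBOL, Gene, Feature, SYMBOL_SOURCE, HGNC_ID)
rows = [
    ("MMP23B", "ENSG00000189409", "ENST00000512731.1", "HGNC", "7171"),
    ("CDK11B", "ENSG00000248333", "ENST00000513088.2", "HGNC", "1729"),
    ("CDK11B", "984", "NM_001787.3", "EntrezGene", "1729"),
    ("MMP23B", "8510", "NM_006983.2", "EntrezGene", "7171"),
]

# Collect everything reported under each HGNC_ID, regardless of source.
by_hgnc = defaultdict(lambda: {"symbols": set(), "gene_ids": set(), "features": []})
for symbol, gene_id, feature, _source, hgnc_id in rows:
    entry = by_hgnc[hgnc_id]
    entry["symbols"].add(symbol)
    entry["gene_ids"].add(gene_id)
    entry["features"].append(feature)

for hgnc_id in sorted(by_hgnc):
    entry = by_hgnc[hgnc_id]
    print(hgnc_id, sorted(entry["symbols"]), sorted(entry["gene_ids"]), entry["features"])
```

Each HGNC_ID ends up with a single symbol but two source-specific gene IDs, which is the grouping one would want when combining Ensembl and RefSeq rows.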

Please let us know if you have any further queries.

Regards Helen

seunghun23 commented 4 years ago

I see. Things are clear now. Thank you so much for your help!

Best, Seunghun

helensch commented 4 years ago

Hi

Thanks for letting me know that helped.

I am going to close this issue now. If you have any other queries, please do open a new issue.

Regards Helen

xiucz commented 1 month ago

Hi,

I am using VEP version 112. When I annotate chr5 37138845 . T C, it returns C5orf42 rather than CPLANE1. I would really appreciate it if you could help me with this issue.

https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:25801

singularity run \
 --bind ${vepdir}:/vep \
 --bind ${vepdb}:/vepdb \
 --bind ${refdir}:/ref \
 --bind ${indir}:/input \
 --bind ${outdir}:/output \
 /VEP/VEP-112/vep.sif vep \
 --cache --offline --format vcf --vcf --force_overwrite --species homo_sapiens --assembly GRCh37 --fork 8 \
 --canonical --pubmed --refseq --hgvs --symbol --transcript_version --no_escape \
 --fasta /ref/ucsc.hg19.fasta \
 --dir /vepdb/ \
 -i /input/vepin.vcf \
 -o /output/vepout.vcf

Best, xiucz