Closed seunghun23 closed 4 years ago
Hi
Information on gene naming is available at: https://grch37.ensembl.org/info/genome/genebuild/gene_names.html As noted: "..previous symbols will be maintained as ‘synonyms’, however we recommend using the HGNC ID to ensure stability in your pipelines and analyses."
The approved symbol for HGNC_ID 7171 is MMP23B: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/7171
The approved symbol for HGNC_ID 1729 is CDK11B: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/1729
HGNC_ID 7171 therefore links to the current approved symbol, MMP23B.
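As a rough sketch of keying on the HGNC ID rather than the symbol, the snippet below checks each row's SYMBOL against a small ID-to-symbol lookup. The lookup holds only the two ID and symbol pairs quoted above; the two-column input layout and the helper name `check_symbols` are assumptions for illustration, and a real lookup would be built from an HGNC download.

```shell
# HGNC ID -> approved symbol lookup (only the two entries quoted in this thread).
lookup=$(mktemp)
printf '7171\tMMP23B\n1729\tCDK11B\n' > "$lookup"

# Hypothetical helper: read tab-separated "SYMBOL<TAB>HGNC_ID" rows on stdin and
# append the approved symbol plus a status (match, mismatch, or unknown_id).
check_symbols() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { approved[$1] = $2; next }          # first file: load the lookup
    !($2 in approved) { print $0, "NA", "unknown_id"; next }
    { print $0, approved[$2], ($1 == approved[$2] ? "match" : "mismatch") }
  ' "$lookup" -
}

# The symbol and ID pairs quoted above.
printf 'MMP23B\t7171\nCDK11B\t1729\n' | check_symbols
```

Because the join is on the ID, a row carrying a superseded symbol would be reported as a mismatch together with the approved symbol, rather than being silently treated as a different gene.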
VEP reports consequences for overlapping transcripts. For the variant chr1:1571464-1571465 in your example above, VEP reports consequences for transcripts from 2 genes:
Consequence | SYMBOL | Gene | Existing_variation | SYMBOL_SOURCE | HGNC_ID |
---|---|---|---|---|---|
downstream_gene_variant | MMP23B | ENSG00000189409 | rs755205179 | HGNC | 7171 |
synonymous_variant | CDK11B | ENSG00000248333 | rs755205179 | HGNC | 1729 |
Please let us know if you have any further queries.
Regards Helen
Hi Helen,
Thank you for your response. However, my VEP output contains only the first feature, the downstream_gene_variant in MMP23B, and not the correct one in CDK11B. Below is how I run VEP. Is there an option I need to set so that VEP reports both transcripts instead of only one?
vep -v -i /cromwell_root/fc-secure-ab48ab79-898a-4ab9-91de-f4a1f119d67c/reference_files/test.vcf.gz -o RCC_test.vep.txt \
--tab \
--offline --cache --merged --dir opt/vep/.vep --fasta /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/reference/hg19/fasta/Homo_sapiens_assembly19.fasta \
--use_given_ref /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/reference/hg19/fasta/Homo_sapiens_assembly19.fasta --force_overwrite --stats_text --symbol --canonical --everything \
--regulatory \
--total_length --numbers --domains --pick --variant_class --hgvs --hgvsg --ccds --plugin MPC,opt/vep/.vep/Plugins/data/fordist_constraint_official_mpc_values_v2.txt.gz --plugin dbNSFP,opt/vep/.vep/Plugins/data/dbNSFP_hg19.gz,ExAC_Adj_AC,ExAC_nonTCGA_Adj_AC,ExAC_nonTCGA_Adj_AF,ExAC_Adj_AF,gnomAD_genomes_AC,gnomAD_genomes_AN,gnomAD_genomes_AF,REVEL_score,M-CAP_score,MetaSVM_score,MetaLR_score,GenoCanyon_score,integrated_fitCons_score,clinvar_rs,Interpro_domain,GTEx_V6p_gene,GTEx_V6p_tissue --fork 8 \
--custom /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/vep_92_germline_annotation/clinvar.vcf.gz,ClinVar_updated_2019Dec,vcf,exact,0,ID,ALLELEID,CLNDN,CLNDISDB,CLNHGVS,CLNREVSTAT,CLNSIG,CLNSIGCONF,CLNVI,DBVARID --custom /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/vep_92_germline_annotation/scap_COMBINED_v1.0_vepedit.vcf.gz,SCAP,vcf,exact,0,id,region,rawscore,sensscore,rawscore_dom,sensscore_dom,rawscore_rec,senscore_rec
Best, Seunghun
Hi
Your VEP options include the --pick option
This option picks one line or block of consequence data per variant, including transcript-specific columns.
If you run without the --pick option VEP will provide annotation on every genomic feature that each input variant overlaps.
Information and examples on using this option are available at: https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_pick
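To check whether an output file has been collapsed to one row per variant, a quick sketch is to count annotation rows and distinct gene symbols per variant. The column order assumed here (Consequence, SYMBOL, Gene, Existing_variation, SYMBOL_SOURCE, HGNC_ID) is the subset of fields shown earlier in this thread, not the full VEP tab layout, and the helper name `rows_per_variant` is hypothetical.

```shell
# Hypothetical helper: for tab-separated rows with the columns
# Consequence, SYMBOL, Gene, Existing_variation, SYMBOL_SOURCE, HGNC_ID,
# print "variant<TAB>row_count<TAB>distinct_symbol_count", sorted by variant.
rows_per_variant() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                                  # skip header lines
    {
      rows[$4]++                                   # rows per Existing_variation
      if (!(($4, $2) in seen)) { seen[$4, $2] = 1; genes[$4]++ }
    }
    END { for (v in rows) print v, rows[v], genes[v] }
  ' "$@" | sort
}

# The two rows reported for the example variant in this thread.
printf 'downstream_gene_variant\tMMP23B\tENSG00000189409\trs755205179\tHGNC\t7171\nsynonymous_variant\tCDK11B\tENSG00000248333\trs755205179\tHGNC\t1729\n' \
  | rows_per_variant
```

For the two rows above this reports 2 rows and 2 distinct symbols for rs755205179. A file in which every variant has a count of 1 is consistent with a run that used --pick.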
Regards Helen
Hi Helen,
Thank you for your kind help. I turned off the --pick option and now find more features for the same variant, including one in CDK11B. After reading more about the option, I learned that by default, without --pick, VEP annotates every genomic feature that a variant overlaps.
A related question: how exactly does VEP assign the gene name in the SYMBOL column? For example, for each genomic feature, does it take the genomic location of the variant, check whether that location falls between the start and end of a transcript or feature, and then put the corresponding gene name in SYMBOL?
Also, I am using the merged cache, and I wonder how VEP chooses one source over another (Ensembl vs RefSeq) when both sources have a gene name for a variant.
Best, Seunghun
Hi
For each line of VEP output we find the gene associated with the relevant transcript and give the corresponding gene name.
For merged cache files, the gene symbol is still linked in the same way, so a variant that overlaps both an Ensembl and a RefSeq transcript will take symbol info from the gene linked to each transcript.
For your example variant (chr1:1571464-1571465) using the merged Ensembl and RefSeq cache VEP reports consequences for both Ensembl and RefSeq transcripts.
The VEP results (with a subset of fields) for some of the transcripts are shown below:

Consequence | SYMBOL | Gene | Feature_type | Feature | SYMBOL_SOURCE | HGNC_ID |
---|---|---|---|---|---|---|
downstream_gene_variant | MMP23B | ENSG00000189409 | Transcript | ENST00000512731.1 | HGNC | 7171 |
synonymous_variant | CDK11B | ENSG00000248333 | Transcript | ENST00000513088.2 | HGNC | 1729 |
synonymous_variant | CDK11B | 984 | Transcript | NM_001787.3 | EntrezGene | 1729 |
downstream_gene_variant | MMP23B | 8510 | Transcript | NM_006983.2 | EntrezGene | 7171 |
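
As an illustration of how rows from the two transcript sets can be lined up, the sketch below labels each row by the prefix of its Feature ID and prints the HGNC_ID next to it. The prefix rule (ENST for Ensembl; NM_, NR_, XM_, XR_ for RefSeq), the tab-separated layout with the seven columns of the table above, and the helper name `label_source` are assumptions for illustration, not part of VEP.

```shell
# Hypothetical helper: input columns are Consequence, SYMBOL, Gene, Feature_type,
# Feature, SYMBOL_SOURCE, HGNC_ID (tab-separated). Output is
# "source<TAB>HGNC_ID<TAB>SYMBOL<TAB>Feature", with the source guessed from the
# transcript ID prefix.
label_source() {
  awk -F'\t' -v OFS='\t' '
    $5 ~ /^ENST/           { print "Ensembl", $7, $2, $5; next }
    $5 ~ /^(NM|NR|XM|XR)_/ { print "RefSeq",  $7, $2, $5; next }
                           { print "other",   $7, $2, $5 }
  ' "$@"
}

# The four rows from the table above.
printf 'downstream_gene_variant\tMMP23B\tENSG00000189409\tTranscript\tENST00000512731.1\tHGNC\t7171\nsynonymous_variant\tCDK11B\tENSG00000248333\tTranscript\tENST00000513088.2\tHGNC\t1729\nsynonymous_variant\tCDK11B\t984\tTranscript\tNM_001787.3\tEntrezGene\t1729\ndownstream_gene_variant\tMMP23B\t8510\tTranscript\tNM_006983.2\tEntrezGene\t7171\n' \
  | label_source
```

In these rows the Gene column differs between sources (an ENSG identifier versus a numeric gene ID), while the HGNC_ID is the same for both, so it is the column that ties the Ensembl and RefSeq rows for one gene together.
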
Please let us know if you have any further queries.
Regards Helen
I see. Things are clear now. Thank you so much for your help!
Best, Seunghun
Hi
Thanks for letting me know that helped.
I am going to close this issue now. If you have any other queries, please do open a new issue.
Regards Helen
Hi,
VEP version 112
When I use VEP to annotate the variant chr5 37138845 . T C, it returns C5orf42 rather than CPLANE1. I would really appreciate it if you could help me with this issue.
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:25801
singularity run \
--bind ${vepdir}:/vep \
--bind ${vepdb}:/vepdb \
--bind ${refdir}:/ref \
--bind ${indir}:/input \
--bind ${outdir}:/output \
/VEP/VEP-112/vep.sif vep \
--cache --offline --format vcf --vcf --force_overwrite --species homo_sapiens --assembly GRCh37 --fork 8 \
--canonical --pubmed --refseq --hgvs --symbol --transcript_version --no_escape \
--fasta /ref/ucsc.hg19.fasta \
--dir /vepdb/ \
-i /input/vepin.vcf \
-o /output/vepout.vcf
Best, xiucz
Hi,
I am using the latest version and cache for VEP (v.100) on GRCh37 and noticed that, when using the --symbol option, some of the gene names in the SYMBOL column are outdated. I learned from another thread here that this is because the HGNC gene symbols have not been updated since release 76, and that you recommend using HGNC_ID as the gene identifier instead. However, when I searched for the gene name using the HGNC_ID, the ID was still linked to the outdated gene name for that location. For example, in the table below, the gene names in the "Gencode_all_genes" column are the most up-to-date gene names from Ensembl/Gencode19, which I annotated using the genomic location of the variants. For the variants annotated as MMP23B according to HGNC, looking up their assigned HGNC_ID, 7171, gives me MMP23B and not CDK11B. Given these examples, I don't understand how I am supposed to use HGNC_ID as a gene identifier when it still gives me an outdated gene name. I would really appreciate it if you could help me with this issue.
Best, Seunghun Han