Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

How to filter the annotated result for each variant? #1782

Open Xi-Cao opened 4 days ago

Xi-Cao commented 4 days ago

Describe the issue

Hi there, thanks for your work on gene annotation. I recently annotated my fine-mapping variants using VEP, and I have some questions about the results (my input was a file of rsIDs):

In the results, I found that there are multiple results for most variants, including:

  1. Same rsid, gene, feature but different allele Uploaded_variation Location Allele Gene Feature 18 rs56222534 1:21863905 A ENSG00000162551 ENST00000374840 19 rs56222534 1:21863905 C ENSG00000162551 ENST00000374840
  2. Same rsid, allele but different gene & feature Uploaded_variation Location Allele Gene Feature 13 rs1256328 1:21896767 T CCDS217.1 CCDS217.1 14 rs1256328 1:21896767 T CCDS53274.1 CCDS53274.1 15 rs1256328 1:21896767 T CCDS53275.1 CCDS53275.1 16 rs1256328 1:21896767 T ENSG00000162551 ENST00000374840 17 rs1256328 1:21896767 T 249 NM_001369803.2
  3. Some results without gene symbol Uploaded_variation Location Allele Gene Feature Feature_type SYMBOL 8 rs1256332 1:21893344 A CCDS217.1 CCDS217.1 Transcript - 9 rs1256332 1:21893344 A CCDS53274.1 CCDS53274.1 Transcript - 10 rs1256332 1:21893344 A CCDS53275.1 CCDS53275.1 Transcript - 11 rs1256332 1:21893344 A ENSG00000162551 ENST00000374840 Transcript ALPL 12 rs1256332 1:21893344 A 249 NM_001369803.2 Transcript ALPL

In these cases, how should I select a single final annotation for each of my input variants?

Thanks, xicao

Additional information

Full VEP command line

vep -i ~/vep/test/vep_snplist -o vep_mesusie_snp_out.tsv \
 --assembly GRCh37 --cache --cache_version 113 --dir ~/vep --everything --tab --fork 4 --force_overwrite --no_stats \
 --fasta ~/vep/homo_sapiens/113_GRCh37/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz \
 --plugin CADD,snv=/data1/resource/vep_105_data/homo_sapiens/Plugins_data.hg19/whole_genome_SNVs.tsv.gz,indels=/data1/resource/vep_105_data/homo_sapiens/Plugins_data.hg19/gnomad.genomes-exomes.r4.0.indel.tsv.gz 

filter_vep -i vep_mesusie_snp_out.tsv -filter "CANONICAL is YES" -o vep_mesusie_snp_filter.tsv --force_overwrite --no_stats
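
The filter_vep call above keeps only rows whose CANONICAL column is YES. A rough Python equivalent for --tab output is sketched below. The sample rows, variant names and transcript IDs are made up for illustration, and keep_canonical is a hypothetical helper, not part of VEP.

```python
import csv
import io

# Made-up sample in the shape of VEP --tab output: '##' metadata lines,
# then a header line that starts with '#', then one row per annotation.
SAMPLE = (
    "## hypothetical metadata line\n"
    "#Uploaded_variation\tAllele\tFeature\tCANONICAL\n"
    "var1\tA\ttx1\tYES\n"
    "var1\tA\ttx2\t-\n"
    "var2\tT\ttx3\t-\n"
)

def keep_canonical(handle):
    """Return the rows whose CANONICAL column equals 'YES'."""
    # Drop the '##' metadata lines so the '#Uploaded_variation' line is the header.
    lines = [line for line in handle if not line.startswith("##")]
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if row.get("CANONICAL") == "YES"]

kept = keep_canonical(io.StringIO(SAMPLE))
for row in kept:
    print(row["#Uploaded_variation"], row["Allele"], row["Feature"])
```

In this made-up sample, var2 has no canonical row and drops out entirely, so filtering this way can remove a variant from the result altogether.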


Xi-Cao commented 4 days ago

Sorry for presenting the results unclearly, here are the revisions:

case1:

 Uploaded_variation   Location Allele            Gene         Feature
18         rs56222534 1:21863905      A ENSG00000162551 ENST00000374840
19         rs56222534 1:21863905      C ENSG00000162551 ENST00000374840

case2:

 Uploaded_variation   Location Allele            Gene         Feature
13          rs1256328 1:21896767      T       CCDS217.1       CCDS217.1
14          rs1256328 1:21896767      T     CCDS53274.1     CCDS53274.1
15          rs1256328 1:21896767      T     CCDS53275.1     CCDS53275.1
16          rs1256328 1:21896767      T ENSG00000162551 ENST00000374840
17          rs1256328 1:21896767      T             249  NM_001369803.2

case3:

Uploaded_variation   Location Allele            Gene         Feature Feature_type SYMBOL
8           rs1256332 1:21893344      A       CCDS217.1       CCDS217.1   Transcript      -
9           rs1256332 1:21893344      A     CCDS53274.1     CCDS53274.1   Transcript      -
10          rs1256332 1:21893344      A     CCDS53275.1     CCDS53275.1   Transcript      -
11          rs1256332 1:21893344      A ENSG00000162551 ENST00000374840   Transcript   ALPL
12          rs1256332 1:21893344      A             249  NM_001369803.2   Transcript   ALPL

dglemos commented 4 days ago

Can you please send a link to the output file? CCDS IDs are not supposed to be in the gene and feature columns.

Which cache file did you download? From the results, it looks like you ran VEP with the merged cache (--merged), because there are RefSeq transcripts in the output (for example, NM_001369803.2). However, your VEP command uses the Ensembl cache (the default).

For case1: rs56222534 (check variant page) has two alternative alleles A and C. VEP returns annotation for each of the alternative alleles in different rows.

case2 and case3 should not have multiple rows if you use the Ensembl cache homo_sapiens_vep_113_GRCh37.tar.gz.
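
The one-row-per-allele behaviour described for case1 can be checked on the tab output with a short script. This is a minimal sketch, not part of VEP: the inline sample reuses the two case1 rows, and alleles_per_variant is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# The two case1 rows from this thread, written as tab-separated --tab output.
SAMPLE = (
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\n"
    "rs56222534\t1:21863905\tA\tENSG00000162551\tENST00000374840\n"
    "rs56222534\t1:21863905\tC\tENSG00000162551\tENST00000374840\n"
)

def alleles_per_variant(handle):
    """Map each Uploaded_variation to the sorted list of alleles seen for it."""
    # Skip '##' metadata lines; the '#Uploaded_variation' line is the header.
    lines = [line for line in handle if not line.startswith("##")]
    reader = csv.DictReader(lines, delimiter="\t")
    seen = defaultdict(set)
    for row in reader:
        seen[row["#Uploaded_variation"]].add(row["Allele"])
    return {variant: sorted(alleles) for variant, alleles in seen.items()}

result = alleles_per_variant(io.StringIO(SAMPLE))
print(result)
```

A variant that maps to more than one allele here is multi-allelic in the annotation, so its extra rows are expected rather than duplicates.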

Xi-Cao commented 4 days ago

Thanks for your reply!

I did download homo_sapiens_merged_vep_113_GRCh37.tar.gz as the cache. So is homo_sapiens_vep_113_GRCh37.tar.gz a more suitable cache? Will using the default --cache option with a merged cache file affect the results? I didn't receive any warnings or errors.

I'll try again with the Ensembl cache homo_sapiens_vep_113_GRCh37.tar.gz. And the variant in case1 does have two alternative alleles. Attached are my annotation results. The filename is slightly different from the one in the command because I modified it.

Thanks again, xicao

vep_mesusie_snp_out2.txt

dglemos commented 4 days ago

I didn't mean to imply that the merged cache is incorrect. I was simply trying to understand which cache was being used, as it’s not immediately clear from the VEP command. If you want to run with homo_sapiens_merged_vep_113_GRCh37.tar.gz, your output will include both Ensembl and RefSeq transcripts. This explains why in case 2 you have the following:

16          rs1256328 1:21896767      T ENSG00000162551 ENST00000374840
17          rs1256328 1:21896767      T             249  NM_001369803.2

and case 3:

11          rs1256332 1:21893344      A ENSG00000162551 ENST00000374840   Transcript   ALPL
12          rs1256332 1:21893344      A             249  NM_001369803.2   Transcript   ALPL
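
The Ensembl/RefSeq split in the rows above can be made explicit when post-processing the output. The sketch below guesses the transcript set from the ID prefix. This is only a heuristic for illustration, not how VEP itself tracks the source, and transcript_source is a hypothetical helper. If the merged-cache output carries a SOURCE column, that is likely the more reliable field to use.

```python
def transcript_source(feature_id):
    """Guess the transcript set from the ID prefix (heuristic, not VEP logic)."""
    if feature_id.startswith("ENST"):
        return "Ensembl"
    # Common RefSeq transcript prefixes: curated (NM_, NR_) and predicted (XM_, XR_).
    if feature_id.startswith(("NM_", "NR_", "XM_", "XR_")):
        return "RefSeq"
    if feature_id.startswith("CCDS"):
        return "CCDS"
    return "other"

# Feature IDs taken from case2 and case3 in this thread.
features = ["CCDS217.1", "ENST00000374840", "NM_001369803.2"]
by_source = {feature: transcript_source(feature) for feature in features}
print(by_source)
```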

Thank you for sending the output file. From this file, I can see that you ran the following command:

vep 
  --assembly GRCh37 
  --cache 
  --cache_version 113 
  --everything 
  --fasta [PATH]/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz 
  --force_overwrite 
  --fork 4 
  --input_file [PATH]/vep_snplist 
  --no_stats 
  --output_file STDOUT 
  --plugin CADD,snv=[PATH]/whole_genome_SNVs.tsv.gz,indels=[PATH]/gnomad.genomes-exomes.r4.0.indel.tsv.gz 
  --tab

Can you please re-run VEP with the following command:

vep \
  --assembly GRCh37 \
  --cache \
  --cache_version 113 \
  --everything \
  --merged \
  --fasta [PATH]/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz \
  --force_overwrite \
  --input_file [PATH]/vep_snplist \
  --output_file output.txt \
  --tab

If vep_snplist is too big, please run a subset of the file. After re-running vep, do you still have CCDS transcripts in the gene column?

Xi-Cao commented 4 days ago

Thanks a lot for your suggestion.

I ran VEP with your command for the first 10 variants, and the CCDS transcripts disappeared from my results (output_filter.txt). Only Ensembl and RefSeq transcripts remained, as you said. Comparing the two similar commands, did the --plugin option or the omission of --merged cause the additional transcripts?

rs1256332   1:21893344  A   ENSG00000162551 ENST00000374840 Transcript  intron_variant
rs1256332   1:21893344  A   249 NM_001369803.2  Transcript  intron_variant

Then I ran it again with the homo_sapiens_vep_113_GRCh37.tar.gz cache file, removing the --merged option. It worked well and included only the Ensembl transcripts. output1_filter.txt

Thanks, xicao

dglemos commented 1 day ago

I'm glad it worked! We don't have any report indicating that the --plugin or --merged options interfere with the output that way. Can you please try with --plugin?

Xi-Cao commented 1 day ago

Thanks~ Following your suggestion, I re-ran VEP with the --merged and --plugin options added, respectively. The CCDS transcripts did not appear on either occasion.

Additionally, it seems that the intergenic variants and regulatory-region variants were not annotated to any gene in any of the results. Is there an option I can use to map these variants to the closest gene with VEP?

Best, xicao

dglemos commented 23 hours ago

Do you mean the output of the vep command or of filter_vep?

Xi-Cao commented 22 hours ago

Thanks for your reply.

I would like the intergenic and regulatory-region variants to be mapped to the nearest gene in the annotation results, instead of having a "-" in the Gene column. For example, when I used ANNOVAR, it displayed the nearby gene and the distance for an intergenic variant. However, in the VEP results, the Gene column shows "-". So I would like to know whether VEP has an option that can annotate the closest gene for these variants.

Best, xicao

dglemos commented 22 hours ago

VEP has the option --distance, which sets the distance upstream and downstream of a transcript within which VEP assigns the upstream_gene_variant or downstream_gene_variant consequence to a variant. By default, this distance is 5000 bp.

To include regulatory information, you can use the option --regulatory.

You can read more about these two options here:
https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_distance
https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_regulatory
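
To illustrate what the distance window means, here is a simplified sketch of how a variant outside a transcript could be labelled. It is not VEP's implementation: flank_consequence is a hypothetical function, the coordinates are made up, and the exact handling of the boundary (whether a variant exactly at the distance limit counts) is an assumption.

```python
def flank_consequence(pos, tx_start, tx_end, strand, distance=5000):
    """Label a variant that lies outside a transcript, or return None.

    pos, tx_start, tx_end are 1-based coordinates with tx_start <= tx_end;
    strand is +1 or -1. Simplified illustration only.
    """
    if tx_start <= pos <= tx_end:
        return None  # variant overlaps the transcript, so it is not flanking
    if pos < tx_start:
        gap = tx_start - pos
        # Lower coordinates are upstream on the forward strand only.
        side = "upstream" if strand == 1 else "downstream"
    else:
        gap = pos - tx_end
        side = "downstream" if strand == 1 else "upstream"
    if gap > distance:
        return None  # outside the window, so no flanking consequence is assigned
    return f"{side}_gene_variant"

# Made-up transcript at 4000-9000 and variants at 1000 and 15000.
print(flank_consequence(1000, 4000, 9000, strand=1))
print(flank_consequence(15000, 4000, 9000, strand=1))
print(flank_consequence(15000, 4000, 9000, strand=1, distance=10000))
```

Under this sketch, widening the window turns a variant that would otherwise get no gene into an upstream or downstream variant of the nearby transcript, which is the effect of raising --distance.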

dglemos commented 22 hours ago

There is also the VEP plugin NearestGene that finds the nearest gene(s). More than one gene may be reported if the genes overlap the variant or if genes are equidistant.
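
The tie behaviour described above (several genes reported when they overlap the variant or are equidistant) can be sketched as follows. This is an illustration of the idea only, not the NearestGene plugin's code: nearest_genes is a hypothetical function, and the gene names and coordinates are made up.

```python
def nearest_genes(pos, genes):
    """Return the sorted names of all genes at the minimal distance from pos.

    genes maps a gene name to a (start, end) interval with start <= end.
    A gene that overlaps pos has distance 0.
    """
    if not genes:
        return []

    def distance_to(interval):
        start, end = interval
        if start <= pos <= end:
            return 0
        return start - pos if pos < start else pos - end

    distances = {name: distance_to(interval) for name, interval in genes.items()}
    best = min(distances.values())
    return sorted(name for name, d in distances.items() if d == best)

# Made-up gene intervals for illustration.
GENES = {"GENE_A": (100, 200), "GENE_B": (300, 400), "GENE_C": (150, 250)}

print(nearest_genes(175, GENES))   # inside both GENE_A and GENE_C
print(nearest_genes(275, GENES))   # 25 bp from both GENE_B and GENE_C
print(nearest_genes(500, GENES))   # closest to GENE_B only
```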

Xi-Cao commented 22 hours ago

I see. Sincere thanks for your kind and helpful suggestions!

Best, xicao
