Open amizeranschi opened 2 years ago
Thanks for posting here. I suggest you produce a reproducible example and post to support.bioconductor.org. The developers of VariantAnnotation are no longer engaged in the project and we will have to mobilize community energies as much as possible to sort this out.
Alright, thanks for the advice. In the meantime, I've just discovered the ensemblVEP
Bioc package. I will give this a try as well and compare the results, just in case I'm misinterpreting VariantAnnotation's PRECEDEID and FOLLOWID.
I will then make a post on support.bioconductor.org, if this is still warranted.
OK, I tested Ensembl VEP
now, the standalone tool (v. 96, installed from Bioconda) and it's giving results as I would expect them. Below is the output in tabular form, for the SNP mentioned in the original post. The upstream / downstream genes make sense to me, according to what I see on the Ensembl link that I posted above. Specifically, ZMIZ1 and PPIF are annotated as upstream_gene_variant
, whereas ZCCHC24 and PLAC9 are labeled as downstream_gene_variant
, although VariantAnnotation
included all four genes into the PRECEDEID list for that variant.
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation IMPACT DISTANCE STRAND FLAGS SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CLIN_SIG SOMATIC PHENO
AX-106756303 28:34617002 A ENSBTAG00000001656 ENSBTAT00000002166 Transcript downstream_gene_variant - - - - - rs110191873 MODIFIER 808158 -1 - TMEM254 VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000003877 ENSBTAT00000005060 Transcript downstream_gene_variant - - - - - rs110191873 MODIFIER 572620 -1 - ZCCHC24 VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000013187 ENSBTAT00000017554 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 774051 -1 - DLG5 VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000013259 ENSBTAT00000017638 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 692019 -1 - POLR3A VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000006536 ENSBTAT00000018649 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 974904 1 - CGN1 EntrezGene - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000016711 ENSBTAT00000022213 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 550543 1 - PPIF VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000019313 ENSBTAT00000025717 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 415453 1 - ZMIZ1 VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000006536 ENSBTAT00000025788 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 975781 1 - CGN1 EntrezGene - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000020939 ENSBTAT00000027890 Transcript downstream_gene_variant - - - - - rs110191873 MODIFIER 749023 -1 - PLAC9 VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000020940 ENSBTAT00000027893 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 701688 1 - ANXA11 VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000013264 ENSBTAT00000046662 Transcript downstream_gene_variant - - - - - rs110191873 MODIFIER 685225 1 - RPS24 EntrezGene - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000020940 ENSBTAT00000054137 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 701688 1 - ANXA11 VGNC - protein_coding - - -
AX-106756303 28:34617002 A ENSBTAG00000044714 ENSBTAT00000062147 Transcript downstream_gene_variant - - - - - rs110191873 MODIFIER 694133 -1 - - - - miRNA - - -
AX-106756303 28:34617002 A ENSBTAG00000045788 ENSBTAT00000063616 Transcript upstream_gene_variant - - - - - rs110191873 MODIFIER 816478 1 - - - - protein_coding - - -
For the record, the source code responsible for computing PRECEDEID and FOLLOWID for intergenic variants can be found here: https://github.com/Bioconductor/VariantAnnotation/blob/master/R/methods-locateVariants.R#L488
So, either there is a bug in VariantAnnotation
, or its PRECEDEID and FOLLOWID columns were meant to answer a completely different question than the one answered by Ensembl VEP
. However, given the comments in the source code (## upstream == follow:
and ## downstream == precede:
), this doesn't seem likely to be the case.
@vjcitn I will prepare a reproducible example and post it on support.bioconductor.org, as you suggested, but any other thoughts you have on this would be welcome in the meantime.
Hello,
I'm looking at an intergenic variant (in the bovine genome) and want to figure the genes relative to which it is downstream or upstream, i.e. relative to the gene's position and its strand. I used
locateVariants()
withregion=IntergenicVariants(upstream=1000000, downstream=1000000)
in order to do this. However, some results puzzle me.Here's what my variant looks like in the gene annotation results:
Here's what PRECEDEID looks like:
When looking closer at these genes in Ensembl, I notice that they are all located "to the right" of the SNP location on the forward strand and that some of them are on the Forward strand (e.g. ZMIZ1 and PPIF), while others are on the Reverse strand (e.g. ZCCHC24 and PLAC9):
https://oct2018.archive.ensembl.org/Bos_taurus/Location/View?db=core;g=ENSBTAG00000013264;r=28:34948822-35454825;t=ENSBTAT00000046662
This seems a bit confusing. Shouldn't the variant be considered upstream relative to the two genes on the Forward strand, and downstream relative to the genes in the Reverse strand?
How should the PRECEDEID gene list be interpreted, more precisely?