locateVariants returns genes from both Forward and Reverse strands in PRECEDEID and FOLLOWID

amizeranschi commented 2 years ago

Hello,

I'm looking at an intergenic variant (in the bovine genome) and want to figure the genes relative to which it is downstream or upstream, i.e. relative to the gene's position and its strand. I used locateVariants() with region=IntergenicVariants(upstream=1000000, downstream=1000000) in order to do this. However, some results puzzle me.

Here's what my variant looks like in the gene annotation results:

> all_var_df[rownames(all_var_df) == "AX-106756303", ]

             seqnames    start      end width strand   LOCATION LOCSTART LOCEND QUERYID TXID CDSID GENEID    PRECEDEID     FOLLOWID
AX-106756303        1 34617002 34617002     1      * intergenic       NA     NA       1 <NA>         <NA> ENSBTAG0.... ENSBTAG0....

Here's what PRECEDEID looks like:

lapply(all_var_df[rownames(all_var_df) == "AX-106756303", ]$PRECEDEID, function(X) {mapIds( org.Bt.eg.db, keys=X, column="SYMBOL", keytype="ENSEMBL", multiVals="first") } )

ENSBTAG00000019313 ENSBTAG00000016711 ENSBTAG00000003877 ENSBTAG00000044714 ENSBTAG00000020940 ENSBTAG00000020939 ENSBTAG00000001656 ENSBTAG00000045788 ENSBTAG00000006536 
           "ZMIZ1"             "PPIF"          "ZCCHC24"                 NA           "ANXA11"            "PLAC9"          "TMEM254"                 NA             "CL46"

When looking closer at these genes in Ensembl, I notice that they are all located "to the right" of the SNP location on the forward strand and that some of them are on the Forward strand (e.g. ZMIZ1 and PPIF), while others are on the Reverse strand (e.g. ZCCHC24 and PLAC9):

https://oct2018.archive.ensembl.org/Bos_taurus/Location/View?db=core;g=ENSBTAG00000013264;r=28:34948822-35454825;t=ENSBTAT00000046662

This seems a bit confusing. Shouldn't the variant be considered upstream relative to the two genes on the Forward strand, and downstream relative to the genes in the Reverse strand?

How should the PRECEDEID gene list be interpreted, more precisely?

vjcitn commented 2 years ago

Thanks for posting here. I suggest you produce a reproducible example and post to support.bioconductor.org. The developers of VariantAnnotation are no longer engaged in the project and we will have to mobilize community energies as much as possible to sort this out.

amizeranschi commented 2 years ago

Alright, thanks for the advice. In the meantime, I've just discovered the ensemblVEP Bioc package. I will give this a try as well and compare the results, just in case I'm misinterpreting VariantAnnotation's PRECEDEID and FOLLOWID.

I will then make a post on support.bioconductor.org, if this is still warranted.

amizeranschi commented 2 years ago

OK, I tested Ensembl VEP now, the standalone tool (v. 96, installed from Bioconda) and it's giving results as I would expect them. Below is the output in tabular form, for the SNP mentioned in the original post. The upstream / downstream genes make sense to me, according to what I see on the Ensembl link that I posted above. Specifically, ZMIZ1 and PPIF are annotated as upstream_gene_variant, whereas ZCCHC24 and PLAC9 are labeled as downstream_gene_variant, although VariantAnnotation included all four genes into the PRECEDEID list for that variant.

#Uploaded_variation Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  IMPACT  DISTANCE    STRAND  FLAGS   SYMBOL  SYMBOL_SOURCE   HGNC_ID BIOTYPE CLIN_SIG    SOMATIC PHENO
AX-106756303    28:34617002 A   ENSBTAG00000001656  ENSBTAT00000002166  Transcript  downstream_gene_variant -   -   -   -   -   rs110191873 MODIFIER    808158  -1  -   TMEM254 VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000003877  ENSBTAT00000005060  Transcript  downstream_gene_variant -   -   -   -   -   rs110191873 MODIFIER    572620  -1  -   ZCCHC24 VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000013187  ENSBTAT00000017554  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    774051  -1  -   DLG5    VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000013259  ENSBTAT00000017638  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    692019  -1  -   POLR3A  VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000006536  ENSBTAT00000018649  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    974904  1   -   CGN1    EntrezGene  -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000016711  ENSBTAT00000022213  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    550543  1   -   PPIF    VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000019313  ENSBTAT00000025717  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    415453  1   -   ZMIZ1   VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000006536  ENSBTAT00000025788  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    975781  1   -   CGN1    EntrezGene  -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000020939  ENSBTAT00000027890  Transcript  downstream_gene_variant -   -   -   -   -   rs110191873 MODIFIER    749023  -1  -   PLAC9   VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000020940  ENSBTAT00000027893  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    701688  1   -   ANXA11  VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000013264  ENSBTAT00000046662  Transcript  downstream_gene_variant -   -   -   -   -   rs110191873 MODIFIER    685225  1   -   RPS24   EntrezGene  -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000020940  ENSBTAT00000054137  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    701688  1   -   ANXA11  VGNC    -   protein_coding  -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000044714  ENSBTAT00000062147  Transcript  downstream_gene_variant -   -   -   -   -   rs110191873 MODIFIER    694133  -1  -   -   -   -   miRNA   -   -   -
AX-106756303    28:34617002 A   ENSBTAG00000045788  ENSBTAT00000063616  Transcript  upstream_gene_variant   -   -   -   -   -   rs110191873 MODIFIER    816478  1   -   -   -   -   protein_coding  -   -   -

For the record, the source code responsible for computing PRECEDEID and FOLLOWID for intergenic variants can be found here: https://github.com/Bioconductor/VariantAnnotation/blob/master/R/methods-locateVariants.R#L488

So, either there is a bug in VariantAnnotation, or its PRECEDEID and FOLLOWID columns were meant to answer a completely different question than the one answered by Ensembl VEP. However, given the comments in the source code (## upstream == follow: and ## downstream == precede:), this doesn't seem likely to be the case.

@vjcitn I will prepare a reproducible example and post it on support.bioconductor.org, as you suggested, but any other thoughts you have on this would be welcome in the meantime.

Bioconductor / VariantAnnotation

locateVariants returns genes from both Forward and Reverse strands in PRECEDEID and FOLLOWID #55