Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

locateVariant from VariantAnnotation returns wrong results #76

Closed nickhir closed 8 months ago

nickhir commented 8 months ago

(I have also posted this issue on the Bioconductor Forum but was not sure if it is monitored by the authors, so I am reposting it here)

Hi all,

I am using VariantAnnotation::locateVariants to get the location (i.e. intron, intergeneic, ...) of a list of SNPs. I have noticed however, that the results returned by wrong, when I manually look them up in the UCSC genome browser. Here is one such example (I am using R.4.2.2, TxDb.Hsapiens.UCSC.hg19.knownGene 3.2.2 and VariantAnnotation 1.44.1):

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

snp <- GRanges(seqnames="chr1", ranges = 109706393) # corresponds to SNP rs649539
locateVariants(snp, txdb, AllVariants(), ignore.strand=T)

returns the following result:

GRanges object with 10 ranges and 9 metadata columns:
       seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID
          <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character>
   [1]     chr1 109706393      + |   intron     49314     49314         1        1785
   [2]     chr1 109706393      + |   intron     49314     49314         1        1786
   [3]     chr1 109706393      + |   intron    135483    135483         1        1786
   [4]     chr1 109706393      + |   intron     49435     49435         1        1787
   [5]     chr1 109706393      + |   intron     49314     49314         1        1788
   [6]     chr1 109706393      + |   intron     49314     49314         1       74189
   [7]     chr1 109706393      + |   intron     49314     49314         1       74190
   [8]     chr1 109706393      + |   intron    135483    135483         1       74190
   [9]     chr1 109706393      + |   intron     49435     49435         1       74191
  [10]     chr1 109706393      + |   intron    135191    135191         1       74191
               CDSID      GENEID       PRECEDEID        FOLLOWID
       <IntegerList> <character> <CharacterList> <CharacterList>
   [1]                      5825                                
   [2]                      5825                                
   [3]                      5825                                
   [4]                      5825                                
   [5]                      5825                                
   [6]                    150365                                
   [7]                    150365                                
   [8]                    150365                                
   [9]                    150365                                
  [10]                    150365                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Gene ID 5825 corresponds to the ABCD3 gene and gene id 150365 corresponds to the MEI1 gene. However, if I go to the position of the SNP in the UCSC genome browser, it shows me that I am in an intronic region of ELAPOR1 (see image below).

region

Furthermore, if I use txdb to look up the chromosome on which these genes are located I get the following results:

select(txdb, keys = c("5825", "150365"), columns="TXCHROM", keytype="GENEID")

leads to:

  GENEID TXCHROM
1   5825    chr1
2 150365   chr22

I would very much appreciate any insights into what is happening here, because it seems like the tool is just completely off in this case.

nickhir commented 8 months ago

I received an excellent answer on the Bioconductor forum and am currently using the following workaround: I would greatly appreciate if you can quickly have a look at it and tell me if this is correct:

cache <- new.env(parent = emptyenv())

# CodingVariants
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
z <- cdsBy(txdb)
z <- z[lengths(z) > 0L]
cache[["cdsbytx"]] <- z

# IntronVariants
z <- intronsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["intbytx"]] <- z

# ThreeUTRVariants 
z <- threeUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["threeUTRbytx"]] <- z

# FiveUTRVariants 
z <- fiveUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["fiveUTRbytx"]] <- z

# IntergenicVariants 
z <- transcriptsBy(txdb, "gene")
z <- z[lengths(z) > 0L]
cache[["txbygene"]]  <- z

# SpliceSiteVariants 
# no need to populate cache further ?

# PromoterVariants 
z <- transcripts(txdb)
z <- z[lengths(z) > 0L]
cache[["tx"]] <- splitAsList(z, seq_len(length(z))) 
names(cache[["tx"]]) <- z$tx_id

snp <- GRanges(seqnames = "chr22", ranges = 50301422)
locateVariants(snp, txdb, AllVariants(), cache = cache)