Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
31 stars 13 forks source link

Mapping Ensembl seqlevels to UCSC for Mmul_10/rheMac10 #79

Open hpages opened 1 year ago

hpages commented 1 year ago

Hi Hervé,

Happy New Year!

I’m working with Macaque single cell ATAC-seq data recently, and the package Signac requires the annotation formatted to UCSC style. However, I encountered the problem when I ran:

> ah= AnnotationHub()
> query(ah, c("EnsDb","Macaca", "mulatta")) 
> annotation <- ah[["AH95772"]]
> seqlevelsStyle(annotation) <- "UCSC"
Error in `seqlevelsStyle<-`(`*tmp*`, value = "UCSC") :
 No mapping of seqlevel styles available in GenomeInfoDb for species Macaca mulatta! Please refer to the Vignette of the GenomeInfoDb package if you would like to provide this mapping.

Could you please let me know if there’s a way to fix this?

Thanks!

Best,

Wen

There's a lot going on here.

The main issue is that this mapping is not properly supported in GenomeInfoDb at the moment.

For example:

library(AnnotationHub)
ah <- AnnotationHub()
ensdb <- ah[["AH95772"]]
x <- seqinfo(ensdb) ## Seqinfo object
> x
Seqinfo object with 329 sequences (1 circular) from Mmul_10 genome:
  seqnames       seqlengths isCircular  genome
  20               77137495      FALSE Mmul_10
  10               99517758      FALSE Mmul_10
  13              108737130      FALSE Mmul_10
  3               185288947      FALSE Mmul_10
  1               223616942      FALSE Mmul_10
  ...                   ...        ...     ...
  QNVO02000404.1      82950      FALSE Mmul_10
  QNVO02000958.1      23808      FALSE Mmul_10
  QNVO02002331.1       8826      FALSE Mmul_10
  QNVO02001857.1      53877      FALSE Mmul_10
  MT                  16564       TRUE Mmul_10

The seqlevelsStyle() getter gets it wrong (the seqnames are the Ensembl seqnames, not the NCBI ones):

> seqlevelsStyle(x)
[1] "NCBI"

And the seqlevelsStyle() setter does a very poor job:

> seqlevelsStyle(x) <- "UCSC"
Warning message:
In (function (seqlevels, genome, new_style)  :
  cannot switch some Mmul_10's seqlevels from NCBI to UCSC style

> x
Seqinfo object with 329 sequences (1 circular) from 2 genomes (Mmul_10, rheMac10):
  seqnames       seqlengths isCircular   genome
  20               77137495      FALSE  Mmul_10
  10               99517758      FALSE  Mmul_10
  13              108737130      FALSE  Mmul_10
  3               185288947      FALSE  Mmul_10
  1               223616942      FALSE  Mmul_10
  ...                   ...        ...      ...
  QNVO02000404.1      82950      FALSE  Mmul_10
  QNVO02000958.1      23808      FALSE  Mmul_10
  QNVO02002331.1       8826      FALSE  Mmul_10
  QNVO02001857.1      53877      FALSE  Mmul_10
  chrM                16564       TRUE rheMac10

> table(genome(x))

 Mmul_10 rheMac10 
     328        1 

Here's how low-level utilities getChromInfoFromEnsembl() and getChromInfoFromUCSC() can be used to do the job. The code is specifically taylored towards Mmul_10/rheMac10 so lacks generality, but it's a start:

## To use on a Seqinfo object only.
switch_Mmul_10_seqlevelsStyle_from_Ensembl_to_UCSC <- function(x)
{
    stopifnot(is(x, "Seqinfo"))
    if (!all(genome(x) %in% "Mmul_10"))
        stop(wmsg("'genome(x)' is not set to Mmul_10"))
    x_seqlevels <- seqlevels(x)
    ens_chrominfo  <- getChromInfoFromEnsembl("Mmul_10")
    ens_seqlevels <- ens_chrominfo[ , "name"]
    m0 <- match(x_seqlevels, ens_seqlevels)
    if (any(isNA(m0)))
        stop(wmsg("not all the seqlevels in 'x' belong to Ensembl Mmul_10"))

    ucsc_chrominfo <- getChromInfoFromUCSC("rheMac10", map.NCBI=TRUE)
    ucsc_seqlevels <- ucsc_chrominfo[ , "chrom"]

    ## Map 'ens_seqlevels' to 'ucsc_seqlevels'.
    pseudo_ucsc_seqlevels <- paste0("chr",  ens_seqlevels)
    pseudo_ucsc_seqlevels[pseudo_ucsc_seqlevels == "chrMT"] <- "chrM"
    m1 <- match(pseudo_ucsc_seqlevels, ucsc_seqlevels)
    m2 <- match(ens_seqlevels, ucsc_chrominfo[ , "NCBI.GenBankAccn"])
    idx <- which(is.na(m1))
    m1[idx] <- m2[idx]
    idx <- which(is.na(m2))
    m2[idx] <- m1[idx]

    ## Sanity checks.
    stopifnot(identical(m1, m2))  
    stopifnot(identical(ens_chrominfo[ , "length"],
                        ucsc_chrominfo[m1 , "size"]))

    ## Map 'x_seqlevels' to 'ucsc_seqlevels'.
    seqlevels(x) <- ucsc_seqlevels[m1[m0]]
    genome(x) <- "rheMac10"
    x
}

Then:

x <- seqinfo(ensdb)
x2 <- switch_Mmul_10_seqlevelsStyle_from_Ensembl_to_UCSC(x)
x2
# Seqinfo object with 329 sequences (1 circular) from rheMac10 genome:
#   seqnames                   seqlengths isCircular   genome
#   chr20                        77137495      FALSE rheMac10
#   chr10                        99517758      FALSE rheMac10
#   chr13                       108737130      FALSE rheMac10
#   chr3                        185288947      FALSE rheMac10
#   chr1                        223616942      FALSE rheMac10
#   ...                               ...        ...      ...
#   chr6_NW_021160188v1_random      82950      FALSE rheMac10
#   chrUn_NW_021160819v1            23808      FALSE rheMac10
#   chrUn_NW_021162185v1             8826      FALSE rheMac10
#   chrUn_NW_021161713v1            53877      FALSE rheMac10
#   chrM                            16564       TRUE rheMac10

Another problem is that EnsDb objects don't support the seqinfo() setter so x2 cannot be put back on ensdb:

> seqinfo(ensdb) <- x2
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'seqinfo<-' for signature '"EnsDb"'

Note that this is a limitation of EnsDb objects (implemented in the ensembldb package), so is kind of a separate issue.

Finally, about this error raised by the seqlevelsStyle() setter for EnsDb objects:

> seqlevelsStyle(ensdb) <- "UCSC"
Error in `seqlevelsStyle<-`(`*tmp*`, value = "UCSC") : 
  No mapping of seqlevel styles available in GenomeInfoDb for species Macaca mulatta! Please refer to the Vignette of the GenomeInfoDb package if you would like to provide this mapping.

I guess the error message refers to the Accept-organism-for-GenomeInfoDb.Rmd vignette. However note that those mappings were introduced a long time ago and were never intended to handle scaffolds, only the "main chromosomes". That's because scaffolds are specific to a particular assembly version (e.g. they're not the same in Mmul_10/rheMac10 and in Mmul_8.0.1/rheMac8). So adding a mapping for Macaca mulatta wouldn't actually help here.

wenhu0701 commented 1 year ago

Hi Hervé, thanks a lot for the solution! I wonder how it works for generating a Granges obj with seqlevels to UCSC format. I tried as the followings and got the warning as you mentioned above. Could you pls help with solving this issue? Thanks!

ah= AnnotationHub() ensdb.mmul <- ah[["AH95772"]] gene.ranges <- GetGRangesFromEnsDb(ensdb = ensdb.mmul) seqlevelsStyle(gene.ranges) <- "UCSC"

Warning message: In (function (seqlevels, genome, new_style) : cannot switch some of Mmul_10's seqlevels from NCBI to UCSC style

wenhu0701 commented 1 year ago

I also tried generating the Grange file with the other method by:

url <- "https://ftp.ensembl.org/pub/release-108/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.108.gtf.gz" download.file(url, c("Mmul_10.108.gtf.gz")) mmul.gtf <- import.gff("Mmul_10.108.gtf.gz") seqlevelsStyle(mmul.gtf) <- "UCSC"

It seems like it works with seqlevels from "ensembl" to "UCSC" without warnings, do you think it's a right format?

mmul.gtf

GRanges object with 1433035 ranges and 21 metadata columns: seqnames ranges strand | source type score phase gene_id gene_version gene_source gene_biotype transcript_id transcript_version transcript_source

| [1] chr1 8231-26653 - | ensembl gene NA ENSMMUG00000023296 4 ensembl protein_coding [2] chr1 8231-26653 - | ensembl transcript NA ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl [3] chr1 26570-26653 - | ensembl exon NA ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl [4] chr1 13491-13554 - | ensembl exon NA ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl [5] chr1 13491-13507 - | ensembl CDS NA 0 ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl ... ... ... ... . ... ... ... ... ... ... ... ... ... ... ... [1433031] QNVO02002478.1 2122-2984 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000081412 1 ensembl [1433032] QNVO02002478.1 21-2984 + | ensembl transcript NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl [1433033] QNVO02002478.1 21-818 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl [1433034] QNVO02002478.1 2122-2447 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl [1433035] QNVO02002478.1 2541-2984 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl transcript_biotype tag exon_number exon_id exon_version protein_id protein_version gene_name transcript_name projection_parent_transcript [1] [2] protein_coding Ensembl_canonical [3] protein_coding Ensembl_canonical 1 ENSMMUE00000287659 3 [4] protein_coding Ensembl_canonical 2 ENSMMUE00000287658 1 [5] protein_coding Ensembl_canonical 2 ENSMMUP00000030665 4 ... ... ... ... ... ... ... ... ... ... ... [1433031] lncRNA Ensembl_canonical 2 ENSMMUE00000441322 1 [1433032] lncRNA [1433033] lncRNA 1 ENSMMUE00000475524 1 [1433034] lncRNA 2 ENSMMUE00000506348 1 [1433035] lncRNA 3 ENSMMUE00000474354 1 ------- seqinfo: 329 sequences from an unspecified genome; no seqlengths