Open hpages opened 1 year ago
Hi Hervé, thanks a lot for the solution! I'm wondering how to generate a GRanges object with its seqlevels in UCSC format. I tried the following and got the warning you mentioned above. Could you please help me solve this issue? Thanks!
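library(AnnotationHub)
library(Signac)        # provides GetGRangesFromEnsDb()
library(GenomeInfoDb)  # provides seqlevelsStyle()
ah <- AnnotationHub()
ensdb.mmul <- ah[["AH95772"]]
gene.ranges <- GetGRangesFromEnsDb(ensdb = ensdb.mmul)
seqlevelsStyle(gene.ranges) <- "UCSC"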
Warning message:
In (function (seqlevels, genome, new_style) :
  cannot switch some of Mmul_10's seqlevels from NCBI to UCSC style
I also tried generating the GRanges object another way:
library(rtracklayer)  # provides import.gff()
url <- "https://ftp.ensembl.org/pub/release-108/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.108.gtf.gz"
download.file(url, "Mmul_10.108.gtf.gz")
mmul.gtf <- import.gff("Mmul_10.108.gtf.gz")
seqlevelsStyle(mmul.gtf) <- "UCSC"
This seems to switch the seqlevels from Ensembl to UCSC style without any warnings. Do you think the result is in the right format?
mmul.gtf
GRanges object with 1433035 ranges and 21 metadata columns:
seqnames ranges strand | source type score phase gene_id gene_version gene_source gene_biotype transcript_id transcript_version transcript_source
[1] chr1 8231-26653 - | ensembl gene NA ENSMMUG00000023296 4 ensembl protein_coding
[2] chr1 8231-26653 - | ensembl transcript NA ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl
[3] chr1 26570-26653 - | ensembl exon NA ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl
[4] chr1 13491-13554 - | ensembl exon NA ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl
[5] chr1 13491-13507 - | ensembl CDS NA 0 ENSMMUG00000023296 4 ensembl protein_coding ENSMMUT00000032773 4 ensembl
... ... ... ... . ... ... ... ... ... ... ... ... ... ... ...
[1433031] QNVO02002478.1 2122-2984 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000081412 1 ensembl
[1433032] QNVO02002478.1 21-2984 + | ensembl transcript NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl
[1433033] QNVO02002478.1 21-818 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl
[1433034] QNVO02002478.1 2122-2447 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl
[1433035] QNVO02002478.1 2541-2984 + | ensembl exon NA ENSMMUG00000059468 1 ensembl lncRNA ENSMMUT00000084852 1 ensembl
transcript_biotype tag exon_number exon_id exon_version protein_id protein_version gene_name transcript_name projection_parent_transcript
[1]
[2] protein_coding Ensembl_canonical
[3] protein_coding Ensembl_canonical 1 ENSMMUE00000287659 3
[4] protein_coding Ensembl_canonical 2 ENSMMUE00000287658 1
[5] protein_coding Ensembl_canonical 2 ENSMMUP00000030665 4
... ... ... ... ... ... ... ... ... ... ...
[1433031] lncRNA Ensembl_canonical 2 ENSMMUE00000441322 1
[1433032] lncRNA
[1433033] lncRNA 1 ENSMMUE00000475524 1
[1433034] lncRNA 2 ENSMMUE00000506348 1
[1433035] lncRNA 3 ENSMMUE00000474354 1
-------
seqinfo: 329 sequences from an unspecified genome; no seqlengths
There's a lot going on here.
The main issue is that this mapping is not properly supported in GenomeInfoDb at the moment.
For example:
The seqlevelsStyle() getter gets it wrong (the seqnames are the Ensembl seqnames, not the NCBI ones):
And the seqlevelsStyle() setter does a very poor job:
Here's how the low-level utilities getChromInfoFromEnsembl() and getChromInfoFromUCSC() can be used to do the job. The code is specifically tailored to Mmul_10/rheMac10, so it lacks generality, but it's a start:
Then:
Another problem is that EnsDb objects don't support the seqinfo() setter, so x2 cannot be put back on ensdb:
Note that this is a limitation of EnsDb objects (implemented in the ensembldb package), so it is kind of a separate issue.
Finally, about this error raised by the seqlevelsStyle() setter for EnsDb objects:
I guess the error message refers to the Accept-organism-for-GenomeInfoDb.Rmd vignette. However, note that those mappings were introduced a long time ago and were never intended to handle scaffolds, only the "main chromosomes". That's because scaffolds are specific to a particular assembly version (e.g. they're not the same in Mmul_10/rheMac10 and in Mmul_8.0.1/rheMac8). So adding a mapping for Macaca mulatta wouldn't actually help here.
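Since scaffold names are specific to one assembly, any translation has to come from a lookup table built for that assembly. Here is a minimal base-R sketch of that idea, not the code used earlier in this thread. The helper name map_seqlevels() and the lookup values are hypothetical placeholders, not real rheMac10 names. A real table would have to be built from the chromosome info of the matching Ensembl and UCSC assemblies.

```r
# Hypothetical per-assembly lookup: names are Ensembl seqnames, values are
# UCSC seqnames. These entries are illustrative placeholders only.
ens2ucsc <- c("1" = "chr1", "2" = "chr2", "MT" = "chrM",
              "scaffoldA.1" = "chrUn_scaffoldAv1")

# Translate seqlevels through the lookup. Seqlevels with no entry are kept
# unchanged and reported in a warning instead of being dropped silently.
map_seqlevels <- function(seqlevels, lookup) {
    hit <- match(seqlevels, names(lookup))
    unmapped <- seqlevels[is.na(hit)]
    if (length(unmapped) > 0L)
        warning("no mapping for: ", paste(unmapped, collapse = ", "))
    new_seqlevels <- ifelse(is.na(hit), seqlevels, unname(lookup[hit]))
    # Return a named vector: old seqlevels as names, new ones as values.
    setNames(new_seqlevels, seqlevels)
}

res <- suppressWarnings(
    map_seqlevels(c("1", "MT", "scaffoldA.1", "scaffoldB.1"), ens2ucsc)
)
```

A named character vector of this shape (old seqlevels as names, new seqlevels as values) is the kind of input renameSeqlevels() accepts, so the result could be applied to a GRanges object that way. Seqlevels that stay unmapped keep their original names.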