Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
31 stars 13 forks source link

seqlevelsStyle setter does not work with specific genome build #13

Closed LiNk-NY closed 4 years ago

LiNk-NY commented 4 years ago

Hi Hervé, @hpages I encountered this issue when working for a fix for TCGAutils in devel. Am I taking the right approach here? Best, Marcel

suppressPackageStartupMessages({
    library(curatedTCGAData)
    library(TCGAutils)
    library(GenomeInfoDb)
})

suppressMessages({
coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD",
    assays = c("CNASeq", "Mutation", "miRNA*",
        "RNASeq2*", "mRNAArray", "Methyl*"), dry.run = FALSE)
})

rag <- "COAD_Mutation-20160128"
grag <- rowRanges(coad[[rag]])

genome(grag)
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" "36" 
#>   17   18   19   20   21   22    X    Y 
#> "36" "36" "36" "36" "36" "36" "36" "36"
genome(grag) <- translateBuild(genome(grag))
genome(grag)
#>      1      2      3      4      5      6      7      8      9     10     11 
#> "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" 
#>     12     13     14     15     16     17     18     19     20     21     22 
#> "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" 
#>      X      Y 
#> "hg18" "hg18"

seqlevelsStyle(grag) <- "UCSC"

seqlevels(grag)
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20" "21" "22" "X"  "Y"

# OTOH

gr <- rowRanges(coad[[rag]])
genome(gr) <- "Homo_sapiens"
seqlevelsStyle(gr) <- "UCSC"

seqlevels(gr)
#>  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
#> [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
#> [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"

Created on 2020-07-06 by the reprex package (v0.3.0)

hpages commented 4 years ago

Marcel,

seqlevelsStyle() now uses the genome field to infer the style so getting it right is important. If you manually set the genome to hg18 you're saying "this is genome hg18" so when you do seqlevelsStyle() <- "UCSC" nothing happens because seqlevelsStyle() sees that this is already a UCSC genome.

So one way to avoid this is to change the genome build after you change the seqlevels style:

grag <- rowRanges(coad[[rag]])
unique(genome(grag))
# [1] "36"
seqlevelsStyle(grag) <- "UCSC"
genome(grag) <- translateBuild(genome(grag))
seqinfo(grag)
# Seqinfo object with 24 sequences from hg18 genome; no seqlengths:
#   seqnames seqlengths isCircular genome
#   chr1           <NA>       <NA>   hg18
#   chr2           <NA>       <NA>   hg18
#   chr3           <NA>       <NA>   hg18
#   chr4           <NA>       <NA>   hg18
#   chr5           <NA>       <NA>   hg18
#   ...             ...        ...    ...
#   chr20          <NA>       <NA>   hg18
#   chr21          <NA>       <NA>   hg18
#   chr22          <NA>       <NA>   hg18
#   chrX           <NA>       <NA>   hg18
#   chrY           <NA>       <NA>   hg18

But It seems that grag is in fact based on assembly NCBI36 so what you should really do is set genome to this instead of 36. This will allow seqlevelsStyle() to do a better job:

grag <- rowRanges(coad[[rag]])
genome(grag) <- "NCBI36"
seqlevelsStyle(grag)
# [1] "NCBI"
seqlevelsStyle(grag) <- "UCSC"
seqinfo(grag)
# Seqinfo object with 24 sequences from hg18 genome; no seqlengths:
#   seqnames seqlengths isCircular genome
#   chr1           <NA>       <NA>   hg18
#   chr2           <NA>       <NA>   hg18
#   chr3           <NA>       <NA>   hg18
#   chr4           <NA>       <NA>   hg18
#   chr5           <NA>       <NA>   hg18
#   ...             ...        ...    ...
#   chr20          <NA>       <NA>   hg18
#   chr21          <NA>       <NA>   hg18
#   chr22          <NA>       <NA>   hg18
#   chrX           <NA>       <NA>   hg18
#   chrY           <NA>       <NA>   hg18

The bottom line is that if you use the official NCBI assembly name (as reported here https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.12/) seqlevelsStyle() will know what to do.

Hope this helps, H.

hpages commented 4 years ago

Hi Marcel @LiNk-NY, hope you were able to sort this out. Don't hesitate to reopen the issue if not.

LiNk-NY commented 4 years ago

Hi Hervé, Thanks for the response. It works for me now. I'm updating some functions in TCGAutils to coincide with this change.