Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
31 stars 13 forks source link

Mitocondrial chromosome naming, seqlevelsStyle() #8

Open Roleren opened 5 years ago

Roleren commented 5 years ago

I thanks for a nice function, I use seqlevelsStyle a lot.

One thing has always bothered me, is for the mitocondrial chromosome.

Two objects with same naming convention as these:

> seqlevelsStyle(bamFile)
[1] "UCSC"
> seqlevelsStyle(cds)
[1] "UCSC"

Can show seqlevels as this:

seqlevels(bamFile)
[1] "MT" 
seqlevels(cds)
[1] "chrM"

So even though they use the same seqlevelsStyle, the objects does not name the mitochondrial chromosome the same. Is there a reason for this ? :)

hpages commented 5 years ago

I can't reproduce this :(

library(GenomicRanges)
seqlevelsStyle(GRanges("chrM:1-5"))
# [1] "UCSC"
seqlevelsStyle(GRanges("MT:1-5"))
# [1] "NCBI"    "Ensembl"

sessionInfo()

R version 3.6.0 Patched (2019-05-02 r76454)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-3.6.r76454/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.6.r76454/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicRanges_1.39.1 GenomeInfoDb_1.22.0  IRanges_2.21.1      
[4] S4Vectors_0.24.0     BiocGenerics_0.32.0 

loaded via a namespace (and not attached):
[1] zlibbioc_1.32.0        compiler_3.6.0         XVector_0.26.0        
[4] GenomeInfoDbData_1.2.2 RCurl_1.95-4.12        bitops_1.0-6          
Roleren commented 5 years ago

Basic reproduction is this:

# Normal case works
NCBI <- GRanges(seqnames = c("MT", paste0(1:2)), 1)
seqlevelsStyle(NCBI)
# "NCBI"    "Ensembl"
seqlevelsStyle(NCBI) <- "UCSC"
seqlevels(NCBI)
# [1] "chrM" "chr1" "chr2"

# This does not ( This is the case for my bam file)
mixed <- GRanges(seqnames = c("MT", paste0("chr", 1:2)), 1)
seqlevelsStyle(mixed)
# "UCSC"
seqlevelsStyle(mixed) <- "UCSC"
seqlevels(mixed)
# [1] "MT"    "chr1"  "chr2"  (<- It should swap MT for chrM here I think when you say "UCSC")

And this one gives a warning

# One of each (this works)
mixed <- GRanges(seqnames = c("MT", paste0("chr", 1)), 1)
seqlevelsStyle(mixed)
"NCBI"    "UCSC"    "Ensembl"
seqlevelsStyle(mixed) <- "UCSC"
# Warning message:
# In .replace_seqlevels_style(x_seqlevels, value) :
#   found more than one best sequence renaming map compatible with seqname style "UCSC" for # this object, using the first one
seqlevels(mixed)
# [1] "chrM" "chr1"
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ORFik_1.7.6                 reshape2_1.4.3              rtracklayer_1.46.0         
 [4] GenomicAlignments_1.22.1    Rsamtools_2.2.1             Biostrings_2.54.0          
 [7] XVector_0.26.0              SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
[10] BiocParallel_1.20.0         matrixStats_0.55.0          GenomicFeatures_1.38.0     
[13] AnnotationDbi_1.48.0        Biobase_2.46.0              GenomicRanges_1.38.0       
[16] GenomeInfoDb_1.22.0         IRanges_2.20.1              S4Vectors_0.24.0           
[19] BiocGenerics_0.32.0         ggplot2_3.2.1               data.table_1.12.6          

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2     progress_1.2.2        
 [5] httr_1.4.1             tools_3.6.0            backports_1.1.5        R6_2.4.1              
 [9] rpart_4.1-15           Hmisc_4.3-0            DBI_1.0.0              lazyeval_0.2.2        
[13] colorspace_1.4-1       nnet_7.3-12            withr_2.1.2            tidyselect_0.2.5      
[17] gridExtra_2.3          prettyunits_1.0.2      GGally_1.4.0           DESeq2_1.26.0         
[21] bit_1.1-14             curl_4.2               compiler_3.6.0         htmlTable_1.13.2      
[25] labeling_0.3           scales_1.1.0           checkmate_1.9.4        genefilter_1.68.0     
[29] askpass_1.1            rappdirs_0.3.1         stringr_1.4.0          digest_0.6.23         
[33] foreign_0.8-72         base64enc_0.1-3        pkgconfig_2.0.3        htmltools_0.4.0       
[37] dbplyr_1.4.2           htmlwidgets_1.5.1      rlang_0.4.2            rstudioapi_0.10       
[41] RSQLite_2.1.2          farver_2.0.1           acepack_1.4.1          dplyr_0.8.3           
[45] RCurl_1.95-4.12        magrittr_1.5           GenomeInfoDbData_1.2.2 Formula_1.2-3         
[49] Matrix_1.2-18          Rcpp_1.0.3             munsell_0.5.0          lifecycle_0.1.0       
[53] stringi_1.4.3          zlibbioc_1.32.0        plyr_1.8.4             BiocFileCache_1.10.2  
[57] grid_3.6.0             blob_1.2.0             crayon_1.3.4           lattice_0.20-38       
[61] splines_3.6.0          annotate_1.64.0        hms_0.5.2              locfit_1.5-9.1        
[65] zeallot_0.1.0          knitr_1.26             pillar_1.4.2           geneplotter_1.64.0    
[69] biomaRt_2.42.0         XML_3.98-1.20          glue_1.3.1             latticeExtra_0.6-28   
[73] BiocManager_1.30.10    vctrs_0.2.0            gtable_0.3.0           openssl_1.4.1         
[77] purrr_0.3.3            reshape_0.8.8          assertthat_0.2.1       xfun_0.11             
[81] xtable_1.8-4           survival_3.1-7         tibble_2.1.3           memoise_1.1.0         
[85] cluster_2.1.0      
Roleren commented 4 years ago

So to sum up, this is the case:

seqlevelsStyle support shifting of style when the cases are clear.

When there are a mix of styles with only 2 chromosomes, it switches correctly but gives you a warning.

When there are a mix of styles with > 2 chromosomes, it switches incorrectly. This is the case I would like, and the most logical is that it behaves as the case for 2 chromosomes, giving a warning, but switching correctly.

That is: if : "MT", "chr1" gives -> "chrM" "chr1" then: "MT", "chr1", "chr2" should give -> "chrM" "chr1", "chr2"

Which does not happen now.

Roleren commented 4 years ago

I think this is quite a serious issue, I get it all the time warnings like this: Warning messages:

1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
- in 'x': chrY, chrX
 - in 'y': X, Y

This is after they should have had equal seqlevels naming.

So I think a lot of people are losing the sex chromosomes + sometimes mitochondrial, if they use different seqnames and trust that seqlevelsstyle will fix it.