Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome

Clarify documentation of 'exclude' slot in a BSParams #1

Closed PeteHaitch closed 4 years ago

PeteHaitch commented 6 years ago

The exclude slot in a BSParams is documented as:

... a character vector with strings that will be used to filter out chromosomes whose names match these strings.

From that, I thought bsapply() treated it as a string literal. In fact, bsapply() treats it as a regular expression:

https://github.com/Bioconductor/BSgenome/blob/3d109889dc90277ed85b15fc4fa7a4668c0974a1/R/bsapply.R#L78
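To see why the distinction matters, here is a minimal base R sketch that needs no genome package. The sequence names and the `is_excluded()` helper are hypothetical, and the filtering logic is only an assumption about how unanchored pattern matching behaves, not a copy of the bsapply() source.

``` r
# Hypothetical sequence names, chosen to mimic hg38-style naming.
seqs <- c("chr1", "chr7", "chr17", "chr17_KI270729v1_random")

# Exclude everything except chr17, as in the report.
exclude <- setdiff(seqs, "chr17")

# Assumed logic: a name is dropped if ANY exclude pattern matches it.
is_excluded <- function(patterns, x) {
  hits <- lapply(patterns, grepl, x = x)
  Reduce(`|`, hits, rep(FALSE, length(x)))
}

# Regular-expression matching: "chr1" and "chr7" both match inside
# "chr17", so chr17 is dropped along with everything else.
regex_kept <- seqs[!is_excluded(exclude, seqs)]

# Exact literal matching: only names identical to an exclude entry go.
literal_kept <- seqs[!seqs %in% exclude]

print(regex_kept)
print(literal_kept)
```

Under these assumptions the regular-expression version keeps nothing, while the exact-match version keeps only chr17.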

I ended up spending a fair bit of time banging my head against this in the example below.

The examples on ?bsapply suggest that treating it as a regular expression is the intended behaviour, and give a nice demonstration of when this behaviour is useful, so I think this is the correct behaviour. But perhaps the documentation could be updated to make this clearer?

``` r
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))

# I was expecting to just get the matches for chr17 but got nothing!
bsp1 <- new("BSParams", 
            X = BSgenome.Hsapiens.UCSC.hg38, 
            FUN = matchPattern,
            exclude = setdiff(seqlevels(BSgenome.Hsapiens.UCSC.hg38), "chr17"))
bsapply(bsp1, pattern = "CG")
#> named list()

# Making it a regular expression gave me the desired result.
bsp2 <- bsp1
bsp2@exclude <- paste0("^", bsp1@exclude, "$")
bsapply(bsp2, pattern = "CG")
#> $chr17
#>   Views on a 83257441-letter DNAString subject
#> subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
#> views:
#>              start      end width
#>       [1]    60054    60055     2 [CG]
#>       [2]    60141    60142     2 [CG]
#>       [3]    60168    60169     2 [CG]
#>       [4]    60201    60202     2 [CG]
#>       [5]    60210    60211     2 [CG]
#>       ...      ...      ...   ... ...
#> [1248324] 83245477 83245478     2 [CG]
#> [1248325] 83245632 83245633     2 [CG]
#> [1248326] 83246061 83246062     2 [CG]
#> [1248327] 83246281 83246282     2 [CG]
#> [1248328] 83247017 83247018     2 [CG]
```

Created on 2018-09-17 by the reprex package (v0.2.1)
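A slightly more defensive version of the `paste0("^", ..., "$")` workaround is sketched below. `anchor_exact()` is a hypothetical helper, not part of BSgenome; it escapes common regular-expression metacharacters before anchoring, which matters for sequence names containing a `.` (the name `KI270729.1` is only an illustrative example).

``` r
# Hypothetical helper: turn literal names into anchored regular
# expressions that match those names exactly.
anchor_exact <- function(x) {
  # Prefix common regex metacharacters with a backslash.
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
  paste0("^", escaped, "$")
}

patterns <- anchor_exact(c("chr1", "KI270729.1"))
print(patterns)

# "chr1" no longer matches inside "chr17", and "." is now a literal dot.
print(grepl(patterns[1], c("chr1", "chr17")))
print(grepl(patterns[2], c("KI270729.1", "KI270729x1")))

# Possible use with the example above (not run here):
# bsp1@exclude <- anchor_exact(bsp1@exclude)
```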

Session info

``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R version 3.5.1 (2018-07-02)
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  tz       Australia/Melbourne
#>  date     2018-09-17
#> Packages -----------------------------------------------------------------
#>  package                     * version   date       source
#>  backports                     1.1.2     2017-12-13 CRAN (R 3.5.0)
#>  base                        * 3.5.1     2018-07-05 local
#>  Biobase                       2.41.2    2018-07-18 Bioconductor
#>  BiocGenerics                * 0.27.1    2018-06-17 Bioconductor
#>  BiocParallel                  1.15.12   2018-09-13 Bioconductor
#>  Biostrings                  * 2.49.1    2018-08-04 Bioconductor
#>  bitops                        1.0-6     2013-08-17 CRAN (R 3.5.0)
#>  BSgenome                    * 1.49.3    2018-07-27 Bioconductor
#>  BSgenome.Hsapiens.UCSC.hg38 * 1.4.1     2017-11-13 Bioconductor
#>  compiler                      3.5.1     2018-07-05 local
#>  datasets                    * 3.5.1     2018-07-05 local
#>  DelayedArray                  0.7.41    2018-09-14 Bioconductor
#>  devtools                      1.13.6    2018-06-27 CRAN (R 3.5.0)
#>  digest                        0.6.17    2018-09-12 CRAN (R 3.5.1)
#>  evaluate                      0.11      2018-07-17 CRAN (R 3.5.0)
#>  GenomeInfoDb                * 1.17.1    2018-05-11 Bioconductor
#>  GenomeInfoDbData              1.1.0     2017-12-16 Bioconductor
#>  GenomicAlignments             1.17.3    2018-07-18 Bioconductor
#>  GenomicRanges               * 1.33.13   2018-08-04 Bioconductor
#>  graphics                    * 3.5.1     2018-07-05 local
#>  grDevices                   * 3.5.1     2018-07-05 local
#>  grid                          3.5.1     2018-07-05 local
#>  htmltools                     0.3.6     2017-04-28 CRAN (R 3.5.0)
#>  IRanges                     * 2.15.17   2018-08-24 Bioconductor
#>  knitr                         1.20      2018-02-20 CRAN (R 3.5.0)
#>  lattice                       0.20-35   2017-03-25 CRAN (R 3.5.1)
#>  magrittr                      1.5       2014-11-22 CRAN (R 3.5.0)
#>  Matrix                        1.2-14    2018-04-13 CRAN (R 3.5.1)
#>  matrixStats                   0.54.0    2018-07-23 CRAN (R 3.5.1)
#>  memoise                       1.1.0     2017-04-21 CRAN (R 3.5.0)
#>  methods                     * 3.5.1     2018-07-05 local
#>  parallel                    * 3.5.1     2018-07-05 local
#>  Rcpp                          0.12.18   2018-07-23 CRAN (R 3.5.1)
#>  RCurl                         1.95-4.11 2018-07-15 CRAN (R 3.5.0)
#>  rmarkdown                     1.10      2018-06-11 CRAN (R 3.5.0)
#>  rprojroot                     1.3-2     2018-01-03 CRAN (R 3.5.0)
#>  Rsamtools                     1.33.5    2018-09-04 Bioconductor
#>  rtracklayer                 * 1.41.5    2018-08-31 Bioconductor
#>  S4Vectors                   * 0.19.19   2018-07-18 Bioconductor
#>  stats                       * 3.5.1     2018-07-05 local
#>  stats4                      * 3.5.1     2018-07-05 local
#>  stringi                       1.2.4     2018-07-20 CRAN (R 3.5.1)
#>  stringr                       1.3.1     2018-05-10 CRAN (R 3.5.0)
#>  SummarizedExperiment          1.11.6    2018-07-17 Bioconductor
#>  tools                         3.5.1     2018-07-05 local
#>  utils                       * 3.5.1     2018-07-05 local
#>  withr                         2.1.2     2018-03-15 CRAN (R 3.5.0)
#>  XML                           3.98-1.16 2018-08-19 CRAN (R 3.5.0)
#>  XVector                     * 0.21.3    2018-06-23 Bioconductor
#>  yaml                          2.2.0     2018-07-25 CRAN (R 3.5.1)
#>  zlibbioc                      1.27.0    2018-05-01 Bioconductor
```

hpages commented 4 years ago

This is clarified in BSgenome 1.55.4 (see commit 9783e421). Sorry for letting this sit in a corner for so long.