Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

changes to make interaction with the Biostrings classes more easy as … #22

Closed FelixErnst closed 5 years ago

FelixErnst commented 5 years ago

…proposed during review of the Modstrings package. This includes following changes: xscodes generic and default function added, use of the normal Views() constructor instead of the previously used Biostrings internal constructor function for the match(LR)Pattern function.

This is redo of PR #21, in which I made a mistake updating the fork during development. This lead to conflicts during merging and messed up history.

FelixErnst commented 5 years ago

@hpages new PR

FelixErnst commented 5 years ago

I updated the Modstrings package and everything works fine. So this PR could be merged into master.

FelixErnst commented 5 years ago

An additional comment:

I noticed during checking the package on two different systems, that the latest IRanges version is required for Biostrings 2.51.4. The Bioc 3.8 version of IRanges reports an error during creation of a Mask using Mask(). The did not happen until latest two commits and the changes in the S4Vectors package. I am not quite sure about the reason.

hpages commented 5 years ago

OK I've merged this. Thanks for sending this PR.

After the merge, I had to tweak the width() method for character a little bit (otherwise things like SolexaQuality(99L) would fail for me): https://github.com/Bioconductor/Biostrings/commit/85b7b1fce1979dc1b0924ba96b1734647d14fcf9

I'm not clear about this error about Mask creation that you are describing above. What error message do you get exactly? How can I reproduce it? sessionInfo()? Thanks!

FelixErnst commented 5 years ago

Thanks for the merge.

Here is the info about the error. I think it is not really an error, but a dependency to a certain IRanges version. Installing IRanges from GitHub resolves the issue.

library(IRanges)
Mask(mask.width=5, start=c(2), width=c(3))
#> Error in get(name, envir = asNamespace(pkg), inherits = FALSE): object 'stopIfProblems' not found
sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] IRanges_2.16.0      S4Vectors_0.21.13   BiocGenerics_0.28.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0      digest_0.6.18   magrittr_1.5    evaluate_0.13  
#>  [5] highr_0.7       stringi_1.3.1   rmarkdown_1.11  tools_3.5.2    
#>  [9] stringr_1.4.0   xfun_0.5        yaml_2.2.0      compiler_3.5.2 
#> [13] htmltools_0.3.6 knitr_1.21
FelixErnst commented 5 years ago

Regarding 85b7b1f: On which system did you test this? Linux or Windows? During development of Modstrings I had a particular situation, in which Linux/MacOS and Windows behave differently due to UTF8 encoding capabilities or more precise the behavior towards UTF8 encoded characters.

(edit: I didn't get an error on Linux with the R dev version? Did I miss something or was it not included in the tests?, edit2: I did not run R CMD check, just the .test() 🤕 It broke for me as well, but I didn't catch it. Sorry for that)

The example you included in the comment above width() works for me on Windows...

x <- rawToChar(as.raw(135L))
nchar(x, type="chars")
#> [1] 1
sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.0      stringi_1.3.1   rmarkdown_1.11 
#>  [9] highr_0.7       knitr_1.21      stringr_1.4.0   xfun_0.5       
#> [13] digest_0.6.18   evaluate_0.13

... and fails on Linux. (dev version aside)

> x <- rawToChar(as.raw(135L))
> nchar(x, type="chars")
Error in nchar(x, type = "chars") : invalid multibyte string, element 1
> sessionInfo()
R Under development (unstable) (2018-12-22 r75884)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reprex_0.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0      ps_1.3.0        withr_2.1.2     digest_0.6.18   R6_2.4.0        evaluate_0.13   rlang_0.3.1    
 [8] fs_1.2.6        callr_3.1.1     whisker_0.3-2   rmarkdown_1.11  tools_3.6.0     xfun_0.5        yaml_2.2.0     
[15] compiler_3.6.0  processx_3.2.1  clipr_0.5.0     htmltools_0.3.6 knitr_1.22  

Just fyi.

hpages commented 5 years ago

nchar(rawToChar(as.raw(135L)), type="chars") will fail or not depending on the current value of LC_CTYPE. I only tried this on my Linux machine where I have LC_CTYPE set to en_US.UTF-8 (like on your Linux box) and it failed. On your Windows machine you have LC_CTYPE set to German_Germany.1252, which seems to be OK. The purpose of setting LC_CTYPE to C is to temporarily overwrite its current value to make sure that nchar(rawToChar(as.raw(byte)), type="chars") will succeed for any value of byte between 0 and 255.

BTW I'm not sure that also setting LC_COLLATE to C helps in that particular situation so I might remove this.