fix: Issue #65

Illustration of current functionality in this PR:

> ## going to have to improve this workflow, but for now we're stuck with as(as(as.raw(...), "XRaw"), "BString")
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
>
> ## Here's the misleading error
> x
256-letter BString object
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  embedded nul in string: '\0\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#'
>
> ## note that the backend values are still correct
> as.integer(x[1:10])
 [1] 0 1 2 3 4 5 6 7 8 9
>
> ## New functionality, activated by setting this option to TRUE
> options(Biostrings.showRaw=TRUE)
> x
256-letter BString object
seq: ⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !"#...⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
> as.integer(x[1:10])
 [1] 0 1 2 3 4 5 6 7 8 9
> as.character(x)
[1] "⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⡿⢀⢁⢂⢃⢄⢅⢆⢇⢈⢉⢊⢋⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙⢚⢛⢜⢝⢞⢟⢠⢡⢢⢣⢤⢥⢦⢧⢨⢩⢪⢫⢬⢭⢮⢯⢰⢱⢲⢳⢴⢵⢶⢷⢸⢹⢺⢻⢼⢽⢾⢿⣀⣁⣂⣃⣄⣅⣆⣇⣈⣉⣊⣋⣌⣍⣎⣏⣐⣑⣒⣓⣔⣕⣖⣗⣘⣙⣚⣛⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿"
> 
> ## Consistency with other objects, showing expected behavior on a DNAString
> y <- DNAString(paste(DNA_ALPHABET, collapse=''))
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=TRUE)
[1] "ACTMR"
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=FALSE)
[1] "AC"  "TMR"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=TRUE)
[1] "ATY"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=FALSE)
[1] "A" "T" "Y"
> 
> ## Now for BString
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=TRUE)
[1] "⣿⡠⡡⡢⡣CDEFGHIJKL⡿⢀⢁⢂⢃"
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=FALSE)
[1] "⣿⡠⡡⡢⡣"      "CDEFGHIJKL" "⡿⢀⢁⢂⢃"     
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=TRUE)
[1] "⣿⡢⡧C⡿"
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=FALSE)
[1] "⣿" "⡢" "⡧" "C" "⡿"
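
The error in the first transcript uses the same wording base R gives when a raw vector containing a 0 byte is converted to a string. A minimal base-R illustration follows; it is not the XVector code path itself, which is only assumed here to hit a similar restriction:

```r
# Base R only; no Biostrings or XVector needed.
bytes <- as.raw(c(65, 0, 66))  # "A", a nul byte, "B"

# rawToChar() refuses to build a string with an embedded nul,
# so capture the error instead of letting it stop the session.
res <- tryCatch(rawToChar(bytes), error = function(e) e)
is_error <- inherits(res, "error")

# The underlying values are still intact and can be read as integers.
vals <- as.integer(bytes)
```

The raw values remain readable even though the string conversion fails, which matches what as.integer(x[1:10]) shows above.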

Undisplayable characters are shown differently depending on what the machine supports:

By default, all undisplayable characters map to ?.
If Unicode is supported, the Braille character set is used to display values. Each byte value in 1:255 maps to a unique character. Note that 0 maps to the same character as 255 due to machine restrictions, so the two cannot be told apart in the output. This could be solved, but identifying null bytes would require an additional call to XVector:::extract_character_from_XRaw_by_positions or XVector:::extract_character_from_XRaw_by_ranges.
If Unicode is not supported but multibyte characters are, all undisplayable characters map to �. This doesn't distinguish undisplayable characters from one another, but it does distinguish every displayable character from every undisplayable one.
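
The fallback order above could be sketched roughly as follows. display_bytes and its utf8/mbcs arguments are hypothetical names, not part of this PR, and the Braille mapping used here (U+2800 plus the byte value) is an assumption for illustration, not the lookup table the PR actually uses:

```r
# Hypothetical helper, not part of Biostrings: picks a display character
# for each byte depending on what the current locale supports.
display_bytes <- function(bytes,
                          utf8 = isTRUE(l10n_info()[["UTF-8"]]),
                          mbcs = isTRUE(l10n_info()[["MBCS"]])) {
  b <- as.integer(bytes)
  printable <- b >= 32L & b <= 126L
  out <- character(length(b))

  # Printable ASCII bytes are shown as themselves.
  ascii <- strsplit(rawToChar(as.raw(32:126)), "")[[1]]
  out[printable] <- ascii[b[printable] - 31L]

  hidden <- b[!printable]
  out[!printable] <- if (utf8) {
    # Assumed mapping: one Braille code point per byte value.
    # Byte 0 is folded onto 255, mirroring the limitation described above.
    hidden[hidden == 0L] <- 255L
    intToUtf8(0x2800L + hidden, multiple = TRUE)
  } else if (mbcs) {
    rep("\uFFFD", length(hidden))  # replacement character
  } else {
    rep("?", length(hidden))       # plain default
  }
  out
}

# Example call; the result depends on the session's locale.
shown <- paste(display_bytes(as.raw(c(72, 105, 7))), collapse = "")
```

Passing utf8 and mbcs explicitly makes each branch easy to exercise without changing the session locale.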

Why do we care about which characters we can distinguish? These methods support conversions like as.character. Ideally, people would use the built-in methods for comparing strings, but some will want to work with character strings directly. In the current setup, character conversions of BString objects aren't great: undisplayable values map to multibyte characters, which can cause lots of errors when trying to work with them as strings. Having an as.character conversion between undisplayable bytes and parseable characters lets us work with them as strings, even if that isn't quite as clean as using the Biostrings toolbox directly.

Another avenue to consider is to use something like as.raw to convert the values of the BString object directly and work with the raw values instead. For instance, the following would also work:

> ## BSTRING_RAW_LOOKUP defined in zzz.R
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
> test <- as.integer(as.raw(x))
> test[test==0] <- 256L
> paste(BSTRING_RAW_LOOKUP[test], collapse='')

We could use XVector:::subseq to pull out subranges of the object, convert to raw, and then map them to characters manually.
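
A base-R sketch of that raw-then-lookup idea, with plain indexing on a raw vector standing in for subseq. make_lookup is a hypothetical stand-in for BSTRING_RAW_LOOKUP (its real contents live in zzz.R and are not reproduced here), and raw_ranges_to_character is likewise an illustrative name:

```r
# Hypothetical stand-in for BSTRING_RAW_LOOKUP: 256 entries, where
# entry i displays byte value i and entry 256 is reserved for byte 0.
make_lookup <- function() {
  lookup <- rep("?", 256L)
  lookup[32:126] <- strsplit(rawToChar(as.raw(32:126)), "")[[1]]
  lookup
}

# Extract ranges from a raw vector and map each byte through the lookup.
raw_ranges_to_character <- function(bytes, start, width,
                                    lookup = make_lookup(),
                                    collapse = FALSE) {
  stopifnot(is.raw(bytes), length(start) == length(width))
  pieces <- mapply(function(s, w) {
    idx <- as.integer(bytes[seq.int(s, length.out = w)])
    idx[idx == 0L] <- 256L  # byte 0 cannot index a vector, so use slot 256
    paste(lookup[idx], collapse = "")
  }, start, width, USE.NAMES = FALSE)
  if (collapse) paste(pieces, collapse = "") else pieces
}

bytes <- as.raw(0:255)
parts <- raw_ranges_to_character(bytes, start = c(1L, 68L), width = c(5L, 10L))
joined <- raw_ranges_to_character(bytes, start = c(1L, 68L), width = c(5L, 10L),
                                  collapse = TRUE)
```

Because the conversion never builds a string from the raw bytes themselves, null bytes stay identifiable until the final lookup step.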

There's a bit of a rabbit hole we could fall into with this fix. Ideally, we just fix the show method to display non-displayable characters correctly and move on. I think there's a (valid) argument to be made that XString objects are intended for strings, not for arbitrary input. Thus, does it make sense to add a lot of additional functionality for objects that aren't strings? Would it not make more sense to encourage usage of XRaw if people are using lots of raw vectors?
