Open ahl27 opened 3 months ago
Illustration of current functionality in this PR:
> ## going to have to improve this workflow, but for now we're stuck with as(as(as.raw(...), "XRaw"), "BString")
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
>
> ## Here's the misleading error
> x
256-letter BString object
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width, :
embedded nul in string: '\0\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#'
>
> ## note that the backend values are still correct
> as.integer(x[1:10])
[1] 0 1 2 3 4 5 6 7 8 9
>
> ## New functionality, activated by setting this option to TRUE
> options(Biostrings.showRaw=TRUE)
> x
256-letter BString object
seq: ⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !"#...⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
> as.integer(x[1:10])
[1] 0 1 2 3 4 5 6 7 8 9
> as.character(x)
[1] "⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⡿⢀⢁⢂⢃⢄⢅⢆⢇⢈⢉⢊⢋⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙⢚⢛⢜⢝⢞⢟⢠⢡⢢⢣⢤⢥⢦⢧⢨⢩⢪⢫⢬⢭⢮⢯⢰⢱⢲⢳⢴⢵⢶⢷⢸⢹⢺⢻⢼⢽⢾⢿⣀⣁⣂⣃⣄⣅⣆⣇⣈⣉⣊⣋⣌⣍⣎⣏⣐⣑⣒⣓⣔⣕⣖⣗⣘⣙⣚⣛⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿"
>
> ## Consistency with other objects, showing expected behavior on a DNAString
> y <- DNAString(paste(DNA_ALPHABET, collapse=''))
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=TRUE)
[1] "ACTMR"
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=FALSE)
[1] "AC" "TMR"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=TRUE)
[1] "ATY"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=FALSE)
[1] "A" "T" "Y"
>
> ## Now for BString
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=TRUE)
[1] "⣿⡠⡡⡢⡣CDEFGHIJKL⡿⢀⢁⢂⢃"
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=FALSE)
[1] "⣿⡠⡡⡢⡣" "CDEFGHIJKL" "⡿⢀⢁⢂⢃"
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=TRUE)
[1] "⣿⡢⡧C⡿"
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=FALSE)
[1] "⣿" "⡢" "⡧" "C" "⡿"
Invalid characters are displayed differently depending on machine support:
?
. XVector:::extract_character_from_XRaw_by_positions or XVector:::extract_character_from_XRaw_by_ranges to be able to identify null bytes.�
. While this doesn't allow distinguishing undisplayable characters from one another, it does allow distinguishing all displayable characters from all undisplayable ones.
Why do we care about which characters we can distinguish? These methods support conversions like as.character. Ideally, people would use built-in methods for comparing strings, but some may want to work with character strings directly. In the current setup, character conversions of BString objects aren't great: undisplayable values map to multibyte characters, which can cause lots of errors when trying to work with them as strings. Having an as.character conversion between undisplayable bytes and parseable characters lets us work with them as strings, even if that isn't quite as clean as using the Biostrings toolbox directly.
Another avenue to consider is to instead use something like as.raw to directly convert the values of the BString object and work with them that way. For instance, the following would also work:
> ## BSTRING_RAW_LOOKUP defined in zzz.R
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
> test <- as.integer(as.raw(x))
> test[test==0] <- 256L
> paste(BSTRING_RAW_LOOKUP[test], collapse='')
We could use XVector:::subseq to pull out subranges of the object, convert to raw, and then map them to characters manually.
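The manual mapping described above can be sketched in base R, without Biostrings. The table here is a hypothetical stand-in, not the BSTRING_RAW_LOOKUP defined in zzz.R: it assumes printable ASCII (32 to 126) is shown as-is and every other byte value b is shown as the Braille code point U+2800 + b. The names display_lookup and render_bytes are illustrative only.

```r
# Hypothetical lookup table (an assumption, not the PR's BSTRING_RAW_LOOKUP).
# Entry i holds the display character for byte value i - 1.
byte_values <- 0:255
is_printable <- byte_values >= 32L & byte_values <= 126L

# Printable ASCII keeps its own code point; everything else is shifted
# into the Braille Patterns block (U+2800..U+28FF), one glyph per byte.
code_points <- ifelse(is_printable, byte_values, 0x2800L + byte_values)
display_lookup <- intToUtf8(code_points, multiple = TRUE)

# Map a raw vector to a single displayable string.
# Indexing with value + 1 avoids special-casing the null byte.
render_bytes <- function(bytes) {
  stopifnot(is.raw(bytes))
  paste(display_lookup[as.integer(bytes) + 1L], collapse = "")
}

rendered <- render_bytes(as.raw(0:255))
nchar(rendered)                    # one character per input byte
render_bytes(charToRaw("ACGT"))    # printable bytes pass through unchanged
```

Because every byte gets its own glyph in this sketch, the mapping stays reversible. One drawback is that U+2800 (used here for byte 0) is the blank Braille pattern, so a real table would probably pick a more visible glyph for the null byte.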
There's a bit of a rabbit hole we could fall into with this fix. Ideally, we just fix the show method to display non-displayable characters correctly and move on. I think there's a (valid) argument to be made that XString objects are intended for strings, not for arbitrary input. Thus, does it make sense to add a lot of additional functionality for objects that aren't strings? Would it not make more sense to encourage usage of XRaw if people are using lots of raw vectors?
THIS IS A WIP
Early stage work-in-progress solution for discussion in #65.
Goals:
1. [...] BString object contains a null byte (as.raw(0))
2. [...] BString and raw/integer vectors
The issue in (1) is that a BString containing a null byte throws an error whenever R tries to display the position(s) containing the null byte. This leads users to think that the object itself is corrupted, when in fact it's purely an error with R attempting to display it on the console. as.integer() and other methods work properly on these objects.
This issue would be more obvious if more people used BString objects with values in the complete 0:255 range, but building a BString from bytes is currently very difficult. Helper functions to create BStrings from raw vectors would be preferable.
This PR will be blocked from merging until it's actually finalized; it is currently displayed to organize thoughts and reprexes.
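The failure mode behind goal (1) can be reproduced in base R alone, which may help when discussing it. R character strings cannot hold a null byte, so converting raw data that contains one fails, while the raw values themselves stay intact. This is a minimal sketch of the underlying limitation, not of the Biostrings or XVector code paths.

```r
bytes <- as.raw(c(65, 0, 66))   # "A", a null byte, "B"

# Converting to a character string fails because of the embedded nul.
msg <- tryCatch(
  rawToChar(bytes),
  error = function(e) conditionMessage(e)
)
msg

# The underlying values are untouched and still convert cleanly.
as.integer(bytes)

# Dropping (or remapping) the null byte before conversion succeeds.
rawToChar(bytes[bytes != as.raw(0)])
```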