Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
54 stars 16 forks source link

fix: Issue #65 #112

Open ahl27 opened 3 months ago

ahl27 commented 3 months ago

THIS IS A WIP

Early stage work-in-progress solution for discussion in #65.

Goals:

  1. solve misleading "error" when BString object contains a null byte (as.raw(0))
  2. improve UX for converting between BString and raw/integer vectors

The issue in (1) is that a BString containing a null byte throws an error whenever R tries to display the position(s) containins the null byte. This leads users to think that the object itself is corrupted, when in fact it's purely an error with R attempting to display it on the console. as.integer() and other methods work properly on these objects.

This issue would be more obvious if more people used BString objects with values in the complete 0:255 range, but building a BString from bytes is currently very difficult. Helper functions to create BStrings from raw vectors would be preferable.

This PR will be blocked from merging until it's actually finalized; currently displaying to organize thoughts and reprexes.

ahl27 commented 3 months ago

Illustration of current functionality in this PR:

> ## going to have to improve this workflow, but for now we're stuck with as(as(as.raw(...), "XRaw"), "BString")
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
>
> ## Here's the misleading error
> x
256-letter BString object
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  embedded nul in string: '\0\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#'
>
> ## note that the backend values are still correct
> as.integer(x[1:10])
 [1] 0 1 2 3 4 5 6 7 8 9
>
> ## New functionality, activated by setting this option to TRUE
> options(Biostrings.showRaw=TRUE)
> x
256-letter BString object
seq: ⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !"#...⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
> as.integer(x[1:10])
 [1] 0 1 2 3 4 5 6 7 8 9
> as.character(x)
[1] "⣿⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾ !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⡿⢀⢁⢂⢃⢄⢅⢆⢇⢈⢉⢊⢋⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙⢚⢛⢜⢝⢞⢟⢠⢡⢢⢣⢤⢥⢦⢧⢨⢩⢪⢫⢬⢭⢮⢯⢰⢱⢲⢳⢴⢵⢶⢷⢸⢹⢺⢻⢼⢽⢾⢿⣀⣁⣂⣃⣄⣅⣆⣇⣈⣉⣊⣋⣌⣍⣎⣏⣐⣑⣒⣓⣔⣕⣖⣗⣘⣙⣚⣛⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿"
> 
> ## Consistency with other objects, showing expected behavior on a DNAString
> y <- DNAString(paste(DNA_ALPHABET, collapse=''))
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=TRUE)
[1] "ACTMR"
> extract_character_from_XString_by_ranges(y, start=c(1L, 4L), width=c(2L,3L), collapse=FALSE)
[1] "AC"  "TMR"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=TRUE)
[1] "ATY"
> extract_character_from_XString_by_positions(y, c(1L, 4L, 9L), collapse=FALSE)
[1] "A" "T" "Y"
> 
> ## Now for BString
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=TRUE)
[1] "⣿⡠⡡⡢⡣CDEFGHIJKL⡿⢀⢁⢂⢃"
> extract_character_from_XString_by_ranges(x, start=c(1L, 68L, 128L), width=c(5L,10L,5L), collapse=FALSE)
[1] "⣿⡠⡡⡢⡣"      "CDEFGHIJKL" "⡿⢀⢁⢂⢃"     
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=TRUE)
[1] "⣿⡢⡧C⡿"
> extract_character_from_XString_by_positions(x, c(1L, 4L, 9L, 68L, 128L), collapse=FALSE)
[1] "⣿" "⡢" "⡧" "C" "⡿"

Invalid characters are displayed differently depending on machine support:

Why do we care about which characters we can distinguish? These methods support conversions like as.character. Ideally, people would use built-in methods for comparing strings, but people may want to work with character strings directly. In the current setup, character conversions of BString objects aren't great--undisplayable values map to multibyte characters, which can cause lots of errors when trying to work with them as strings. Having a as.character conversion between undisplayable bytes and parseable characters lets us work with them as strings, even if that isn't quite as clean as using the Biostrings toolbox directly.


Another avenue to consider is to instead use something like as.raw to directly convert the values of the BString object and work with them that way. For instance, the following would also work:

> ## BSTRING_RAW_LOOKUP defined in zzz.R
> x <- as(as(as.raw(0:255),"XRaw"),"BString")
> test <- as.integer(as.raw(x))
> test[test==0] <- 256L
> paste(BSTRING_RAW_LOOKUP[test], collapse='')

We could use XVector:::subseq to pull out subranges of the object, convert to raw, and then map them to characters manually.

There's a bit of a rabbit hole we could fall into with this fix. Ideally, we just fix the show method to display non-displayable characters correctly and move on. I think there's a (valid) argument to be made that XString objects are intended for strings, not for arbitrary input. Thus, does it make sense to add a lot of additional functionality for objects that aren't strings? Would it not make more sense to encourage usage of XRaw if people are using lots of raw vectors?