Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
54 stars 16 forks source link

fix: Enforcing character set for `AAString` objects #97

Closed ahl27 closed 1 year ago

ahl27 commented 1 year ago

Adds a new codec for AAString objects to enforce the character set, resolving bugs #84 and #10

Screen Shot 2023-03-28 at 11 17 23

Lowercase characters are correctly converted to uppercase, and characters not in AA_ALPHABET (including multibyte characters) throw errors.

Let me know if the coding style looks okay, RStudio has been doing some funky stuff auto-reformatting all my tabs on any file I open (and ignoring my requests to use 4-length tabs!) but I think I've fixed it for future PRs.

ahl27 commented 1 year ago

Also has some other fixes for various minor things:

ahl27 commented 1 year ago

I'm going to modify the codec values so allow for bit-wise comparisons so we can also support fixed pattern matching as requested in #34. Only 4 ambiguity codes exist in the set, so it should be possible to do in less than 16 bits.

edit: this doesn't seem to be the case. Trying to think of a solution that would support bitwise comparisons as in DNAString or RNAString. The current implementation of buildLookupTable has to allocate a decode vector of length equal to the maximum value in the set...unfortunately this means that if we store each value as a separate power of two, we'll need a vector of size 2^25, which causes a bunch of issues.

Doing this in fewer bits is possible, but tricky. A bytewise match scheme means that we would need values such that any non-ambiguous code has an overlap of zero. Maybe there's a trick that could be done to transform lower dimensional space into higher prior to comparison in matchPattern, like a 5-to-25 demultiplexer implemented in code.

This runs into an additional problem, that the lookup table is only 256x256 bits. This means we can't use larger than 8 bit numbers without completely overhauling the lookup tables (which I'm very hesitant to do). The simplest solution may be to just create a fifth lookup table in lowlevel_matching.c such that all entries are FALSE except for the diagonal, the row/column corresponding to X (currently 26), and then the 4 ambiguity codes ((3,4), (6,7), (10,11) and reverse).

ahl27 commented 1 year ago

Update on the above--this doesn't seem to be as trivial as I had initially expected; I'll work on it for a separate PR. Including it with this would be too much for a single submission. The current codec values are fine, if they need to be changed it can be done in the future.

hpages commented 1 year ago

Thanks @ahl27

The question is: do we really want to encode the letters of an AAString object at construction time? What's the benefit? It seems to me that we could simply use the codec facilities to check that the input letters belong to AA_ALPHABET or tolower(AA_ALPHABET), and to turn the lowercase ones to uppercase, but not to actually encode anything. By doing this we wouldn't need to decode anything either, so we don't break existing serialized AAString or AAStringSet objects (trying to decode them would be problematic).

Also I suspect that there are many places in Biostrings code (and probably in packages that depend on Biostrings) that just assume no internal ecoding of the AA letters, so changing this would probably mean breaking/fixing a lot of code. For example:

AAString("AURRVX") == BString("AURRVX")
# [1] FALSE  # should be TRUE!

Unfortunately, a major problem for embarking into serious refactoring of the package is that it totally lacks good unit tests. I'm not proud of that :disappointed: That's actually something that should be put near the top of the list of things to improve.

Other than that, please add _AAencode/_AAdecode to the Biostrings C interface like for _DNAencode/_DNAdecode(but maybe _AAdecode should be removed if we don't decode). See Biostrings_interface.h and _Biostrings_stubs.c in Biostrings/inst/include/ for the details. Stricly speaking, registration via REGISTER_CCALLABLE() is enough to expose these C functions and make them callable from other packages. However this is very unsafe since the compiler has no way to check that the function is properly called (i.e. with correct nb of arguments, correct argument types, and correct returned value). The extra work in Biostrings/inst/include/ is meant to address that by providing prototypes for the registered functions.

Thanks again!

ahl27 commented 1 year ago

That's a fair point, I hadn't thought about that. Would it be sufficient to just "encode" the values as their ASCII equivalents? ex. encode A=65, C=67, ...? I feel like then existing strings should already be in the correct format on the backend, but I'm still not 100% familiar with the inner workings of the codecs 😅. BStrings seem to just be stored as raw values from 0:255, so that should fix the issue mentioned.

I suppose the benefit of encode/decode would be consistency with DNA/RNA, but I'm realizing that the main advantage of the RNA/DNA schema is that you get the benefit of bitwise-comparisons for ambiguity codes.

Re: C code, I'll make those adjustments tomorrow! Thanks for the explanation, that definitely makes sense.

I also have a fix for #34 that I can PR after this feature is completed.

ahl27 commented 1 year ago

One consideration is using encode/decode may make it easier to resolve #93 by just using an alternate encode function to map the input into the standard AAString format, although there’d be a multicharacter issue so I’m not sure.

hpages commented 1 year ago

Would it be sufficient to just "encode" the values as their ASCII equivalents? ex. encode A=65, C=67, ...?

Yes, that's what I meant by not encoding the incoming letters. The codec would be something like this:

AAcodes <- function(baseOnly)
{
    if (!isTRUEorFALSE(baseOnly))
        stop("'baseOnly' must be TRUE or FALSE")
    letters <- AA_ALPHABET
    if (baseOnly)
        letters <- head(letters, n=20L)
    setNames(.letterAsByteVal(letters), letters)
}

.XStringCodec.AA <- function()
{
    codes <- AAcodes(FALSE)
    letters <- names(codes)
    extra_letters <- setdiff(tolower(letters), letters)
    extra_codes <- .letterAsByteVal(toupper(extra_letters))
    new("XStringCodec", letters, codes, extra_letters, extra_codes)
}

AA_STRING_CODEC <- .XStringCodec.AA()

I'm realizing that the main advantage of the RNA/DNA schema is that you get the benefit of bitwise-comparisons for ambiguity codes.

That's for sure one of them.

> Biostrings:::DNAcodes(TRUE)
A C G T 
1 2 4 8 
> Biostrings:::RNAcodes(TRUE)
A C G U 
1 2 4 8 

Another reason for encoding RNA/DNA sequences was to be able to switch back and forth between RNA and DNA without having to re-encode anything. Combined with the fact that Biostrings objects use pointers to share their sequence data and you get a coercion between RNA and DNA that doesn't need to copy the sequence data at all. This means that it's virtually costless whatever the size of the object. Merely for the fun of it though, as I don't know of any application that takes benefit of that!

Finally, I vaguely remember that maybe this encoding scheme was also somewhat driven by the requirements of the shift-or algorithm (one of the algos supported by matchPattern()), but I'm not sure. This was 17 years ago!

Re: C code, I'll make those adjustments tomorrow!

Excellent, thanks again!

BTW I'll be offline starting tomorrow (Wed) until the end of the week, chaperoning for a 3-day field trip to a remote area with no internet and no cell reception. So I won't be able to look again at this until next week.

ahl27 commented 1 year ago

Ah okay, all that makes sense! This is teaching me a lot haha, I appreciate the detailed explanation.

I’ll make that change and then keep working on some of the other open issues; no worries on responding, enjoy the field trip!

ahl27 commented 1 year ago

Fixes are pushed for these!

AAString("AURRVX") == BString("AURRVX")
# Now evaluates to TRUE

I've left in the decode logic, I felt like it made more sense to have both for consistency in the code and I was having trouble figuring out how to selectively disable AADecode_. There's probably a really simple way to do it that I'm just missing.

This is not completely backwards compatible, the following breaks:

x <- BString("abcd")
class(x) <- "AAString"
x
# Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
#  'x' contains an invalid byte (97 = char 'a') at position 1

Obviously users shouldn't be manually changing the classes, but this would be an issue if someone had previously saved an AAString with invalid letters, and then attempted to load it. I'm not sure how common this scenario is...on the one hand I'd prefer not to break older saved objects, but on the other I feel like it would be good to completely enforce the AA_ALPHABET. Users writing packages that make use of Biostrings could be confused by the presence of just AAencode_ but not AAdecode_.

Older saved objects can always be forward converted with a small helper function like:

fix_old_aastring <- function(aa){
  class(aa) <- 'BString'
  AAString(aa)
}

I'm not sure, let me know what you'd like to see for the package.

hpages commented 1 year ago

Thanks @ahl27

I've left in the decode logic,

Sounds good.

on the one hand I'd prefer not to break older saved objects

We're going to take that risk, hoping that not too many saved AAString/AAStringSet objects contain letters not in AA_ALPHABET. Your little fix_old_aastring() helper will be handy to fix objects that contain valid lower case letters, but serialized objects that contain anything not in union(AA_ALPHABET, tolower(AA_ALPHABET)) won't be fixable.

BTW Bioconductor defines the updateObject() generic function for fixing old saved objects, with methods defined for many objects in many packages. We should probably add updateObject() methods for AAString, AAStringSet, and AAStringSetList objects. They would just do what fix_old_aastring() does, but possibly with an informative error if an object cannot be fixed (the error produced by fix_old_aastring is a little too dry). The best we can do is have the error message explain the situation and apologize.

We should probably have alphabetFrequency() for AAString/AAStringSet objects adopt the model used for DNAString/DNAStringSet objects, that is, it should support baseOnly and add the other column to the result ony when baseOnly=TRUE:

> dna <- DNAStringSet(c("AAA", "NAGGATND"))

> alphabetFrequency(dna)
     A C G T M R W S Y K V H D B N - + .
[1,] 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 2 0 2 1 0 0 0 0 0 0 0 0 1 0 2 0 0 0

> alphabetFrequency(dna, baseOnly=TRUE)
     A C G T other
[1,] 3 0 0 0     0
[2,] 2 0 2 1     3

> aa <- AAStringSet(c("VVVVRVVVVAAV", "RVUUAS"))

> alphabetFrequency(aa)
     A R N D C Q E G H I L K M F P S T W Y V U O B J Z X * - + . other
[1,] 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0     0
[2,] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 2 0 0 0 0 0 0 0 0 0     0

> alphabetFrequency(aa, baseOnly=TRUE)
Error in .local(x, as.prob, ...) : unused argument (baseOnly = TRUE)

BTW I'm hesitant to merge this PR for the 3.17 release, timing is not ideal. Hope you don't mind if we aim for the beginning of the BioC 3.18 devel cycle for the merge.

Thanks again.

ahl27 commented 1 year ago

Thanks for the feedback! That all makes sense.

I'll add in updated methods for alphabetFrequency, and see if I can get the updateObject method working.

I'm hesitant to merge this PR for the 3.17 release, timing is not ideal. Hope you don't mind if we aim for the beginning of the BioC 3.18 devel cycle for the merge.

Absolutely, sounds great!

ahl27 commented 1 year ago

This has been updated with the requested changes, as well as a small adjustment to the corresponding man page.

alphabetFrequency

> aa <- AAStringSet(c("VVVVRVVVVAAV", "RVUUAS"))

> alphabetFrequency(aa)
     A R N D C Q E G H I L K M F P S T W Y V U O B J Z X * - + . other
[1,] 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0     0
[2,] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 2 0 0 0 0 0 0 0 0 0     0

> alphabetFrequency(aa, baseOnly=TRUE)
     A R N D C Q E G H I L K M F P S T W Y V other
[1,] 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9     0
[2,] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1     2

hasOnlyBaseLetters

> aa <- AAStringSet(c("VVVVRVVVVAAV", "RVUUAS"))

> hasOnlyBaseLetters(aa[[1]])
TRUE

> hasOnlyBaseLetters(aa[[1]])
FALSE

> hasOnlyBaseLetters(aa)
FALSE

updateObject

Valid input:

> ex <- BString("aabbccdd")
> class(ex) <- "AAString"
> ex
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  'x' contains an invalid byte (97 = char 'a') at position 1

> updateObject(ex)
8-letter AAString object
seq: AABBCCDD

> updateObject(AAStringSet(ex))
AAStringSet object of length 1:
    width seq
[1]     8 AABBCCDD

> updateObject(AAStringSetList(AAStringSet(ex)))
AAStringSetList of length 1
[[1]] AABBCCDD

Invalid input:

> ex <- BString("¢£")
> class(ex) <- "AAString"
> ex
2-letter AAString object
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  'x' contains an invalid byte (-62 = char '�') at position 1

> updateObject(ex)
Error in updateObject(ex) : 
  Cannot decode, AAString contains invalid character(s): ¢, Â

I'm having issues with displaying the correct characters--I'm guessing this has something to do with how characters are stored internally. However, the error message is still a lot more informative than previously.

hpages commented 1 year ago

Thanks. I don't think alphabetFrequency(aa, baseOnly=FALSE) should return the other column, in the same manner that alphabetFrequency(dna, baseOnly=FALSE) doesn't return it either. This column only makes sense when baseOnly=TRUE. Now that the AA_ALPHABET is enforced, this column is guaranteed to contain zeros when baseOnly=FALSE.

ahl27 commented 1 year ago

Oops, thanks for the catch. I'm actually not sure how that ended up in the output--the current version is working as you mentioned:

> aa <- AAStringSet(c("VVVVRVVVVAAV", "RVUUAS"))
> alphabetFrequency(aa, baseOnly=FALSE)
     A R N D C Q E G H I L K M F P S T W Y V U O B J Z X * - + .
[1,] 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 2 0 0 0 0 0 0 0 0 0

I'm not sure how I managed to get the other column in the previous example, but it seems to be resolved in the current version.

hpages commented 1 year ago

@ahl27 I went ahead and corrected Erik's name spelling in commit 2b139b976ce13211b878cafb87e997209075a778. This introduces a conflict with this PR that you should be able to resolve by re-syncing your fork, merging the devel branch into your local AAStringChars branch, and then push.

ahl27 commented 1 year ago

Fixed! Thanks, Erik will be happy to hear that haha

hpages commented 1 year ago

Back to this after a long pause. Sorry for the slow response.

Here's some feedback about the 3 new updateObject() methods:

Thanks, H.

ahl27 commented 1 year ago

Ah okay, thanks for all the feedback! Very helpful for getting familiar with the underlying Biostrings code; I'll make these changes first thing tomorrow.

ahl27 commented 1 year ago

Just updated the relevant functions. I also kept the .updateObject_XStringSet(...) function you wrote as a separate function from updateObject.AAStringSet just in case it's needed for other updateObject functions in the future. If you'd prefer, I can fold it into the updateObject method for AAStringSet.

Testing:

AAString update

Screen Shot 2023-06-28 at 12 23 14

Screen Shot 2023-06-28 at 12 23 38

AAStringSet update

Screen Shot 2023-06-28 at 12 24 23 I don't think this is the most efficient way to generate a bunch of random AAString objects in an AAStringSet, but it gets the job done for a one-shot test. Average runtime for this version is around 1 second, compared to around 4.5 seconds for the previous version.

AAStringSetList update

Screen Shot 2023-06-28 at 12 32 03

hpages commented 1 year ago

All good. Thanks!