Closed FelixErnst closed 7 months ago
Hi Felix,
Sorry for not responding earlier. You're right, the AAString()
constructor doesn't enforce the AA alphabet. This feature has been missing in Biostrings for many years and has been requested in a couple of occasions. It was not high on the priority list but it just got higher now and is on the list for the next devel cycle (BioC 3.8, will start right after the BioC 3.7 release).
A simple way to enforce a given alphabet without using any of the infrastructure defined in Biostrings is by doing something like this:
my_alphabet <- letters
all(safeExplode("abbabe") %in% my_alphabet)
# [1] TRUE
all(safeExplode("abbabeT") %in% my_alphabet)
# [1] FALSE
safeExplode(x)
is an internal utility defined in S4Vectors that is equivalent to strsplit(x, "")[[1]]
, just slightly faster:
input <- paste0(sample(letters, 25000000, replace=TRUE), collapse="")
nchar(input)
# [1] 25000000
all(safeExplode(input) %in% my_alphabet)
# [1] TRUE
all(safeExplode(input) %in% my_alphabet[-5])
# [1] FALSE
system.time(all(safeExplode(input) %in% my_alphabet))
# user system elapsed
# 1.031 0.144 1.176
However taking advantage of the Biostrings infrastructure will allow you to implement a much more efficient solution and to reduce memory footprint significantly. For example, the following function:
library(Biostrings)
from_string_to_XString <- function(x, Class, alphabet)
{
stopifnot(isSingleString(x),
isSingleString(Class),
is.character(alphabet), !anyNA(alphabet))
keys <- vals <- as.integer(charToRaw(paste0(alphabet, collapse="")))
lkup <- Biostrings:::buildLookupTable(keys, vals)
.Call2("new_XString_from_CHARACTER", Class, x, 1L, nchar(x), lkup,
PACKAGE="Biostrings")
}
is much faster:
system.time(x <- from_string_to_XString(input, "BString", my_alphabet))
# user system elapsed
# 0.085 0.000 0.086
x
# 25000000-letter "BString" instance
# seq: zewhitrddfgfmewnmklixkbezozahgpesivh...ozjwojpyuhtscmyyiailutjoxwtdxlppzvsn
and produces an informative error message on the first invalid input letter:
from_string_to_XString(input, "BString", my_alphabet[-5])
# Error in .Call2("new_XString_from_CHARACTER", Class, x, 1L, nchar(x), :
# key 101 (char 'e') not in lookup table
After you've defined your own XString extension, you can use the above function to quickly construct instances of this new class and enforce the alphabet:
setClass("ZString", contains="XString")
z <- from_string_to_XString(input, "ZString", my_alphabet)
Note that the "show"
method for XString objects doesn't work yet on ZString objects:
z
# 25000000-letter "ZString" instance
# Error in (function (classes, fdef, mtable) :
# unable to find an inherited method for function ‘seqtype’ for signature ‘"ZString"’
To make it work, you need to make as.character()
work on them, which can be done by first implementing coercion from ZString to BString:
setAs("ZString", "BString", function(from) {class(from) <- "BString"; from})
Then:
setMethod("as.character", "ZString", function(x) as.character(as(x, "BString")))
Now show()
works:
z
# 25000000-letter "ZString" instance
# seq: zewhitrddfgfmewnmklixkbezozahgpesivh...ozjwojpyuhtscmyyiailutjoxwtdxlppzvsn
Hope this helps, H.
Hi Herve,
Thanks for the examples. This is in part what I was looking for.
I looked at the funtions defined for the Xstring subclasses and I think I have all the things covered such as as.character()
.
Also, I will have a lookup at the behavior of from_string_to_XString
, but I think this will not work directly since buildLookupTable
expects an atomic vals vector which charToRaw(„=“)
for example does not produce. This examples is also reminiscent of the way the XStringCodec objects are created, isn't it? (via get_seqtype_conversion_lookup
)
I don’t know if the single integer is a prerequisite for the C backend, but with this check for an atomic vector in buildLookupTable
, only characters, which produces a single integer value using charToRaw
, can be used as part of an alphabet. The rest can not be used with the current setup.
This is actually what I am interested in, since the alphabets I want to use are a bit more complex. I also don't need to encode them via a lookup table necessarily, but the alphabet check would be nice.
I will test again via your examples and report back.
Felix
Hi Herve,
I am already stuck with the first example. I also have the greek letter "α" in my alphabet, which gets returned by safeExplode as "a".
my_alphabet
## [1] "a" "="
safeExplode(c("=====α"))
## [1] "=" "=" "=" "=" "=" "a"
This seems to me that I don't realize the full set of capabilities of R regarding string encoding and behavior. The Rmd HTML output also looks different to the console output (the first line it is shown as "α")
If the letter alpha trips me up, than I don't think it is wise to ask for features/changes in Biostrings at this point in time.
I think at this point in time I need to figure out the general R behavior before I can do anything useful with this. Sorry for taking your time regarding this case.
However, I hope the alphabet checking in general might be implemented in the next version. Maybe I can use it than. Thanks for all the great features in this and all the other packages.
Felix
Hi Felix,
α
is a multi-byte character:
> nchar("α", type="bytes")
[1] 2
> charToRaw("α")
[1] ce b1
> safeExplode("α")
[1] "\xce" "\xb1"
Biostrings only supports single-byte characters. When multi-byte characters are present in the input of the BString()
constructor, they are stored as 2 separate letters in the returned BString object:
> length(BString("ααα"))
[1] 6
H.
@hpages ,
I am running R version 3.5.0 with Bioconductor version '3.7'
I see above that this was slated to have been address in BioC 3.8
Did that actually happen? I see no mention of it in https://bioconductor.org/news/bioc_3_8_release/#news-from-new-and-existing-software-packages
Assuming it did not, I think this issue should be re-opened until it is addressed/fixed
Ultimately, at least the documentation should be fixed - It currently states unambiguously and incorrectly that "the AAString container can only store a string based on the Amino Acid alphabet"
I believed the documentation and coded something as though it were TRUE. The fact that it was not true led me on a wild goose chase. Recommend saving this pain for "the next guy".
In the mean time, looking for a best workaround, I looked at the proposed workaround from above but I don't see them as particularly helpful in my case as I am reading an readAAStringSet.
for finding the indices of AAStringSet members which violate AA_STANDARD, I'm doing this:
drop<-which(unlist(lapply(seq_along(pep),function(i) {! all(safeExplode(as.character(pep[[i]])) %in% AA_STANDARD )})))
Am I missing an optimization?
Sometimes with a hammer everything looks like a nail but
ok <- safeExplode(unlist(pep)) %in% AA_STANDARD
oklist <- relist(ok, pep)
gets a LogicalList with the same geometry as pep
; all(oklist)
tells which elements are bad pep[!oklist]
(or pep[!oklist][!all(oklist)]
) shows the bad letters
@malcook Sorry this didn't happen in BioC 3.8. I guess other priorities went in the way. I agree this issue should stay opened in the meantime.
FWIW an even more efficient solution is to use the from_string_to_XString()
function I provided earlier in this thread. Use it like this:
from_string_to_XString(input, "AAString", AA_ALPHABET)
But yes, this needs to go to Biostrings.
@mtmorgan - your hammer does not quite work in my hands because for starters the first assignment generates an error - you apparently cannot unlist an AAStringSet:
> pep
A AAStringSet instance of length 2
width seq
[1] 2 ML
[2] 10 [#&*$#&(*$
> ok <- safeExplode(unlist(pep)) %in% AA_STANDARD
Error in safeExplode(unlist(pep)) : 'x' must be a single string
however, we can as.character
it and then oklist winds up being the LogicalList you expected:
> ok <- safeExplode(as.character(unlist(pep))) %in% AA_STANDARD
> ok
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> oklist <- relist(ok, pep)
> oklist
LogicalList of length 2
[[1]] TRUE TRUE
[[2]] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
at which point, we can proceed as you suggest, and, as well, most usefully to me:
pep[oklist]
returns version of the AAStringList with bad letters simply removed (should that be desirable)
@malcook Note that you can unlist()
an AAStringSet, which gives you an AAString. What you can't do is pass an AAString to safeExplode()
. So yes, you need to turn it into an ordinary character string first.
I realize now that the from_string_to_XString
-based solution I provided previously would only be helpful for safely turning a single ordinary string into an AAString. To safely turn an ordinary character vector of arbitrary length into an AAStringSet, a couple extra steps are needed:
unlisted_aa <- from_string_to_XString(paste0(input, collapse=""), "AAString", AA_ALPHABET)
aa <- relist(unlisted_aa, PartitioningByWidth(nchar(input)))
This should be a lot more efficient than trying to validate the AAStringSet object a posteriori with safeExplode()
.
Fixed in PR #97
Thanks @ahl27 for solving this.
I worte on the BioC devel list a few days ago about a problem/behavior I have/I encounter with the AAString class. I haven't received a reply, so I am adding this as an issue, because in my opinion it is one. Thanks for having a look at it.
Below are the contents of the mail describing the issue:
I tried the following code, which should according to ?AAString not work, since ÜÖÄ are not part of any AA code.
I don’t have access right now to the devel version of Biostrings, bit I checked out the current Code in the github repo and its recent changes. I am pretty sure, that this behavior is also in the current devel branch. Can someone confirm this?
My current interest is in using the XString classes and methods for an additional biological string representation. The initial question was, how can I restrict this to a certain character set, if the characters are not saved byte encoded? The latter option is not available to me, since characters like ‚«‘ or ‚=‘ result in a two byte code using the charToRaw function. This trips up the build of the internal lookup table, which are passed down to the C backend.
Therefore I looked into, how this is done for an AAString differing from a BString. I discovered, that it currently doesn‘t. I also looked into the current 2.47.12 repo, which as far as I can tell does not use the AMINO_ACID_CODE constant in the creation of an AAString object.
So my questions are:
Is there a way to use byte encoding for chars with two ore more bytes?
Thanks in advance for any help and suggestions.
Best regards, Felix
PS: regarding the second question: One could change „as.integer(charToRaw(paste(letters, collapse="")))“ to „lapply(lapply(letters,charToRaw),as.integer)“ in .letterAsByteVal, but in any case it will not be atomic anymore, which I think is required to be excepted by the C backend. I didn’t test it.