Closed digitalwright closed 3 years ago
Hi @hpages,
These changes should improve the Biostrings
palindrome finding functions in the following ways:
(1) Fixes the bug described in the recent pull request from @twmcart.
(2) Implements a stack to track mismatches. This maintains speed of the findPalindromes
function even after the fix above.
(3) Implements min.looplength as described in the comment above.
(4) Adds an allow.wobble argument to treat wobble base pairing as a match, as described in the comment above.
Thanks for considering these changes.
Erik
Hi @hpages,
FYI, I realized today that the original palindromeArmLength
function was impractically slow on XStringViews objects output by findPalindromes
. This has been fixed in the latest version. I hope that's it for changes! 👍
Erik
Hi @digitalwright , @twmccart,
Thanks for the patch and sorry for the delay. This is a great patch. It addresses some long standing issues with the findPalindromes()
code and add some nice features. In hindsight maybe I should have called the function findHairpins()
. Shorter and more descriptive of what's really going on with DNA sequences, especially with the notions of loop and arms.
@digitalwright I have little feedback, mostly cosmetic, I hope it's ok:
Maybe the man page could say what a "wobble base pair" is in the description of the allow.wobble
argument? Also maybe add a simple example in the examples section.
Use seqtype()
to detect the type of string data ("B", "DNA", "RNA", or "AA") instead of relying on whether L2R_lkup
is set to NULL or not. It will be more reliable and more readable. So:
if (allow.wobble && !(seqtype(subject) %in% c("DNA", "RNA")))
stop("subject must be DNA or RNA if 'allow.wobble' is TRUE")
instead of:
if (allow.wobble && is.null(L2R_lkup))
stop("subject must be DNA or RNA if 'allow.wobble' is TRUE")
palindromeArmLength(x, allow.wobble=TRUE)
should also error when x
is not DNA or RNA, for consistency with findPalindromes(..., allow.wobble=TRUE)
.
About the optimization of palindromeArmLength()
on XStringViews objects: a cleaner implementation is to turn the XStringViews object into an XStringSet object with fromXStringViewsToStringSet(x, out.of.limits="error")
(very cheap, does not copy the string data), and to modify .Call entry point palindrome_arm_length()
to make it work on an XStringSet object. Would look something like this (untested pseudo code!):
SEXP palindrome_arm_length(SEXP x, SEXP max_mismatch, SEXP allow_wobble, SEXP L2R_lkup)
{
...
XStringSet_holder x_holder = _hold_XStringSet(x);
int x_len = _get_XStringSet_length(x);
...
for (int i = 0; i < x_len; i++) {
Chars_holder x_elt_holder =
_get_elt_from_XStringSet_holder(&x_holder, i);
ans_p[i] = get_palindrome_arm_length(x_elt_holder.ptr,
x_elt_holder.length,
max_nmis, allow_wobble1,
lkup, lkup_len);
}
...
}
One advantage of this is that if we decided to add a palindromeArmLength()
method for XStringSet objects (and there is not reason a priori to not have one, it's just an oversight that we don't have one at the moment), then the methods for XStringsSet and XStringViews objects would share the same C code. Sharing C code between XStringsSet and XStringViews methods, whenever possible, is used in many places in Biostrings.
Other minor coding style things (some of them for good reasons, others only to keep things more inline with overall Biostrings style):
At the R level:
attributes(x)$subject
, which is the same as using direct slot access on an S4 object) when accessors are available.isTRUEorFALSE()
instead of is.logical()
when a single non-NA logical is expected.At the C level:
LENGTH()
instead of length()
to get the length of an atomic vector.NEW_INTEGER(ans_len)
instead of allocVector(INTSXP, ans_len)
.c1 == 2
). if (ismatch ||
nmis++ < max_nmis) {
...
}
Just:
if (ismatch || nmis++ < max_nmis) {
...
}
Thanks!
Hi @hpages,
Thank you very much for your thorough review of the suggested modifications to the palindrome finding functions. I made all of the changes you requested. Please let me know if you would like anything else before incorporating these changes into Biostrings.
Thanks, Erik
Hi Erik,
The latest round of changes introduces 2 regressions in palindromeArmLength()
:
The integer vector returned by the palindromeArmLength()
methods for XStringViews/XStringSet is now returned invisibly:
library(Biostrings)
v <- Views(BString("oacacTXYZcacao"), 1, 14)
palindromeArmLength(v) # nothing gets displayed
The palindromeArmLength()
methods for XString/DNAString/RNAString are broken:
subject(v)
# 14-letter BString object
# seq: oacacTXYZcacao
palindromeArmLength(subject(v))
# Error in .Call2("palindrome_arm_length", x, max.mismatch, allow.wobble, :
# no slot of name "elementType" for this object of class "BString"
That's because internal helper .palindrome_arm_length()
now expects an XStringSet derivative. Simply coerce to XStringSet to turn an XString derivative into an XStringSet derivative of length 1:
setMethod("palindromeArmLength", "XString",
function(x, max.mismatch=0, allow.wobble=FALSE)
{
x <- as(x, "XStringSet")
.palindrome_arm_length(x, max.mismatch, allow.wobble, NULL)
}
)
You're welcome to add your name to the man page if you'd like.
Thanks again for all the changes and improvements.
H.
Hi @hpages,
Thanks for catching that. All fixed now.
Erik
Looks good. Thanks! H.
The attached code makes the following modifications:
(1) min.looplength is now implemented. Previously specification of a non-zero value would throw an error saying "will be implemented very soon!". This has been the case for many years.
(2) I added a new argument to allow.wobble that will accept G/T and T/G mismatches as matches. This makes the function more practical for finding hairpins since wobble bases are frequent in real palindromes. The default is FALSE to keep the code backwards compatible.