Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

findPalindromes and palindromeArmLength - inconsistent arguments #31

Closed jayoung closed 4 months ago

jayoung commented 5 years ago

hi there,

I was excited to see there's a findPalindromes function - this could be really useful for a project I'm working on at the moment.

I'll probably want to use some combination of findPalindromes and palindromeLeftArm (/RightArm). But it looks like findPalindromes and palindromeLeftArm/palindromeRightArm/palindromeArmLength are not behaving consistently with one another.

Seems like it would be sensible for all four palindrome functions to behave the same as each other (with same arguments and same defaults) - is that easy to implement?

Some code to show my problem is below.

Also, I intend to use the function a bunch of times on a bunch of different sequences. I imagine it might get slow, so I'm wondering if there could be a way to only call the function once and get all features of the palindrome hit returned at once? Right now I think I need to call findPalindromes to see the start and end positions of the palindrome, and then either palindromeArmLength or palindromeLeftArm/palindromeRightArm to determine the extent of each arm (because I'm allowing a loop, the arms are not simply half of the seqlength).

thanks, as always,

Janet

library(Biostrings)

#### set up test data
# pal1 is a perfect 10 char palindrome with a 3 char loop
# pal2 is the same thing, but with one mismatch between arms
# pal3 has two mismatches between arms
pal1 <- BString("abcdefghijNOPjihgfedcba")
pal2 <- BString("abcXefghijNOPjihgfedcba")
pal3 <- BString("aXcXefghijNOPjihgfedcba")

#### the perfect palindome with loop (pal1) behaves OK:
findPalindromes(pal1, max.looplength=3)
#   Views on a 23-letter BString subject
# subject: abcdefghijNOPjihgfedcba
# views:
#     start end width
# [1]     1  23    23 [abcdefghijNOPjihgfedcba]

palindromeLeftArm(pal1, max.looplength=3)
#   10-letter "BString" instance
# seq: abcdefghij
palindromeRightArm(pal1, max.looplength=3)
#   10-letter "BString" instance
# seq: jihgfedcba

#### but for pal2...
## findPalindromes looks OK - it's controllable using the args as I would like:
findPalindromes(pal2, max.looplength=3)
#   Views on a 23-letter BString subject
# subject: abcXefghijNOPjihgfedcba
# views:
#     start end width
# [1]     5  19    15 [efghijNOPjihgfe]

## but palindromeLeftArm / palindromeRightArm / palindromeArmLength don't pay attention to the criteria I'd like to supply, even though they don't raise a warning about unused arguments

palindromeLeftArm(pal2, max.looplength=3)
#   3-letter "BString" instance
# seq: abc

palindromeRightArm(pal2, max.looplength=3)
#  3-letter "BString" instance
# seq: cba

palindromeArmLength(pal2, max.looplength=3)
# [1] 3

palindromeArmLength(pal2, max.looplength=3, min.armlength=4)
# [1] 3

## gets a bit ridiculous to report super-short palindromes like this:
palindromeLeftArm(pal3, max.looplength=3, min.armlength=4)
#   1-letter "BString" instance
# seq: a

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.53.0   XVector_0.25.0      IRanges_2.19.10     S4Vectors_0.23.17   BiocGenerics_0.31.4

loaded via a namespace (and not attached):
[1] zlibbioc_1.31.0 compiler_3.6.1 
hpages commented 5 years ago

Hi Janet,

It's a 2-step process: you first use findPalindromes() to find all the palindromes in a given (possibly long) sequence, then you use palindromeLeftArm(), palindromeRightArm(), and/or palindromeArmLength() on the palindromes returned by findPalindromes() to extract the arms and/or arm lengths.

So in your case:

x2 <- BString("abcXefghijNOPjihgfedcba")
# Note that this is your 'pal2' sequence but I renamed it because it is NOT a
# palindrome (if mismatches are not allowed). This is why calling palindromeLeftArm()
# directly on 'x2' won't make much sense.

# You need to first use findPalindromes() to find the palindrome(s):
pal2 <- findPalindromes(x2, max.looplength=3)
pal2
#   Views on a 23-letter BString subject
# subject: abcXefghijNOPjihgfedcba
# views:
#     start end width
# [1]     5  19    15 [efghijNOPjihgfe]

# Then call palindromeLeftArm() on the result:
palindromeLeftArm(pal2)
#   Views on a 23-letter BString subject
# subject: abcXefghijNOPjihgfedcba
# views:
#     start end width
# [1]     5  10     6 [efghij]

Same thing with your pal3 sequence (also renamed here for clarity):

x3 <- BString("aXcXefghijNOPjihgfedcba")

pal3 <- findPalindromes(x3, max.looplength=3)
pal3
#   Views on a 23-letter BString subject
# subject: aXcXefghijNOPjihgfedcba
# views:
#     start end width
# [1]     5  19    15 [efghijNOPjihgfe]

palindromeLeftArm(pal3, max.looplength=3)
#   Views on a 23-letter BString subject
# subject: aXcXefghijNOPjihgfedcba
# views:
#     start end width
# [1]     5  10     6 [efghij]

There are plenty of examples in the man page that illustrate this.

Hope this makes sense. H.

jayoung commented 5 years ago

oh, I get it now! Thanks. I hadn't understood that from ?findPalindromes

Would it make sense to disallow running palindromeLeftArm/palindromeRightArm/palindromeArmLength functions directly on XString objects, like I tried to do? There are also examples in the manpage of running them directly on XStrings, albeit without any loops/mismatches, but it may have contributed to leading me astray

hpages commented 5 years ago

Well, there are situation where you might want to run these functions on an XStringSet or XString object e.g. if you have a FASTA file containing sequences that are already known to be palindromes. Typical situation would be to run findPalindromes() on a full genome, which will take a lot of time (this is the expensive operation), and save the results in a FASTA file. Then later, when you do your downstream analysis, read the FASTA file into an XStringSet object (with readBStringSet() or family) and use palindromeLeftArm(), palindromeRightArm(), or palindromeArmLength() on the XStringSet object.

hpages commented 5 years ago

I've tried to clarify this in the man page in Biostrings 2.53.3 (commit 00a94780d524515fc854d83ea72ccf35c04c41a8).

H.

jayoung commented 5 years ago

Thanks for clarifying the man page - that'll help.

In your use case where you run findPalindromes() on a full genome and then later run palindromeLeftArm() etc on the fasta files, it seems to me like you'd probably want to use the same parameters for calling palindromes that you'd used the first time around (specifying arm length, loop lengths, etc). If it would be straightforward to add the same options to those secondary functions, it seems worth doing. What do you think?

jayoung commented 3 years ago

hi there,

try using the ranges() function on your findPalindromes result, perhaps as.data.frame(ranges()).

Does that work?

Janet

On May 6, 2021, at 1:20 AM, meyerlaker @.***> wrote:

Hello! As @hpages suggested, I am trying to findPalindromes in a whole squence and save the output to a fasta file but I'm having issues exporting the results. I've tried saving them as a data frame and exporting via export.fasta() from bios2mds but the data frame only has one column with the palindromes and not the locations and I really need the locations tied to the palindromes. Can you help me? I am new to programming and bioinformatics so sorry if its a dumb question or an obvious answer ;-)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

hpages commented 3 years ago

@meyerlaker @jayoung I don't see @meyerlaker post here, could it be that they've deleted it? Anyways, they've posted their question again on our support site: https://support.bioconductor.org/p/9136884/ See my answer there.

H.

jayoung commented 3 years ago

yes, support is a better place for it. I was a bit confused that I didn't see it here, just in my email.

I like your new as.data.frames!

meyerlaker commented 3 years ago

Thanks for your responses @hpages @jayoung , I've got it working. Yes I deleted here as I didnt really think it was an issue with the package as such and was more appropriate over there although as it turns out you have modified it for the next version. Nice!

hpages commented 3 years ago

@meyerlaker No problem.