Open jayoung opened 5 years ago
The problem is that in general there is more than one way to look at a palindrome when mismatches are allowed and occur at the beginning or end of the arms. For example, if a maximum of 1 mismatch is allowed between the 2 arms, then the following palindrome:
abcdefghijNOPQRSTjihgfedcba
can be seen as:
abcdefghij # 10-letter left arm
NOPQRST # 7-letter loop
jihgfedcba # 10-letter right arm (no mismatch between the left and right arms)
or as:
abcdefghijN # 11-letter left arm
OPQRS # 5-letter loop
Tjihgfedcba # 11-letter right arm (1 mismatch between left and right arms)
Note that this 2nd form maximizes the arms and minimizes the loop.
The thing is that palindromeLeftArm()
, palindromeRightArm()
, and palindromeArmLength()
always report the form that maximizes the arms and minimizes the loop, because this is what makes more sense in general. Although I can see that it can be a little bit confusing when the mismatch is between the last letter of the left arm and the first letter of the right arm.
Note that something similar happens if the mismatch is between the first letter of the left arm and the last letter of the right arm. For example, if a maximum of 1 mismatch is allowed between the 2 arms, then the following sequence:
XabcdcbaY
can be seen as containing the following palindrome:
abc # 3-letter left arm
d # 1-letter loop
cba # 3-letter right arm (no mismatch between the left and right arms)
or containing the following palindrome:
Xabc # 4-letter left arm
d # 1-letter loop
cbaY # 4-letter right arm (1 mismatch between left and right arms)
findPalindromes()
and palindromeLeftArm()
will find and report the palindrome with longest arms:
palindromeLeftArm(findPalindromes(BString("XabcdcbaY"), max.mismatch=1), max.mismatch=1)
# Views on a 9-letter BString subject
# subject: XabcdcbaY
# views:
# start end width
# [1] 1 4 4 [Xabc]
Hope this helps, H.
I can see the challenge! It could get to be more like an alignment problem, where you might want to have different scores/penalties for mismatches versus extending length.
In fact, in my particular use case, the palindromes are long (>50bp), and I'm looking within individual long sequencing reads for palindromes (where most of sequence reads will have palindromes, perhaps with loops, and containing mismatches between arms). So I will probably be better off using pairwiseAlignment after I split each sequence in half and reverse-complementing the second half. I'll experiment with that instead.
Just to clarify: That will only work if the center of the palindrome is at the center of the read sequence. But not if you want to find palindromes at arbitrary locations in the read sequences.
after a bit more testing, I'm not sure it's true that findPalindromes() will always find and report the palindrome with longest arms:
palindromeLeftArm(findPalindromes(BString("abcXefghijNOPQRSTjihgfedcba"), max.mismatch=1, max.looplength=8), max.mismatch=1)
# Views on a 27-letter BString subject
# subject: abcXefghijNOPQRSTjihgfedcba
# views:
# start end width
# [1] 5 11 7 [efghijN]
## 7-letter arms, 1 mismatch
## 5-letter loop OPQRS
whereas this alternative choice has longer arms (but also a shorter loop):
# abcXefghij 10-letter arms, 1 mismatch
# NOPQRST 7-letter loop
Is findPalindomes prioritizing minimizing loop length over maximizing arm length?
I can see there are many ways to do this, and I'm not sure what's the best way. But whatever way you choose, perhaps it would be good to include a little detail on the algorithm and scoring system in the man page, so that the method is more transparent. Someone using this naively (me!) may assume that there's an unambiguous best choice for how the best palindromes will be chosen. Perhaps also worth including a pointer to pairwiseAlignment, as that function might behave better for some combinations of arm length/loop length/mismatch number.
yes, in my (likely unusual) use case, the palindromes should be centered at the read center, so I think pairwiseAlignment on the halves should work. The molecular biology should have done the palindrome-finding for us with this data.
Just to clarify: That will only work if the center of the palindrome is at the center of the read sequence. But not if you want to find palindromes at arbitrary locations in the read sequences.
I'm not sure it's true that findPalindromes() will always find and report the palindrome with longest arms
It's expected to do that. If it's not then it's a bug. I need to fix that. Also will clarify this in the man page.
Is there any update on this bug? I am also seeing what appears to be prioritization of minimizing loop length instead of maximizing arm length. Could this be because when the C function get_find_palindromes_at() reports a match and resets arm_len
to 0, it does not also reset the available errors?
https://github.com/Bioconductor/Biostrings/blob/cb171e6af33f77a702964eea633ba19ea2360a50/src/find_palindromes.c#L41-L43
I think I will have a fix for this today or tomorrow. Could @hpages contact me about how you would like pull requests handled?
Hi @jayoung @twmccart,
Erik's patch, which includes Thomas's earlier proposed changes, addresses the problem of findPalindromes()
not always finding and reporting the palindrome with longest arms.
With Biostrings 2.59.2:
library(Biostrings)
pal2 <- BString("abcXefghijNOPQRSTjihgfedcba")
findPalindromes(pal2, max.looplength=7, max.mismatch=1)
# Views on a 27-letter BString subject
# subject: abcXefghijNOPQRSTjihgfedcba
# views:
start end width
[1] 5 23 19 [efghijNOPQRSTjihgfe]
[2] 1 27 27 [abcXefghijNOPQRSTjihgfedcba]
The palindrome with longest arm is here. I'm not sure about also reporting the shorter one, don't see a good reason for it. This can pollute the results quite a bit if we increase the value of max.mismatch
:
pal3 <- BString("abcdefghijABCabcdeNOPQRSTedcbaXYZjihgfedcba")
findPalindromes(pal3, max.looplength=7, max.mismatch=3)
# Views on a 43-letter BString subject
# subject: abcdefghijABCabcdeNOPQRSTedcbaXYZjihgfedcba
# views:
# start end width
# [1] 14 30 17 [abcdeNOPQRSTedcba]
# [2] 13 31 19 [CabcdeNOPQRSTedcbaX]
# [3] 12 32 21 [BCabcdeNOPQRSTedcbaXY]
# [4] 1 43 43 [abcdefghijABCabcdeNOPQRSTedcbaXYZjihgfedcba]
which could quickly become an annoyance. Something that should probably be addressed. But at least the modified algorithm no longer misses results.
H.
Thanks, all - this looks great.
One idea to handle the multiple short matches is to add an option to choose the palindrome with longest arms, or just let the user filter as they choose after the fact.
I've moved on (at least for now) from the project where I was using findPalindromes, so won't be testing this in any depth in the near future.
hi there,
Another thought about findPalindromes. Something about the way it treats mismatches between palindrome arms is a little unintuitive, biologically, while it is probably correct in the formal sense. My biologist's intuition is that mismatches should be tolerated WITHIN the palindrome arms. However, the code also allows mismatches at the termini of palindromes, so that palindromes get extended into the loop region. I want to tolerate mismatches so I can recognize diverged palindrome arms, but I don't want to artificially extend the palindrome arm into the loop. See code below.
I also found some weird behavior with one of my examples - also illustrated below.
thanks again,
Janet