Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

findPalindrome / palindromeLeftArm and mismatches - extends into loop #32

Open jayoung opened 5 years ago

jayoung commented 5 years ago

hi there,

Another thought about findPalindromes. Something about the way it treats mismatches between palindrome arms is a little unintuitive, biologically, while it is probably correct in the formal sense. My biologist's intuition is that mismatches should be tolerated WITHIN the palindrome arms. However, the code also allows mismatches at the termini of palindromes, so that palindromes get extended into the loop region. I want to tolerate mismatches so I can recognize diverged palindrome arms, but I don't want to artificially extend the palindrome arm into the loop. See code below.

I also found some weird behavior with one of my examples - also illustrated below.

thanks again,

Janet


library(Biostrings)

#### set up test data
# pal1 is a perfect 10 char palindrome with a 7 char loop
# pal2 is the same thing, but with one mismatch between arms
pal1 <- BString("abcdefghijNOPQRSTjihgfedcba")
pal2 <- BString("abcXefghijNOPQRSTjihgfedcba")

## without mismatches (pal1), things look good:
findPalindromes(pal1, max.looplength=7)
#   Views on a 27-letter BString subject
# subject: abcdefghijNOPQRSTjihgfedcba
# views:
#     start end width
# [1]     1  27    27 [abcdefghijNOPQRSTjihgfedcba]
palindromeLeftArm(pal1, max.looplength=7)
#   10-letter "BString" instance
# seq: abcdefghij

## allowing mismatches extends the palidrome into the loop. I want to tolerate mismatches within arms, but not to extend them.
palindromeLeftArm(pal1, max.looplength=7, max.mismatch=1)
#   11-letter "BString" instance
# seq: abcdefghijN
palindromeLeftArm(pal1, max.looplength=7, max.mismatch=2)
#   12-letter "BString" instance
# seq: abcdefghijNO
palindromeLeftArm(pal1, max.looplength=7, max.mismatch=3)
#   13-letter "BString" instance
# seq: abcdefghijNOP

## and I found something weird (an error??) with pal2 (where there's a mismatch). palindromeLeftArm finds the longer palindrome when I allow a mismatch, but findPalindromes misses the longer palindrome and only reports the shorter one, truncating at the mismatch. Are they using different algorithms?   Maybe it's worth including a brief description of the algorithm used in the help.

findPalindromes(pal2, max.looplength=7, max.mismatch=1)
#   Views on a 27-letter BString subject
# subject: abcXefghijNOPQRSTjihgfedcba
# views:
#     start end width
# [1]     5  23    19 [efghijNOPQRSTjihgfe]

palindromeLeftArm(pal2, max.looplength=7, max.mismatch=1)
#   10-letter "BString" instance
# seq: abcXefghij

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.53.0   XVector_0.25.0      IRanges_2.19.10     S4Vectors_0.23.17   BiocGenerics_0.31.4

loaded via a namespace (and not attached):
[1] zlibbioc_1.31.0 compiler_3.6.1  tools_3.6.1    
hpages commented 5 years ago

The problem is that in general there is more than one way to look at a palindrome when mismatches are allowed and occur at the beginning or end of the arms. For example, if a maximum of 1 mismatch is allowed between the 2 arms, then the following palindrome:

abcdefghijNOPQRSTjihgfedcba

can be seen as:

abcdefghij    # 10-letter left arm
NOPQRST       # 7-letter loop
jihgfedcba    # 10-letter right arm (no mismatch between the left and right arms)

or as:

abcdefghijN   # 11-letter left arm
OPQRS         # 5-letter loop
Tjihgfedcba   # 11-letter right arm (1 mismatch between left and right arms)

Note that this 2nd form maximizes the arms and minimizes the loop.

The thing is that palindromeLeftArm(), palindromeRightArm(), and palindromeArmLength() always report the form that maximizes the arms and minimizes the loop, because this is what makes more sense in general. Although I can see that it can be a little bit confusing when the mismatch is between the last letter of the left arm and the first letter of the right arm.

Note that something similar happens if the mismatch is between the first letter of the left arm and the last letter of the right arm. For example, if a maximum of 1 mismatch is allowed between the 2 arms, then the following sequence:

XabcdcbaY

can be seen as containing the following palindrome:

abc     # 3-letter left arm
d      # 1-letter loop
cba    # 3-letter right arm (no mismatch between the left and right arms)

or containing the following palindrome:

Xabc   # 4-letter left arm
d      # 1-letter loop
cbaY   # 4-letter right arm (1 mismatch between left and right arms)

findPalindromes() and palindromeLeftArm() will find and report the palindrome with longest arms:

palindromeLeftArm(findPalindromes(BString("XabcdcbaY"), max.mismatch=1), max.mismatch=1)
#   Views on a 9-letter BString subject
# subject: XabcdcbaY
# views:
#     start end width
# [1]     1   4     4 [Xabc]

Hope this helps, H.

jayoung commented 5 years ago

I can see the challenge! It could get to be more like an alignment problem, where you might want to have different scores/penalties for mismatches versus extending length.

In fact, in my particular use case, the palindromes are long (>50bp), and I'm looking within individual long sequencing reads for palindromes (where most of sequence reads will have palindromes, perhaps with loops, and containing mismatches between arms). So I will probably be better off using pairwiseAlignment after I split each sequence in half and reverse-complementing the second half. I'll experiment with that instead.

hpages commented 5 years ago

Just to clarify: That will only work if the center of the palindrome is at the center of the read sequence. But not if you want to find palindromes at arbitrary locations in the read sequences.

jayoung commented 5 years ago

after a bit more testing, I'm not sure it's true that findPalindromes() will always find and report the palindrome with longest arms:

palindromeLeftArm(findPalindromes(BString("abcXefghijNOPQRSTjihgfedcba"), max.mismatch=1, max.looplength=8), max.mismatch=1)
#  Views on a 27-letter BString subject
# subject: abcXefghijNOPQRSTjihgfedcba
# views:
#     start end width
# [1]     5  11     7 [efghijN]
## 7-letter arms, 1 mismatch
## 5-letter loop OPQRS

whereas this alternative choice has longer arms (but also a shorter loop):

# abcXefghij  10-letter arms, 1 mismatch
# NOPQRST  7-letter loop

Is findPalindomes prioritizing minimizing loop length over maximizing arm length?

I can see there are many ways to do this, and I'm not sure what's the best way. But whatever way you choose, perhaps it would be good to include a little detail on the algorithm and scoring system in the man page, so that the method is more transparent. Someone using this naively (me!) may assume that there's an unambiguous best choice for how the best palindromes will be chosen. Perhaps also worth including a pointer to pairwiseAlignment, as that function might behave better for some combinations of arm length/loop length/mismatch number.

jayoung commented 5 years ago

yes, in my (likely unusual) use case, the palindromes should be centered at the read center, so I think pairwiseAlignment on the halves should work. The molecular biology should have done the palindrome-finding for us with this data.

Just to clarify: That will only work if the center of the palindrome is at the center of the read sequence. But not if you want to find palindromes at arbitrary locations in the read sequences.

hpages commented 5 years ago

I'm not sure it's true that findPalindromes() will always find and report the palindrome with longest arms

It's expected to do that. If it's not then it's a bug. I need to fix that. Also will clarify this in the man page.

twmccart commented 4 years ago

Is there any update on this bug? I am also seeing what appears to be prioritization of minimizing loop length instead of maximizing arm length. Could this be because when the C function get_find_palindromes_at() reports a match and resets arm_len to 0, it does not also reset the available errors? https://github.com/Bioconductor/Biostrings/blob/cb171e6af33f77a702964eea633ba19ea2360a50/src/find_palindromes.c#L41-L43

twmccart commented 4 years ago

I think I will have a fix for this today or tomorrow. Could @hpages contact me about how you would like pull requests handled?

hpages commented 3 years ago

Hi @jayoung @twmccart,

Erik's patch, which includes Thomas's earlier proposed changes, addresses the problem of findPalindromes() not always finding and reporting the palindrome with longest arms.

With Biostrings 2.59.2:

library(Biostrings)
pal2 <- BString("abcXefghijNOPQRSTjihgfedcba")
findPalindromes(pal2, max.looplength=7, max.mismatch=1)
# Views on a 27-letter BString subject
# subject: abcXefghijNOPQRSTjihgfedcba
# views:
      start end width
  [1]     5  23    19 [efghijNOPQRSTjihgfe]
  [2]     1  27    27 [abcXefghijNOPQRSTjihgfedcba]

The palindrome with longest arm is here. I'm not sure about also reporting the shorter one, don't see a good reason for it. This can pollute the results quite a bit if we increase the value of max.mismatch:

pal3 <- BString("abcdefghijABCabcdeNOPQRSTedcbaXYZjihgfedcba")
findPalindromes(pal3, max.looplength=7, max.mismatch=3)
# Views on a 43-letter BString subject
# subject: abcdefghijABCabcdeNOPQRSTedcbaXYZjihgfedcba
# views:
#       start end width
#   [1]    14  30    17 [abcdeNOPQRSTedcba]
#   [2]    13  31    19 [CabcdeNOPQRSTedcbaX]
#   [3]    12  32    21 [BCabcdeNOPQRSTedcbaXY]
#   [4]     1  43    43 [abcdefghijABCabcdeNOPQRSTedcbaXYZjihgfedcba]

which could quickly become an annoyance. Something that should probably be addressed. But at least the modified algorithm no longer misses results.

H.

jayoung commented 3 years ago

Thanks, all - this looks great.

One idea to handle the multiple short matches is to add an option to choose the palindrome with longest arms, or just let the user filter as they choose after the fact.

I've moved on (at least for now) from the project where I was using findPalindromes, so won't be testing this in any depth in the near future.