Open diegozea opened 5 years ago
How do we know it's BioJulia and not biopython that has the bug?
The key difference seems to be the BioJulia version inserts a bigger gap near the middle, to match more elements furthur along the sequence.
Is it possible to print out how the BioJulia and the biopython algorithms fill out the dynamic programming matrix? If they're the same, then theres something different in how traceback occurs, if they differ, then something is being scored differently.
The issue here is that BioJulia handles gap open and extend penalties a bit differently from BioPython. A gap of size k
costs gap_open + k * gap_extend
in BioJulia. The same gap costs gap_open + (k - 1) * gap_extend
in BioPython. So, in order to test them against each other, one would have to subtract one extension penalty from the open penalty in Python, i.e. use -11 in this example:
pairwise2.align.localds(str1, str2, matrix, -11, -1)
Is there a reason to prefer gap_open + k * gap_extend
over gap_open + (k - 1) * gap_extend
?
From what I've seen in the literature, original papers from the 80's and 90's mostly use the gap_open + k * gap_extend
notation. One could argue that gap_open + (k - 1) * gap_extend
is somewhat cleaner, because it shows that the penalty for the first character in the gap is gap_open
(as opposed to the other notation, where the penalty for the first character is gap_open + gap_extend
).
This is one of those cases though where once you've picked a notation you don't want to change it, because if you do everybody's code will silently stop working after upgrading.
We should keep this issue pinned, perhaps until we add a page to the docs answering FAQs like this: "Hey why does BioAlignments give a different result to XYZ?"
While I was benchmarking Julia's
pairalign
(318 μs) against BioPython (7310 µs) and scikit-bio (925 µs) implementation, I found that Julia's local alignment returns a solution with score 63 while python solutions have a score of 65. That makes me think that we have some hidden bug there, and we are not hitting the right solution.Python 3.5.2
In this case, BioJulia's solution is part of BioPython's first solution.