Closed donkirkby closed 2 years ago
As an alternative to fixing the C++ code we can investigate using alternative implementations. String alignment algorithms are a well explored area in bioinformatics and there should be a lot of options to choose from, allowing for us to select for:
An alternative implementation allows us to reconsider the usage of align_it. When align_it is called we recompute the pairwise score table. Instead, we could instantiate an alignment object that precomputes this table, and perhaps allocates space for the reference sequence once per seed, and then in the remapping loops calls the relevant aligner. For example:
aligners = [Aligner(hiv1a), Aligner(hiv1b), Aligner(hiv2), ...)]
for query in queries:
for aligner in aligners:
alignment = aligner(query)
...
I propose exploring an external, statically compiled library with Python bindings that ensures compatibility with MiCall. I am a particular fan of rust-bio: https://github.com/rust-bio/rust-bio, as it satisfies all of the requirements mentioned above.
Implementing a Python extension in rust is very easy, and a lot easier to avoid memory errors than in C/C++. Here is a proof of concept that satisfies most uses of align_it in about 30 lines of code (though without the hiv25 score table): https://github.com/jeff-k/pyrustbio/blob/master/src/lib.rs
rust-bio sounds interesting. Another option to consider is @ArtPoon's gotoh2 project, which is an implementation in C. He was testing it last year, but I haven't heard any recent news. It might be useful to compare our current code, gotoh2, and rust-bio for speed, memory, and correctness.
We also have the option to explore alternatives to local alignment. With consensus sequences that approach the size of the seeds, Gotoh's O(mn) time complexity can be expected to grow "drastically" faster than semi-global Smith-Waterman at O(min(m, n) w) for a small window w (https://docs.rs/crate/bio/0.17.0/source/src/alignment/pairwise/banded.rs).
Hi @jeff-k and @donkirkby - the last I worked on gotoh2, I had shaken out a few bugs and it seems to be fairly stable. It would be great if you give it a try and flag any errors or leaks for me to fix, but I would also understand if you go with a more actively maintained project!
Current plan: experiment with Biopython to see how its performance compares with our current Gotoh implementation, and also look at any differences in the alignment results. I'll start with one of the large samples mentioned in #435 and look for differences when merging read pairs.
If you call
align_it("", "some long sequence", ...)
, there is a memory leak in the C++ code. I worked around it in issue #409 by not calling the method when one of the strings is blank. However, it would be nice to fix the C++ code. I just didn't want to do it until Kive supports multiple versions of tools through the use of Docker images, as described in issue cfe-lab/Kive#681.Here's the patch to line 509 that fixes the leak:
Here's a test that shows the problem. Just add this to the end of
gotoh.cpp
. (I had to comment out the Ruby stuff to let it compile.)