Fix memory leak in alignment

donkirkby commented 7 years ago

If you call align_it("", "some long sequence", ...), there is a memory leak in the C++ code. I worked around it in issue #409 by not calling the method when one of the strings is blank. However, it would be nice to fix the C++ code. I just didn't want to do it until Kive supports multiple versions of tools through the use of Docker images, as described in issue cfe-lab/Kive#681.

Here's the patch to line 509 that fixes the leak:

for (i = 0; i < max(2, M + 1); i++)

Here's a test that shows the problem. Just add this to the end of gotoh.cpp. (I had to comment out the Ruby stuff to let it compile.)

void run_test(int should_report) {
    const char * standard = "";
    const char * seq =      "GACGAGGGGAATCGCAGCGGTGGGCAAAGTGGTGGCTGCCCGCCAGATCACGAGAGAGCT";
    int gap_init_penalty = 10;
    int gap_extend_penalty = 3;
    int use_terminal_gap_penalty = 1;
    int score;

//    if (!PyArg_ParseTuple(args, "ssiii", &standard, &seq, &gap_init_penalty, &gap_extend_penalty, &use_terminal_gap_penalty)) {
//        return NULL;
//    }
//
    init_pairscore(5, 4); // match, mismatch scores +5, -4 respectively (HyPhy defaults)

    string* seqa = new string(standard);
    string* seqb = new string(seq);
    trim(seqa);
    trim(seqb);
    //degap(seqa);
    //degap(seqb);
    string* newseqa = new string();
    string* newseqb = new string();

    score = align(seqa, seqb, newseqa, newseqb, gap_init_penalty, gap_extend_penalty, use_terminal_gap_penalty);

//    PyObject * retval = Py_BuildValue("ssi", newseqa->c_str(), newseqb->c_str(), score);
    if (should_report) {
        printf("Alignment:\nref=%s\nseq=%s\n", newseqa->c_str(), newseqb->c_str());
    }

    delete seqa;
    delete seqb;
    delete newseqa;
    delete newseqb;
}

int main() {
    for (int i = 0; i < 100000; i++) {
        run_test(0);
    }
    run_test(1);
}

jeff-k commented 6 years ago

As an alternative to fixing the C++ code we can investigate using alternative implementations. String alignment algorithms are a well explored area in bioinformatics and there should be a lot of options to choose from, allowing for us to select for:

Optimization: Experiments with this code demonstrate that native Python implementations and even python wrappers around compiled implementations (BioPython) are not fast enough for our purposes.
Memory safety: Dynamic programming and string manipulation make it easy to run into memory issues in languages that achieve the level of optimization that we want here. This is further complicated when we wrap non-garbage collected code in a Python extension.
Modularity, portability: My stake in this issue is that we have other applications that demand pairwise sequence alignment in our pipeline and risk reinventing this solution within the same repo. For example, a nicely modularized solution should also be able to support read pair merging. Portability is difficult for statically compiled Python extensions, so we want to make sure that such an extension is easy to deploy on new systems. This is related to the Docker tool versioning issue raised by Don.
Stability and support: A healthy project that has many maintainers and users is less at risk of becoming deprecated and more likely to dedicate attention to testing.

An alternative implementation allows us to reconsider the usage of align_it. When align_it is called we recompute the pairwise score table. Instead, we could instantiate an alignment object that precomputes this table, and perhaps allocates space for the reference sequence once per seed, and then in the remapping loops calls the relevant aligner. For example:

aligners = [Aligner(hiv1a), Aligner(hiv1b), Aligner(hiv2), ...)]
for query in queries:
    for aligner in aligners:
        alignment = aligner(query)
        ...

I propose exploring an external, statically compiled library with Python bindings that ensures compatibility with MiCall. I am a particular fan of rust-bio: https://github.com/rust-bio/rust-bio, as it satisfies all of the requirements mentioned above.

Implementing a Python extension in rust is very easy, and a lot easier to avoid memory errors than in C/C++. Here is a proof of concept that satisfies most uses of align_it in about 30 lines of code (though without the hiv25 score table): https://github.com/jeff-k/pyrustbio/blob/master/src/lib.rs

donkirkby commented 6 years ago

rust-bio sounds interesting. Another option to consider is @ArtPoon's gotoh2 project, which is an implementation in C. He was testing it last year, but I haven't heard any recent news. It might be useful to compare our current code, gotoh2, and rust-bio for speed, memory, and correctness.

jeff-k commented 6 years ago

We also have the option to explore alternatives to local alignment. With consensus sequences that approach the size of the seeds, Gotoh's O(mn) time complexity can be expected to grow "drastically" faster than semi-global Smith-Waterman at O(min(m, n) w) for a small window w (https://docs.rs/crate/bio/0.17.0/source/src/alignment/pairwise/banded.rs).

ArtPoon commented 6 years ago

Hi @jeff-k and @donkirkby - the last I worked on gotoh2, I had shaken out a few bugs and it seems to be fairly stable. It would be great if you give it a try and flag any errors or leaks for me to fix, but I would also understand if you go with a more actively maintained project!

donkirkby commented 6 years ago

Current plan: experiment with Biopython to see how its performance compares with our current Gotoh implementation, and also look at any differences in the alignment results. I'll start with one of the large samples mentioned in #435 and look for differences when merging read pairs.

cfe-lab / MiCall

Fix memory leak in alignment #410