cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0
14 stars 9 forks source link

Fix memory leak in alignment #410

Closed donkirkby closed 2 years ago

donkirkby commented 7 years ago

If you call align_it("", "some long sequence", ...), there is a memory leak in the C++ code. I worked around it in issue #409 by not calling the method when one of the strings is blank. However, it would be nice to fix the C++ code. I just didn't want to do it until Kive supports multiple versions of tools through the use of Docker images, as described in issue cfe-lab/Kive#681.

Here's the patch to line 509 that fixes the leak:

for (i = 0; i < max(2, M + 1); i++)

Here's a test that shows the problem. Just add this to the end of gotoh.cpp. (I had to comment out the Ruby stuff to let it compile.)

void run_test(int should_report) {
    const char * standard = "";
    const char * seq =      "GACGAGGGGAATCGCAGCGGTGGGCAAAGTGGTGGCTGCCCGCCAGATCACGAGAGAGCT";
    int gap_init_penalty = 10;
    int gap_extend_penalty = 3;
    int use_terminal_gap_penalty = 1;
    int score;

//    if (!PyArg_ParseTuple(args, "ssiii", &standard, &seq, &gap_init_penalty, &gap_extend_penalty, &use_terminal_gap_penalty)) {
//        return NULL;
//    }
//
    init_pairscore(5, 4); // match, mismatch scores +5, -4 respectively (HyPhy defaults)

    string* seqa = new string(standard);
    string* seqb = new string(seq);
    trim(seqa);
    trim(seqb);
    //degap(seqa);
    //degap(seqb);
    string* newseqa = new string();
    string* newseqb = new string();

    score = align(seqa, seqb, newseqa, newseqb, gap_init_penalty, gap_extend_penalty, use_terminal_gap_penalty);

//    PyObject * retval = Py_BuildValue("ssi", newseqa->c_str(), newseqb->c_str(), score);
    if (should_report) {
        printf("Alignment:\nref=%s\nseq=%s\n", newseqa->c_str(), newseqb->c_str());
    }

    delete seqa;
    delete seqb;
    delete newseqa;
    delete newseqb;
}

int main() {
    for (int i = 0; i < 100000; i++) {
        run_test(0);
    }
    run_test(1);
}
jeff-k commented 6 years ago

As an alternative to fixing the C++ code we can investigate using alternative implementations. String alignment algorithms are a well explored area in bioinformatics and there should be a lot of options to choose from, allowing for us to select for:

An alternative implementation allows us to reconsider the usage of align_it. When align_it is called we recompute the pairwise score table. Instead, we could instantiate an alignment object that precomputes this table, and perhaps allocates space for the reference sequence once per seed, and then in the remapping loops calls the relevant aligner. For example:

aligners = [Aligner(hiv1a), Aligner(hiv1b), Aligner(hiv2), ...)]
for query in queries:
    for aligner in aligners:
        alignment = aligner(query)
        ...

I propose exploring an external, statically compiled library with Python bindings that ensures compatibility with MiCall. I am a particular fan of rust-bio: https://github.com/rust-bio/rust-bio, as it satisfies all of the requirements mentioned above.

Implementing a Python extension in rust is very easy, and a lot easier to avoid memory errors than in C/C++. Here is a proof of concept that satisfies most uses of align_it in about 30 lines of code (though without the hiv25 score table): https://github.com/jeff-k/pyrustbio/blob/master/src/lib.rs

donkirkby commented 6 years ago

rust-bio sounds interesting. Another option to consider is @ArtPoon's gotoh2 project, which is an implementation in C. He was testing it last year, but I haven't heard any recent news. It might be useful to compare our current code, gotoh2, and rust-bio for speed, memory, and correctness.

jeff-k commented 6 years ago

We also have the option to explore alternatives to local alignment. With consensus sequences that approach the size of the seeds, Gotoh's O(mn) time complexity can be expected to grow "drastically" faster than semi-global Smith-Waterman at O(min(m, n) w) for a small window w (https://docs.rs/crate/bio/0.17.0/source/src/alignment/pairwise/banded.rs).

ArtPoon commented 6 years ago

Hi @jeff-k and @donkirkby - the last I worked on gotoh2, I had shaken out a few bugs and it seems to be fairly stable. It would be great if you give it a try and flag any errors or leaks for me to fix, but I would also understand if you go with a more actively maintained project!

donkirkby commented 6 years ago

Current plan: experiment with Biopython to see how its performance compares with our current Gotoh implementation, and also look at any differences in the alignment results. I'll start with one of the large samples mentioned in #435 and look for differences when merging read pairs.