dotnetbio / bio

Bioinformatics library for .NET
Apache License 2.0
143 stars 49 forks source link

Less Memory Usage / More Comments in NeedlemanWunschAligner #12

Closed evolvedmicrobe closed 8 years ago

evolvedmicrobe commented 8 years ago

For these dynamic programming problems, when traversing through the matrix one must store the entire traceback matrix, but the scores are only needed for the current row and the row above it.

In the current implementation, we only use two rows for the "top scoring" (and match) matrix, but fill out entire matrices for the deletion and insert (or vertical and horizontal) matrices. This is inefficient, as we then need 2 x M x N memory rather than 2 x 2 x N memory. I switched this to only use the smaller amount. Additionally:

evolvedmicrobe commented 8 years ago

I ran into some problems with the NW aligner using the affine aligner, and this pull request is really just about adding more tests to continue to verify the expected behavior, and while I was doing that I rewrote the algorithm to make it simpler to read and restructure a bit.

The only functional change should be not using as much memory. I changed it so that rather than have two full matrices, it only kept around the 2 rows it needed for the Gap matrices. This should mean that for an alignment of a 5669 x 5068 sequence, we will use 2 * 5068 * 4 * 2 bytes, instead of 5669 * 5068 * 4 * 2 bytes, for a savings of about ~229 MB.

Profiling this shows that we do save about this much memory and in single threaded land it is only about 7% faster, but does use less memory, which should help if a system is stressed.

Memory Time Version
360.5 20.214 Master
256.7 20.123 Master
433.6 20.25 Master
471.3 20.69 Master
451.2 20.229 Master
241.1 18.972 New
160.5 18.82 New
195.8 18.6 New
223.9 18.8 New

Further memory reduction could be had by storing the traceback matrices for the gaps as bytes instead of ints, but all of this only matters for long global alignments, which probably shouldn't be too frequent anyway.