Less Memory Usage / More Comments in NeedlemanWunschAligner

dotnetbio / bio

Bioinformatics library for .NET

Apache License 2.0


For these dynamic programming problems, when traversing through the matrix one must store the entire traceback matrix, but the scores are only needed for the current row and the row above it.
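To make that distinction concrete, here is a minimal Python sketch (not the library's C# code) of a global alignment with a linear gap penalty. It keeps a full traceback matrix but only two rows of scores. The function name, the move codes, and the scoring defaults are assumptions chosen for illustration.

```python
# Move codes stored in the traceback matrix (illustrative values).
DIAG, UP, LEFT = 0, 1, 2


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Scores live in two rows only; the traceback is stored in full.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Full traceback: one byte per cell, rows x cols in total.
    trace = [bytearray(cols) for _ in range(rows)]
    for j in range(1, cols):
        trace[0][j] = LEFT
    # Score row 0: j leading gaps cost j * gap.
    prev = [j * gap for j in range(cols)]
    for i in range(1, rows):
        curr = [i * gap] + [0] * (cols - 1)
        trace[i][0] = UP
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b, consumes a[i-1]
            left = curr[j - 1] + gap  # gap in a, consumes b[j-1]
            best = max(diag, up, left)
            curr[j] = best
            trace[i][j] = DIAG if best == diag else (UP if best == up else LEFT)
        prev = curr  # the row above is all the next iteration needs
    score = prev[-1]

    # Walk the traceback from the bottom-right corner to the origin.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        move = trace[i][j]
        if move == DIAG:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif move == UP:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score, "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The score rows cost O(N) memory, while the traceback still costs O(M x N), which is why the traceback dominates for long sequences.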

In the current implementation, we only use two rows for the "top scoring" (and match) matrix, but fill out entire matrices for the deletion and insertion (or vertical and horizontal) gap matrices. This is inefficient: the gap matrices then need 2 x M x N memory rather than 2 x 2 x N. I switched them to keep only the two rows they need. Additionally:

Changed the minimum score on the edges from an arbitrary low number to a constant near Int32.MinValue for clarity.
Added more comments throughout.
Rather than using three arrays, I used one array of a struct, which may help memory locality for very large global alignments.
Added more unit tests that compare directly against the EMBOSS Needle alignment program. I found and reported a bug in NEEDLE while doing that.
Changed all the vague "Priority2" category labels in the NW tests to a newer version.
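The two-row layout, the single array of structs, and the edge sentinel can be sketched together. This is an illustrative Python version of an affine-gap global alignment score (Gotoh-style recurrences), not the code in this pull request. The `Cell` type, the function name, the scoring defaults, and the exact sentinel value are assumptions. The sketch assumes a gap of length k costs gap_open + (k - 1) * gap_extend.

```python
from dataclasses import dataclass

# Assumed sentinel: close to the 32-bit minimum but offset, so that adding a
# penalty to it would not wrap around in a fixed-width integer language.
NEG_INF = -(2**31) + 1_000_000


@dataclass
class Cell:
    """One struct per column instead of three parallel arrays."""
    best: int   # best score for this cell over all states
    vert: int   # best score ending in a vertical gap (consumes a)
    horiz: int  # best score ending in a horizontal gap (consumes b)


def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score of the best global alignment, keeping only two rows of cells."""
    cols = len(b) + 1
    # Row 0: only horizontal gaps are possible; vertical states are blocked.
    prev = [Cell(0, NEG_INF, NEG_INF)]
    for j in range(1, cols):
        h = gap_open + (j - 1) * gap_extend
        prev.append(Cell(h, NEG_INF, h))
    for i in range(1, len(a) + 1):
        # Column 0: only vertical gaps are possible.
        v = gap_open + (i - 1) * gap_extend
        curr = [Cell(v, v, NEG_INF)]
        for j in range(1, cols):
            vert = max(prev[j].best + gap_open, prev[j].vert + gap_extend)
            horiz = max(curr[j - 1].best + gap_open,
                        curr[j - 1].horiz + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best = max(prev[j - 1].best + sub, vert, horiz)
            curr.append(Cell(best, vert, horiz))
        prev = curr  # discard the older row
    return prev[-1].best


print(affine_global_score("AAAA", "AA"))  # -2: two matches, one gap of length 2
```

Grouping the three scores in one cell means a single lookup of `prev[j]` brings in everything the recurrence reads from the row above, which is the locality argument made in the list.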

I ran into some problems with the NW aligner when using the affine aligner, and this pull request is really just about adding more tests to keep verifying the expected behavior. While doing that, I rewrote the algorithm to make it simpler to read and restructured it a bit.

The only functional change should be lower memory use. Rather than keeping two full gap matrices, the code now keeps only the 2 rows it needs for each of them. For an alignment of two sequences of lengths 5669 and 5068, this means we use 2 * 5068 * 4 * 2 bytes instead of 5669 * 5068 * 4 * 2 bytes, a saving of about 230 MB (229,762,848 bytes).
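The arithmetic can be checked with a few lines of Python. The helper name is hypothetical; the 4-byte integer size and the two gap matrices come from the figures above.

```python
def gap_matrix_bytes(rows, cols, bytes_per_int=4, matrices=2):
    """Bytes needed for `matrices` integer matrices of size rows x cols."""
    return rows * cols * bytes_per_int * matrices


M, N = 5669, 5068
full = gap_matrix_bytes(M, N)      # 229,843,936 bytes for full matrices
two_rows = gap_matrix_bytes(2, N)  # 81,088 bytes for two rows each
saved = full - two_rows            # 229,762,848 bytes

print(f"{saved / 1e6:.1f} MB")     # 229.8 MB
print(f"{saved / 2**20:.1f} MiB")  # 219.1 MiB
```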

Profiling shows that we do save roughly this much memory. Single-threaded, the new version is only about 7% faster, but the lower memory use should help if a system is under memory pressure.

Memory	Time	Version
360.5	20.214	Master
256.7	20.123	Master
433.6	20.25	Master
471.3	20.69	Master
451.2	20.229	Master
241.1	18.972	New
160.5	18.82	New
195.8	18.6	New
223.9	18.8	New
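As a quick check of the "about 7%" figure, the runs in the table can be averaged per version. This is a small sketch over the numbers above; the units are whatever the profiler reported and are not restated here.

```python
from statistics import mean

# (memory, time, version) rows copied from the table above.
runs = [
    (360.5, 20.214, "Master"), (256.7, 20.123, "Master"),
    (433.6, 20.25, "Master"), (471.3, 20.69, "Master"),
    (451.2, 20.229, "Master"),
    (241.1, 18.972, "New"), (160.5, 18.82, "New"),
    (195.8, 18.6, "New"), (223.9, 18.8, "New"),
]


def summarize(rows):
    """Return {version: (mean memory, mean time)}."""
    groups = {}
    for memory, time, version in rows:
        groups.setdefault(version, []).append((memory, time))
    return {
        version: (mean(m for m, _ in vals), mean(t for _, t in vals))
        for version, vals in groups.items()
    }


summary = summarize(runs)
master_mem, master_time = summary["Master"]
new_mem, new_time = summary["New"]
time_reduction = 1 - new_time / master_time

print(f"mean memory: {master_mem:.1f} -> {new_mem:.1f}")
print(f"mean time:   {master_time:.2f} -> {new_time:.2f}")
print(f"time reduction: {time_reduction:.1%}")
```

On these nine runs the mean time drops by roughly 7.4%, and the mean memory figure drops by roughly 189, somewhat less than the theoretical saving, with noticeable run-to-run variation.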

Further memory reduction could be had by storing the traceback matrices for the gaps as bytes instead of ints, but all of this only matters for long global alignments, which probably shouldn't be too frequent anyway.
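A traceback cell only has to record which of a few moves was taken, so one byte per cell is enough. The sketch below shows one way to hold such a matrix in a flat byte buffer in Python. The class name and move codes are hypothetical, and this is not what the pull request implements.

```python
class ByteTraceback:
    """rows x cols traceback stored as one byte per cell in a flat buffer."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.cells = bytearray(rows * cols)  # zero-filled, 1 byte per cell

    def set(self, i, j, move):
        self.cells[i * self.cols + j] = move  # move must fit in 0..255

    def get(self, i, j):
        return self.cells[i * self.cols + j]


tb = ByteTraceback(3, 4)
tb.set(2, 3, 2)       # e.g. 2 could stand for a "left" move
print(tb.get(2, 3))   # 2
print(len(tb.cells))  # 12 bytes, versus 48 with 4-byte ints
```

For the 5669 x 5068 case above, each such matrix would take 28,730,492 bytes instead of 114,921,968 bytes with 4-byte ints.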

Less Memory Usage / More Comments in NeedlemanWunschAligner #12