bitextor / bleualign-cpp

GNU General Public License v3.0

Bleualign scales badly #9

Open jelmervdl opened 4 years ago

jelmervdl commented 4 years ago

If you give bleualign large input, it will effectively run forever, which can hold up the whole pipeline. It would be great if we could either make it perform better for large document pairs, or give it an option to bail out when it knows it's not going to finish in a reasonable amount of time.

Really this issue is here so I can track/brain dump my findings & attach this trouble.gz file for reference. This is what stalled some bleualign processes for a couple of hours. It contains 174741 vs 175306 lines. Luckily that's not too typical for Paracrawl, but it happens.

Just ran the file on my local machine: it takes 50 minutes and uses up to 28 GB of memory. (And Apple Instruments can't export…, so here's a screenshot)

Screenshot 2020-08-18 at 18 30 22

Most time and memory is spent in search.cpp. I suspect the memory is due to the M×N back_pointer matrix. Most time is spent in boost's find_node, [called N×M times in the loop](https://github.com/bitextor/bleualign-cpp/blob/master/src/search.cpp#L74).

jelmervdl commented 4 years ago

https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm ?

mlforcada commented 4 years ago

On a quick reading, I fail to see any fundamental difference with the regular Wagner–Fischer implementation of Levenshtein edit distance. In fact, Hirschberg's algorithm seems to be a variation of the Needleman–Wunsch algorithm, which in turn looks a lot like a clever trick on top of Levenshtein edit distance to store only two rows or two columns (whichever is smaller). Time complexities are O(nm) for all three algorithms (unless one assumes that alignments fall on a diagonal strip, in which case the algorithm is O(m) or O(n), whichever is smaller).

On 19/8/20 at 18:43, Jelmer wrote:

https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm ?
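The two-row trick described above can be sketched like this (an illustration only, with unit edit costs standing in for bleualign's actual BLEU-based scoring):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Wagner–Fischer edit distance between two token sequences, keeping only
// two DP rows, so memory is O(min(n, m)) instead of O(n * m).
// (Unit costs here are a stand-in; bleualign scores alignments with BLEU.)
int edit_distance_two_rows(const std::vector<std::string> &a,
                           const std::vector<std::string> &b) {
  // Roll the DP over the shorter sequence to minimise row size.
  const auto &shorter = a.size() <= b.size() ? a : b;
  const auto &longer  = a.size() <= b.size() ? b : a;
  std::vector<int> prev(shorter.size() + 1), cur(shorter.size() + 1);
  for (size_t j = 0; j <= shorter.size(); ++j) prev[j] = static_cast<int>(j);
  for (size_t i = 1; i <= longer.size(); ++i) {
    cur[0] = static_cast<int>(i);
    for (size_t j = 1; j <= shorter.size(); ++j) {
      int subst = prev[j - 1] + (longer[i - 1] == shorter[j - 1] ? 0 : 1);
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
    }
    std::swap(prev, cur);
  }
  return prev[shorter.size()];
}
```

Note this yields only the distance; recovering the actual alignment path with low memory is exactly what Hirschberg's divide-and-conquer step adds on top.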


-- Mikel L. Forcada http://www.dlsi.ua.es/~mlf/ Departament de Llenguatges i Sistemes Informàtics Universitat d'Alacant E-03690 Sant Vicent del Raspeig Spain Office: +34 96 590 9776

jelmervdl commented 4 years ago

Indeed, it doesn't change the time complexity, but it might help with getting the runtime on beefier machines down by using them a bit more optimally. I'm running into this on CSD3, where I can run a couple of bleualign in parallel using GNU parallel, but am still limited by the runtime and memory usage of each individual one.

I think it comes down to this: my two main issues with the current algorithm for searching for alignments are that it cannot scale by throwing more hardware at it. The memory usage grows quadratically with input size, and the runtime is stuck on a single thread.

The memory issue is due to the back-pointer array that's used to find the shortest "edit distance", the one filled by process() and read by extract_matches(). (At 174741 × 175306 cells, even a single byte per cell is already about 28.5 GiB, which lines up with the memory use observed above.) As far as I understand Hirschberg, no back-pointer matrix is necessary: it is more of a divide-and-conquer approach, remembering only the optimal split point for each section of the search space.
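A rough sketch of what that could look like (hedged: unit edit costs instead of bleualign's BLEU scoring, and no attempt to match the existing process()/extract_matches() interface). The key idea is that the full back-pointer matrix is replaced by two linear-memory score rows plus a recursion on split points:

```cpp
#include <algorithm>
#include <climits>
#include <string>
#include <utility>
#include <vector>

using Seq = std::vector<std::string>;

// Last row of the edit-distance DP between a and b, using O(|b|) memory.
std::vector<int> score_row(const Seq &a, const Seq &b) {
  std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
  for (size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<int>(i);
    for (size_t j = 1; j <= b.size(); ++j) {
      int subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
    }
    std::swap(prev, cur);
  }
  return prev;
}

// Hirschberg-style recursion: recover matched (i, j) pairs without any
// back-pointer matrix. Only two DP rows plus the recursion stack are live.
void hirschberg(const Seq &a, const Seq &b, size_t ai, size_t bi,
                std::vector<std::pair<size_t, size_t>> &pairs) {
  if (a.empty() || b.empty()) return;
  if (a.size() == 1) {  // base case: place the single line where it matches
    for (size_t j = 0; j < b.size(); ++j)
      if (a[0] == b[j]) { pairs.emplace_back(ai, bi + j); return; }
    return;
  }
  size_t mid = a.size() / 2;
  Seq a_top(a.begin(), a.begin() + mid), a_bot(a.begin() + mid, a.end());
  Seq a_bot_rev(a_bot.rbegin(), a_bot.rend());
  Seq b_rev(b.rbegin(), b.rend());
  std::vector<int> left = score_row(a_top, b);       // forward pass
  std::vector<int> right = score_row(a_bot_rev, b_rev);  // backward pass
  // Choose the split of b that minimises the combined cost.
  size_t split = 0;
  int best = INT_MAX;
  for (size_t j = 0; j <= b.size(); ++j) {
    int cost = left[j] + right[b.size() - j];
    if (cost < best) { best = cost; split = j; }
  }
  hirschberg(a_top, Seq(b.begin(), b.begin() + split), ai, bi, pairs);
  hirschberg(a_bot, Seq(b.begin() + split, b.end()), ai + mid, bi + split, pairs);
}
```

For the trouble.gz sizes this would keep live memory proportional to one row (~175k ints) instead of the full matrix, at the cost of roughly doubling the DP work.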

The runtime issue I cannot solve as things stand: in the loop that fills the back-pointer matrix, each cell depends on its previously calculated neighbours, so it cannot be trivially parallelised. Hirschberg, again, splits each task into subtasks that don't depend on each other, which would make it much simpler to compute them in parallel.
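The independence of the subtasks is the part that buys parallelism. A generic illustration (not bleualign code; a trivial sum stands in for the alignment work on each half):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Illustrative only: once a problem is split into halves that don't depend
// on each other (as Hirschberg's two recursive calls are, given the split
// point), one half can run on a new thread via std::async while the current
// thread handles the other.
long parallel_sum(const std::vector<long> &v, size_t begin, size_t end) {
  if (end - begin < 1024)  // small enough: do it sequentially
    return std::accumulate(v.begin() + begin, v.begin() + end, 0L);
  size_t mid = begin + (end - begin) / 2;
  auto left =
      std::async(std::launch::async, parallel_sum, std::cref(v), begin, mid);
  long right = parallel_sum(v, mid, end);  // reuse the current thread
  return left.get() + right;
}
```

In the DP loop itself this doesn't apply directly, because the cell dependencies impose an ordering; it only becomes available after the divide-and-conquer restructuring.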