GlobalNamesArchitecture / damerau-levenshtein

Calculates edit distance using Damerau-Levenshtein algorithm
MIT License
141 stars 19 forks source link

Crash on extra long strings #26

Closed kcculhwch closed 1 year ago

kcculhwch commented 1 year ago

Hi, I'm not sure if this is still actively maintained, but I figured I would put a report in.

I'm using this gem to help me deduplicate close matches between long strings of text. I enjoy the performance of the library, which is why I've been using it.

Recently, I noticed that when I have two close but not exact strings that are quite long the code was either crashing for getting killed.

The crash was regularly reproducible when comparing two strings of over 50,000 character in length. Shorter lengths were getting killed earlier but not crashing (~30,000 chars).

The stack trace is pointing at line 52 in the c code when it crashes:

 d[i*sl] = i;

... wondering if the strings need a more constrained buffer approach somewhere?

I can give more details if you need.

Thanks, David

dimus commented 1 year ago

It is probably an integer overflow. Levenshtein algorithm also is geometrically less effective for long strings, so I think it makes sense to cut them to smaller substrings for comparison.

kcculhwch commented 1 year ago

That's what I suspected, for now I just put a simple .slice before I process it, and that seems to work pretty well for my current purposes.