GlobalNamesArchitecture / damerau-levenshtein

Calculates edit distance using Damerau-Levenshtein algorithm
MIT License
135 stars 19 forks source link

Error calculating damerau-levenshtein distance for large strings #13

Closed grudzien closed 6 years ago

grudzien commented 6 years ago

I was just looking at this module and I noticed that for larger strings there appears to be an issue calculating the dl distance. For example:

DamerauLevenshtein.distance("aaliyahmarie", "rvolnwkxcqefqbthwxfrskwchnmqoibgbvosrdyyuswdyqtiez") => 11

This is incorrect as shows on: http://fuzzy-string.com/Compare/Transform.aspx?r=aaliyahmarie&q=rvolnwkxcqefqbthwxfrskwchnmqoibgbvosrdyyuswdyqtiez

It's supposed to be 44.

Here are some other examples: DamerauLevenshtein.distance("aaliyahmarie", "mfhdfvbfykinhnhkhcubdjjqgjdcaazjvuijpdodsulveprmmh") => 11 (should be 45) DamerauLevenshtein.distance("aaliyahmarie", "hxswuwpeahyulhwnjxsdtopawjqbcytjkrszhfonapovxyyazr") => 11 (should be 45)

Here is an example where "aaliyahmarie" doesn't break at long strings: DamerauLevenshtein.distance("aaliyahmarie", "ukcmvhgxewuliefeinkftyjbqgvymzzzqmlzeuwjjquxvrtjic") => 44 (Correct)

Another example with a different first string: DamerauLevenshtein.distance("constantine", "qzmwsgylrorhimtpsmbesmzerbkiteyhityhkmimtvysdhuize") => 11 (Should be 44)

Not sure why all of the wrong answers are 11, maybe it is a coincidence from the string size I picked for the second string.

I was able to replicate this bug across multiple machines (Both Mac)

grudzien commented 6 years ago

After a bit of debugging it is hitting the max_distance check in the C code. I am trying to figure out why.

-- edit I tried to set max_distance manually to 15 (it is usually set to 10). When I set it to 15 all checks complete correctly. I am currently investigating why max_distance gets set to 10 and why current_distance goes to 11 on my examples, causing the code to return erroneously.

-- edit my fault. I didn't read the docs closely enough.