GlobalNamesArchitecture / damerau-levenshtein

Calculates edit distance using Damerau-Levenshtein algorithm
MIT License
144 stars 19 forks source link

Fix the case where the index of d becomes negative. #5

Closed Skarlit closed 8 years ago

Skarlit commented 8 years ago

see https://wiki.csiro.au/display/taxamatch/The+MDLD+(Modified+Damerau-Levenshtein+Distance)+Algorithm

Skarlit commented 8 years ago
if (pure_levenshtein == 0 && i >= block && j >= block && swap1 == 1 && swap2 == 1){
   transp = d[(j - block * 2) * sl + i - block * 2] + cost + block -1;
    if (transp < min) min = transp;
    block = 0;
} else if (block == 1) {
   subs = d[(j-1)*sl + i - 1] + cost;
    if (subs < min) min = subs;
}

When i = block and j = block, d[(j - block * 2) * sl + i - block * 2] = d[ - block * sl - block] in particular, when i = j = block = 1, it gives huge negative number.

dimus commented 8 years ago

Can you add test to spec/files/damerau_levenshtein_test.txt which illustrates the problem?

Skarlit commented 8 years ago

Sorry for the late reply. I am not able to come up with an example currently because the bug happened randomly but consistently when I scraped a table with 170k entries with two calls to the methods under a loop. The huge negative numbers occur randomly with different entries everytime (when you test those entries individually it works fine). We logged out the indices and they occasionally turned negative. After the fix, everything works fine. I will follow up in the future if I can reproduce the problem with small amount of data.

dimus commented 8 years ago

Thanks Ran, 'random' problems are the hardest to test, but also very important to test. If you can send me a file that generates such a problem, I will try to figure out how to make a test with smallest amount of data.

jdmoody commented 8 years ago

I am able to reproduce the issue using Ruby version 2.1.2 with the following code snippet:

require 'damerau-levenshtein'

100000.times do
  distance = DamerauLevenshtein.distance('aaaa', 'aaaa', 1, 2)
  puts "Negative Distance: #{distance}" if distance < 0
end

Oddly enough, it doesn't return negative distances in Ruby version 1.9.2.

Skarlit commented 8 years ago

The following data will have one random instance of large negative number under ruby 1.9.2. Bug.zip

dimus commented 8 years ago

Added to master and created 1.1.1 version with this change