Open · barzerman opened this issue 10 years ago
Done some optimizations locally. The cutoff in the dynamic programming algorithm speeds it up by a factor of 2 (confirmed).
A few other heuristics, such as stripping off the longest shared common prefix and suffix, speed it up by leaps and bounds.
In the practical case where the maximum acceptable Levenshtein distance is much smaller than the word length, the algorithm can be made asymptotically linear, as opposed to the generic O(n^2).
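A minimal sketch of that idea (not the code from levenshteyn.cpp; the function name and details are my own): when the cap is k, only DP cells within k of the main diagonal can hold a value <= k, so each row only needs a band of width 2k+1, giving O(n*k) instead of O(n^2). A per-row cutoff bails out as soon as every cell in the band already exceeds the cap.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Banded Levenshtein with early cutoff (illustrative sketch).
// Returns the exact distance if it is below maxDist, else maxDist.
int levenshteinBanded(const std::string& a, const std::string& b, int maxDist) {
    int m = (int)a.size(), n = (int)b.size();
    if (std::abs(m - n) > maxDist) return maxDist;  // length gap alone exceeds the cap
    const int INF = maxDist + 1;                    // sentinel for cells outside the band
    std::vector<int> prev(n + 1, INF), cur(n + 1, INF);
    for (int j = 0; j <= std::min(n, maxDist); ++j) prev[j] = j;
    for (int i = 1; i <= m; ++i) {
        int lo = std::max(1, i - maxDist), hi = std::min(n, i + maxDist);
        std::fill(cur.begin(), cur.end(), INF);
        if (lo == 1) cur[0] = i;  // first column is still inside the band
        int rowMin = cur[0];
        for (int j = lo; j <= hi; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1});
            rowMin = std::min(rowMin, cur[j]);
        }
        if (rowMin > maxDist) return maxDist;  // every band cell already too big
        std::swap(prev, cur);
    }
    return std::min(prev[n], maxDist);
}
```

Per row only O(k) cells are touched, so the whole run is O(n*k): effectively linear in n once k is a small constant like 3.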
Also, a major optimization for UTF-8 can be performed: currently the code computes the i-th character in a loop, which is O(n) for UTF-8 strings. This can be changed to O(1).
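One common way to get that O(1) (a sketch under my own assumptions, not the approach taken in levenshteyn.cpp) is to decode the UTF-8 string into a vector of code points once, up front; after that, "the i-th character" is a plain vector index instead of a scan from the start of the string.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 string into code points once (O(n)), so that later
// per-character access during the DP is O(1). Assumes valid UTF-8;
// the function name is hypothetical, not from levenshteyn.cpp.
std::vector<uint32_t> decodeUtf8(const std::string& s) {
    std::vector<uint32_t> out;
    for (size_t i = 0; i < s.size(); ) {
        unsigned char c = (unsigned char)s[i];
        uint32_t cp; int len;
        if      (c < 0x80)        { cp = c;        len = 1; }  // 1-byte (ASCII)
        else if ((c >> 5) == 0x6) { cp = c & 0x1F; len = 2; }  // 110xxxxx
        else if ((c >> 4) == 0xE) { cp = c & 0x0F; len = 3; }  // 1110xxxx
        else                      { cp = c & 0x07; len = 4; }  // 11110xxx
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | ((unsigned char)s[i + k] & 0x3F); // continuation bytes
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The decode costs O(n) once per string, which is amortized away since the DP reads each character many times.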
The final version of the template, which applies all the optimizations described above to all three classes of strings (ASCII, 2-byte and UTF-8), is coded, including the major optimization for UTF-8 comparisons; this should make UTF-8 processing speed more or less equivalent to that of the other two. Previously only ASCII had any optimizations.
The code is at eu.barzer.net:/home/yanis/fun/levenshteyn.cpp. It may not compile yet; I will finalize/test tomorrow.
Currently we use the standard Levenshtein code, which computes the edit distance between two strings. Edit distance is O(MN), where M and N are the lengths of the strings, which means that whenever the strings are far apart, Levenshtein takes a while to compute.
However, we almost always know the cap in advance: we only care to see the actual value of the distance if it's smaller than some relatively small value (in our case 4; in other words, if the strings are more than 3 apart we don't want to know the exact distance).
So we need a version of the Levenshtein algorithm with the following interface:
the function will return maxDist if the strings are maxDist or more apart,
and the true distance otherwise. This is important for both BENI and the spell checker.
@inggris - this is one of the problems where dynamic programming can be applied. @pltr -
if either of you is into algorithms and wants to do this, let me know; we can work on it together.