barzerman / barzer

barzer engine code
MIT License

implement edit distance algo which cuts off at max distance #675

Open barzerman opened 10 years ago

barzerman commented 10 years ago

currently we use standard Levenshtein code, which computes the edit distance between two strings. Edit distance is O(MN), where M and N are the lengths of the strings, which means that whenever the strings are long, Levenshtein takes a while to compute.

However, we almost always know the bound in advance: we only care about the actual value of the distance if it's smaller than some relatively small value (in our case 4; in other words, if the strings are more than 3 apart we don't need the exact distance).

So, we need a version of the Levenshtein algo which has the following interface:

`lev_dist( const char* s1, size_t s1_len, const char* s2, size_t s2_len, size_t maxDist )`

the function will return maxDist if

lev(s1, s2) >= maxDist

and the true distance otherwise. This is important for both BENI and the spell checker.
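A minimal sketch of that interface (not the repo's final code): row-by-row Levenshtein DP with an early exit. Since DP values never decrease along diagonals, once every cell in the current row is >= maxDist the final distance must also be >= maxDist, so the computation can stop early.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Levenshtein distance with a cutoff: returns the true distance when it is
// < maxDist, and maxDist otherwise.
size_t lev_dist(const char* s1, size_t s1_len,
                const char* s2, size_t s2_len, size_t maxDist)
{
    // The length difference alone bounds the distance from below.
    size_t diff = s1_len > s2_len ? s1_len - s2_len : s2_len - s1_len;
    if (diff >= maxDist)
        return maxDist;

    std::vector<size_t> prev(s2_len + 1), cur(s2_len + 1);
    for (size_t j = 0; j <= s2_len; ++j)
        prev[j] = j;

    for (size_t i = 1; i <= s1_len; ++i) {
        cur[0] = i;
        size_t rowMin = cur[0];
        for (size_t j = 1; j <= s2_len; ++j) {
            size_t cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            cur[j] = std::min({ prev[j] + 1,           // deletion
                                cur[j - 1] + 1,        // insertion
                                prev[j - 1] + cost }); // substitution
            rowMin = std::min(rowMin, cur[j]);
        }
        if (rowMin >= maxDist)  // cutoff: the distance can only grow from here
            return maxDist;
        std::swap(prev, cur);
    }
    return std::min(prev[s2_len], maxDist);
}
```

For example, `lev_dist("kitten", 6, "sitting", 7, 4)` yields the true distance 3, while two completely different 6-letter words bail out at 4.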

@inggris - this is one of the problems where dynamic programming can be applied. @pltr

if either of you is into algorithms and wants to do this, let me know - we can work on it together.

barzerman commented 10 years ago

done some optimizations locally. The cutoff in the dynamic programming algorithm speeds it up by a factor of 2 (confirmed).

A few other heuristics, such as stripping off the longest shared prefix and suffix before running the DP, speed it up by leaps and bounds.
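The affix-stripping heuristic could look something like this (a hypothetical helper, not the repo's code): shared prefixes and suffixes contribute nothing to the edit distance, so only the differing core needs the O(MN) work.

```cpp
#include <cstddef>

// Strip the longest common prefix and suffix from both strings in place
// (by adjusting the pointers and lengths).  The edit distance of the
// remaining cores equals the edit distance of the original strings.
void strip_common_affixes(const char*& s1, size_t& s1_len,
                          const char*& s2, size_t& s2_len)
{
    // Longest common prefix.
    while (s1_len && s2_len && *s1 == *s2) {
        ++s1; ++s2;
        --s1_len; --s2_len;
    }
    // Longest common suffix of what remains.
    while (s1_len && s2_len && s1[s1_len - 1] == s2[s2_len - 1]) {
        --s1_len;
        --s2_len;
    }
}
```

After this step, two words that differ only in the middle shrink to their differing cores, so the DP table is tiny regardless of the original word lengths.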

In the practical case when the maximum acceptable Levenshtein distance is much smaller than the word length, the algorithm can be made asymptotically linear, as opposed to the generic O(n^2): cells farther than maxDist from the main diagonal can never be below maxDist, so only a band of width O(maxDist) per row needs to be computed.

Also, a major optimization for utf8 can be performed: currently the code computes the i-th character in a loop, which is O(n) for utf8 strings. This can be changed to O(1).
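One way to get O(1) character access (an assumption about the fix, not necessarily what the repo does) is to decode the utf8 string into an array of code points once up front, and let the DP index that array directly:

```cpp
#include <cstddef>
#include <vector>

// Decode a UTF-8 byte string into code points once, so the DP can access
// the i-th character in O(1) instead of rescanning from the start.
// Minimal decoder: assumes well-formed input, no error handling.
std::vector<char32_t> decode_utf8(const char* s, size_t len)
{
    std::vector<char32_t> out;
    out.reserve(len);
    for (size_t i = 0; i < len; ) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        size_t n;
        if      (b < 0x80)        { cp = b;        n = 1; }  // 0xxxxxxx
        else if ((b >> 5) == 0x6) { cp = b & 0x1F; n = 2; }  // 110xxxxx
        else if ((b >> 4) == 0xE) { cp = b & 0x0F; n = 3; }  // 1110xxxx
        else                      { cp = b & 0x07; n = 4; }  // 11110xxx
        for (size_t k = 1; k < n && i + k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += n;
    }
    return out;
}
```

The decode pass is O(n) and done once, instead of an O(n) scan per character comparison inside the O(MN) loop.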

barzerman commented 10 years ago

The final version of the template, which performs all the optimizations described above for all three classes of strings (ascii, 2-byte and utf8), is coded, including the major optimization for utf8 comparisons; this should make utf8 processing speed more or less equivalent to that of the other two. Previously only ascii had any optimizations.

the code is at eu.barzer.net:/home/yanis/fun/levenshteyn.cpp (it may not compile yet; will finalize/test tomorrow)