I believe we can speed up polyleven when it uses the mbleven algorithm quite a bit. In my testing on a Rust version I created, it can lead to a 6x performance improvement in some of the slower cases by allowing for early short-circuits. It all comes down to this while loop inside the mbleven_ascii function.
while (MBLEVEN_MATRIX[pos]) {
m = MBLEVEN_MATRIX[pos++];
i = j = c = 0;
while (i < len1 && j < len2) {
if (s1[i] != s2[j]) {
c++;
if (!m) break;
if (m & 1) i++;
if (m & 2) j++;
m >>= 2;
} else {
i++;
j++;
}
}
c += (len1 - i) + (len2 - j);
r = MIN(r, c);
// add improvements here
}
At the end of the outer while loop, we can short circuit if:
r == 0: This means both words are identical. We can short circuit and return immediately. It's impossible for the edit distance to be negative, so once r == 0, we know we've reached the best case. This will always occur on the first iteration of the loop, meaning we prevent many excess iterations of the loop (e.g., if k = 3 and d = 0, where d is the difference in the length of the words, we save 6 unneeded iterations).
r == 1: If this happens, we know for a fact that the words are not identical. If they were, we'd get r == 0. Since r == 1 is the smallest number of possible edits when the words are not equal, we can short-circuit early and return 1 immediately.
Hence, you can just add
if (r < 2) {
return r;
}
This should apply to both the ASCII and strbuf cases.
I believe we can speed up polyleven when it uses the mbleven algorithm quite a bit. In my testing on a Rust version I created, it can lead to a 6x performance improvement in some of the slower cases by allowing for early short-circuits. It all comes down to this
while
loop inside thembleven_ascii
function.At the end of the outer
while
loop, we can short circuit if:r == 0
: This means both words are identical. We can short circuit and return immediately. It's impossible for the edit distance to be negative, so oncer == 0
, we know we've reached the best case. This will always occur on the first iteration of the loop, meaning we prevent many excess iterations of the loop (e.g., ifk = 3
andd = 0
, where d is the difference in the length of the words, we save 6 unneeded iterations).r == 1
: If this happens, we know for a fact that the words are not identical. If they were, we'd getr == 0
. Sincer == 1
is the smallest number of possible edits when the words are not equal, we can short-circuit early andreturn 1
immediately.Hence, you can just add
This should apply to both the ASCII and
strbuf
cases.