Potential performance improvement

Nick-Mazuk commented 3 years ago

I believe we can speed up polyleven when it uses the mbleven algorithm quite a bit. In my testing on a Rust version I created, it can lead to a 6x performance improvement in some of the slower cases by allowing for early short-circuits. It all comes down to this while loop inside the mbleven_ascii function.

 while (MBLEVEN_MATRIX[pos]) {
        m = MBLEVEN_MATRIX[pos++];
        i = j = c = 0;
        while (i < len1 && j < len2) {
            if (s1[i] != s2[j]) {
                c++;
                if (!m) break;
                if (m & 1) i++;
                if (m & 2) j++;
                m >>= 2;
            } else {
                i++;
                j++;
            }
        }
        c += (len1 - i) + (len2 - j);
        r = MIN(r, c);
        // add improvements here
    }

At the end of the outer while loop, we can short circuit if:

r == 0: This means both words are identical. We can short circuit and return immediately. It's impossible for the edit distance to be negative, so once r == 0, we know we've reached the best case. This will always occur on the first iteration of the loop, meaning we prevent many excess iterations of the loop (e.g., if k = 3 and d = 0, where d is the difference in the length of the words, we save 6 unneeded iterations).
r == 1: If this happens, we know for a fact that the words are not identical. If they were, we'd get r == 0. Since r == 1 is the smallest number of possible edits when the words are not equal, we can short-circuit early and return 1 immediately.

Hence, you can just add

if (r < 2) {
    return r;
}

This should apply to both the ASCII and strbuf cases.

fujimotos commented 2 years ago

@Nick-Mazuk That's great. Can you post a PR so that I can merge your contribution?

Once done, I will add you as an author to the LICENSE file:

https://github.com/fujimotos/polyleven/blob/master/LICENSE

Nick-Mazuk commented 2 years ago

@fujimotos, PR created.

fujimotos commented 2 years ago

This issue has been merged in master.

fujimotos / polyleven

Potential performance improvement #4