fujimotos / polyleven

Fast Levenshtein Distance Library for Python 3
https://ceptord.net
MIT License

Optimize Myers1999 for large dictionary filtering #6

Open fujimotos opened 2 years ago

fujimotos commented 2 years ago

This is a sketch of an improvement idea I'm thinking of incorporating into polyleven.

How is polyleven used in real life? (see the sketch after these questions)

What is the optimization idea?

What does the new interface look like?

How much improvement will we see?
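As a rough illustration of the first question (a minimal sketch, not taken from any real benchmark; the query, word source and threshold are made up), filtering a large dictionary today means one Python-level call per entry:

from polyleven import levenshtein

query = "levenshtein"        # made-up query string
dictionary = load_words()    # hypothetical helper returning a large list of strings
k = 2                        # maximum edit distance of interest

# One Python -> C call per dictionary entry. The upper bound k lets
# polyleven stop early for distant pairs, but the per-call overhead remains.
matches = [w for w in dictionary if levenshtein(query, w, k) <= k]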

fujimotos commented 2 years ago

Here is a proof-of-concept of this idea:

In short, I was able to achieve a throughput of 14.2 million entries per second on my desktop machine with a Ryzen 7 4700GE.

Now I need to find a way to port this improvement to polyleven...
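To sketch the core of the idea in plain Python (only an illustration, not the actual proof-of-concept; the function names and example data below are made up, and it assumes the query is at most 64 characters): the Myers1999 bit-pattern table is built once for the query and then reused for every dictionary entry.

from collections import defaultdict

MASK64 = (1 << 64) - 1

def build_peq(query):
    # Myers1999 bit-pattern table: bit i of peq[ch] is set iff query[i] == ch.
    assert 0 < len(query) <= 64
    peq = defaultdict(int)
    for i, ch in enumerate(query):
        peq[ch] |= 1 << i
    return peq

def myers_distance(peq, m, word):
    # Levenshtein distance between the precomputed query (length m) and word,
    # emulating 64-bit unsigned arithmetic with MASK64.
    last = 1 << (m - 1)
    pv, mv, score = MASK64, 0, m
    for ch in word:
        eq = peq[ch]
        xv = eq | mv
        xh = ((((eq & pv) + pv) & MASK64) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & MASK64
        mh = pv & xh
        if ph & last:
            score += 1
        if mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & MASK64
        mh = (mh << 1) & MASK64
        pv = (mh | ~(xv | ph)) & MASK64
        mv = ph & xv
    return score

# The table is built once per query and shared across all dictionary entries.
query = "kitten"
words = ["sitting", "mitten", "kitchen", "smitten"]
peq, m = build_peq(query), len(query)
close = [w for w in words if myers_distance(peq, m, w) <= 2]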

maxbachmann commented 2 years ago

As a reference, here are the improvements I implemented in rapidfuzz:

1) I implemented a bit more specialized hashmap: https://github.com/maxbachmann/rapidfuzz-cpp/blob/main/rapidfuzz/details/PatternMatchVector.hpp

This implementation has the following advantages over the one in polyleven:

a) for extended ASCII, which is likely a large part of the input, a lookup comes down to a direct array access
b) for long ASCII texts this allows sequential memory access when implementing the blockwise Levenshtein distance as described in 2)
c) for non-ASCII text I use a bit more advanced hash collision strategy, which is not going to degrade on input like "abcdefg" + chr(ord("a") + 128)
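Roughly, the layout can be pictured like this (a Python sketch of the idea only, not the actual C++ PatternMatchVector; the class name and fields are made up): extended-ASCII code points index directly into a flat per-block array, everything else falls back to a small dict. Because one character's masks sit next to each other across blocks, the inner loop over blocks described in 2) walks memory sequentially.

class BlockPatternMatchVector:
    # Per-character bit patterns for a long string s1, split into
    # 64-character blocks, stored character-major.
    def __init__(self, s1):
        self.block_count = (len(s1) + 63) // 64
        # 256 * block_count masks: an extended-ASCII code point is a direct
        # index, and its masks for consecutive blocks are adjacent in memory.
        self.ascii_masks = [0] * (256 * self.block_count)
        self.other = {}  # non-ASCII code point -> list of per-block masks
        for i, ch in enumerate(s1):
            block, bit = divmod(i, 64)
            cp = ord(ch)
            if cp < 256:
                self.ascii_masks[cp * self.block_count + block] |= 1 << bit
            else:
                row = self.other.setdefault(cp, [0] * self.block_count)
                row[block] |= 1 << bit

    def get(self, block, ch):
        cp = ord(ch)
        if cp < 256:
            return self.ascii_masks[cp * self.block_count + block]
        row = self.other.get(cp)
        return row[block] if row else 0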

2) I have an inverted blockwise implementation compared to yours. You appear to iterate:

for block in blocks:
    for ch in s2:

while I use

for ch in s2:
    for block in blocks:

As described in 1), for ASCII this comes down to sequential memory accesses:

for ch in s2:
    uint64_t* block_ptr = PM.get(0, ch)   # masks for ch are contiguous across blocks
    for block in blocks:
        ...
        block_ptr++

The improvements from 1) and 2) lead to the following performance difference between rapidfuzz and polyleven:

[benchmark chart: uniform_levenshtein]

3) I provide APIs similar to the ones you describe here: rapidfuzz.process.extract / rapidfuzz.process.extractOne / rapidfuzz.process.cdist. In Python these have the big advantage that it is no longer required to call into Python for every pair, which saves a large percentage of the runtime for short strings.
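For example (a sketch only; the word list and cutoff are made up, and the exact keyword arguments may differ between rapidfuzz versions), a whole dictionary scan becomes a single call into the extension module:

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

words = ["sitting", "kitten", "mitten", "knitting", "smitten"]

# Single call instead of one Python call per word.
best = process.extractOne("kitten", words, scorer=Levenshtein.distance)

# All entries within edit distance 2, again without per-entry Python overhead.
close = process.extract("kitten", words, scorer=Levenshtein.distance,
                        score_cutoff=2, limit=None)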

4) As you noted, this allows you to store the precomputed bit pattern table. For short strings this matters a lot, since building it can take upwards of 50% of the computation time. For longer strings, however, this becomes irrelevant.

For short texts (10 characters) the improvements from 3) + 4) lead to a performance improvement of up to 10x depending on the used similarity metric.

5) You proposed SIMD for large texts in #7. However, SIMD is especially helpful for short strings: e.g. AVX2 allows you to calculate the similarity for 32 eight-character texts in parallel. I did not have the time to implement this for the Levenshtein distance yet, but I implemented it for the Indel distance (only insertions + deletions). Here is the PR with performance comparisons for C++: https://github.com/maxbachmann/rapidfuzz-cpp/pull/81. For eight-character texts this has the following performance on a laptop CPU from 2017 (i7-8550U):

directly calculating the similarity: 27 * 10^6 text comparisons per second
caching the bitmap: 77 * 10^6 text comparisons per second
caching the bitmap + SSE2: 1.25 * 10^9 text comparisons per second
caching the bitmap + AVX2: 2.8 * 10^9 text comparisons per second

I hope this helps :) In addition, in case you do not have an issue with C++, we might want to combine forces on this, since we have pretty similar implementations and I have already implemented a few of the features you would like to incorporate.