Optimize Myers1999 for large dictionary filtering

fujimotos commented 2 years ago

This is a sketch of an improvement idea I'm thinking to incorporate into polyleven.

How is polyleven used in real life?

The typical usecase of polyleven is "finding a needle in a haystack"-type tasks.
e.g. "I want to find out similar DNA sequences to this input in the 1 billion database"
So the task is essentially a filtering/search problem.

What is the optimization idea?

As the original paper described, Myers1999 requires a pre-computation of a bit-pattern table.
This precomputation is actually very expensive -- I think it takes at least 30% of the total computation budget.
The viable optimization path here is to cache the bit-pattern table.

What does the new interface look like?

I think the a map function fits very well here:

>>> polyleven.map("abcde", ["abc", "acde", ...])

How much improvement will wel see?

The best case is that we'll see 30-40% speed-up with this optimization.
IOW Polyleven will be able to process ~7 million entries per sec on a consumer-grade CPUs (such as AMD Ryzen 7).

fujimotos commented 2 years ago

Here is a proof-of-concept of this idea:

https://ceptord.net/20220515-dna100m.html

To put it short, I could archived the throughput of 14.2 million entries per seconds on my desktop machine with Ryzen 7 4700GE.

Now I need to find a way to port this impvoment to Polyleven...

maxbachmann commented 2 years ago

As a reference here are the improvements I implemented in rapidfuzz: 1) I implemented a bit more specialized hashmap: https://github.com/maxbachmann/rapidfuzz-cpp/blob/main/rapidfuzz/details/PatternMatchVector.hpp This implementation has the following advantages over the one in polyleven: a) for extended ascii, which is likely a large part of the input it comes down to a direct array access b) for long texts of ascii this allows sequential memory access when implementing the blockwise levenshtein distance as described in 2) c) for non ascii text I use a bit more advanced hash collision strategy, which is not going to burn down on input like "abcdefg" + chr(ord("a") + 128)

1) I have an inverted blockwise implementation to your. You appear to iterate:

for block in blocks:
 for ch in s2:

while I use

for ch in s2:
   for block in blocks:

as described in 1) for ascii this comes down to sequential memory accesses:

for ch in s2:
   uint64_t* block_ptr = PM.get(0, ch)
   for block in blocks:
        ...
        block_ptr++

The improvements from 1) and 2) lead to the following performance difference between rapidfuzz and polyleven: uniform_levenshtein

3) I provide a apis similar to the ones you describe here: rapidfuzz.process.exract/rapidfuzz.process.extractOne/rapidfuzz.process.cdist. In Python these provide the large advantage, that it is no longer required to call into python all the time, which saves a large percentage of the runtime for short strings.

4) as you noted this allows you to store the precomputed bit pattern table. For short strings this matters a lot, since it can take upwards 50% of the computation time. For longer strings however this becomes irrelevant.

For short texts (10 characters) the improvements from 3) + 4) lead to a performance improvement of up to 10x depending on the used similarity metric.

5) you proposed simd for large texts in #7. However SIMD is especially helpful for short strings. E.g. AVX2 allows you to calculate the similarity for 32 8 character texts in parallel. I did not have the time to implement this for Levenshtein yet, but I implemented this for the Indel distance (only insertions + deletions). Here is the pr https://github.com/maxbachmann/rapidfuzz-cpp/pull/81 with performance comparisions for c++. E.g. for 8 character texts this has the following performance on a laptop cpu from 2017 (i7 8550U): directly calculating similarity: 27 10^6 text comparisions per second caching the bitmap: 77 10^6 text comparisions per second caching the bitmap + SSE2: 1.25 10^9 text comparisions per second caching the bitmap + AVX2: 2.8 10^9 text comparisions per second

I hope this helps :) I addition in case you do not have an issue with C++ we might want to combine forces on this, since we have pretty similar implementations and I already implemented a few of the features you would like to incorporate.

fujimotos / polyleven

Optimize Myers1999 for large dictionary filtering #6