fujimotos / polyleven

Fast Levenshtein Distance Library for Python 3
https://ceptord.net
MIT License

Can we vectorize myers1999? #7

Open · fujimotos opened this issue 2 years ago

fujimotos commented 2 years ago

This is a sketch of an idea on how to speed up Myers1999 further.

Observation

Question

References

maxbachmann commented 2 years ago

This is possible in theory, but there are a couple of instructions which will hurt: 1) bit shifts:

HN = (HN << 1);

needs to be implemented somewhat like this:

HN = (HN << 1) | ((HN << 64) >> 63);

since SIMD shifts do not carry across the 64-bit lane boundaries. I think it is even more of a struggle for AVX2, which has two separate 128-bit lanes.
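
For a bitvector wider than 64 bits, the cross-lane carry of the shift can be emulated with a whole-register lane shift plus an OR. A minimal SSE2 sketch for the 128-bit case (my own illustration, not code from polyleven or rapidfuzz):

#include <emmintrin.h> /* SSE2 */

/* Shift a 128-bit value (two 64-bit lanes) left by one bit, carrying
 * bit 63 of the low lane into bit 0 of the high lane. */
static inline __m128i shift_left1_128(__m128i v)
{
    __m128i shifted = _mm_slli_epi64(v, 1);                     /* per-lane << 1, drops the cross-lane bit */
    __m128i carry   = _mm_srli_epi64(_mm_slli_si128(v, 8), 63); /* low lane's bit 63 -> high lane's bit 0 */
    return _mm_or_si128(shifted, carry);
}

With AVX2 the same trick needs an extra cross-lane move (e.g. a 64-bit permute), because the byte shifts only operate inside each 128-bit lane, which is the struggle mentioned above.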

2) additions will not propagate the carry bit across the 64-bit lane boundaries:

uint64_t D0 = (((X & VP) + VP) ^ VP) | X | VN;

The problem is that you still have to handle the carry bit for each 64-bit part manually.
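
The carry handling that a wide addition needs is easiest to see in a scalar sketch of a 128-bit add built from two 64-bit lanes (a hypothetical illustration; SIMD offers no instruction that propagates this carry across lanes for you):

#include <stdint.h>

/* Add two 128-bit values stored as two uint64_t lanes each (index 0 = low lane).
 * The carry out of the low lane must be detected and added by hand, which is
 * exactly what a vectorized (X & VP) + VP would have to emulate per lane. */
static inline void add128(uint64_t a[2], const uint64_t b[2])
{
    uint64_t low   = a[0] + b[0];
    uint64_t carry = (low < a[0]); /* unsigned wrap-around => carry out of the low lane */
    a[0] = low;
    a[1] = a[1] + b[1] + carry;
}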

So while this might still be faster than the serial implementation, it is likely not as much of a speedup as you hope for. SIMD is mostly helpful for calculating multiple separate results in parallel, as described in #6.

One different optimization for long strings is a combination of Myers/Hyyrö with Ukkonen. This allows you to calculate the bitvectors only for a band around the diagonal of the Levenshtein matrix. So far I have only implemented this for result bitvectors <= 64, since that was the simplest case. However, this can be extended to longer bitvectors. The implementation can be found here: https://github.com/maxbachmann/rapidfuzz-cpp/blob/8f0b180658663c8dc24bdca309e5ba886b5c7ede/rapidfuzz/distance/Levenshtein.impl#L216

from polyleven import levenshtein
from rapidfuzz.distance import Levenshtein
a = "b" + "aaaaaa"*100 + "b"
b = "a" + "aaaaaa"*100 + "a"

levenshtein(a, b, 31) # 1.8 us
Levenshtein.distance(a, b) # 1 us
Levenshtein.distance(a, b, score_cutoff=31) # 0.2 us

Note that even without a score_cutoff you can save up to 25% (for strings of similar length), simply because max(len1, len2) can be used as the score_cutoff.

fujimotos commented 2 years ago

This is possible in theory, but there are a couple of instructions which will hurt:

When I wrote the OP a few months ago, I was mainly thinking about the recent papers that ported Myers' algorithm to GPUs.

Here are some examples:

Generally speaking, their performance improvements are impressive; for example, Lagner & Weese achieved a >2x improvement using CUDA (p. 73). I was thinking, "Good! Couldn't this be ported to the CPU/SIMD world?"

Now, these papers are stacked on my desk, in the "things to study further when I have more time" box. You're probably right that it won't work (yes, the carry-bit part does seem nasty), but it would be really nice if there were some good use for SIMD, since virtually everyone has it nowadays.

maxbachmann commented 2 years ago

but it would be really nice if there were some good use for SIMD, since virtually everyone has it nowadays.

For shorter sequences I use SIMD to match multiple sequences against a single sequence. This works pretty well for sequences with a length <= 64, since you do not need to take carry bits into account. So e.g. with AVX2 you can match 32 strings of up to 8 characters in parallel. The only thing that needs to change slightly is the bit counter:

currDist += bool(HP & mask);
currDist -= bool(HN & mask);

which would need to be changed to something along the lines of

currDist += ((HP & mask) != 0) & 1;
currDist -= ((HN & mask) != 0) & 1;
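
A rough AVX2 sketch of that lane-wise counter update, assuming 32 sequences packed into the 8-bit lanes of a __m256i (my own illustration; the function name update_dist is hypothetical and this is not rapidfuzz's actual code):

#include <immintrin.h> /* AVX2 */

/* currDist, HP, HN and mask each hold one 8-bit lane per sequence. */
static inline __m256i update_dist(__m256i currDist, __m256i HP, __m256i HN, __m256i mask)
{
    __m256i zero = _mm256_setzero_si256();
    __m256i ones = _mm256_set1_epi8(-1);

    /* 0xFF (= -1) in every lane where (HP & mask) != 0, otherwise 0x00 */
    __m256i hp_nonzero = _mm256_xor_si256(_mm256_cmpeq_epi8(_mm256_and_si256(HP, mask), zero), ones);
    __m256i hn_nonzero = _mm256_xor_si256(_mm256_cmpeq_epi8(_mm256_and_si256(HN, mask), zero), ones);

    /* subtracting -1 adds 1, adding -1 subtracts 1 */
    currDist = _mm256_sub_epi8(currDist, hp_nonzero); /* += ((HP & mask) != 0) */
    currDist = _mm256_add_epi8(currDist, hn_nonzero); /* -= ((HN & mask) != 0) */
    return currDist;
}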

You could probably apply this approach to longer sequences as well, matching 4 longer sequences in parallel similar to the regular approach. However, this makes counting the distance more complicated, since the bit to test will not be in the same block for every sequence. Thinking about it, the following should be a good solution (it might need some more handling in case the lengths vary to a great extent). Generally it should try to match strings of mostly equal length together, so it does not waste space in the bitvectors.

vector<avx2_vector> mask_vec; // one result-bit mask per block that contains a result bit
index block = 0;

// blocks before the first result bit: only run the Myers core, nothing to count yet
for (; block < first_result_block; block++)
    myers()

// remaining blocks: run the Myers core and update the distance for the sequences
// whose result bit lies in this block
for (; block < blocks; block++)
    myers()
    currDist += ((HP & mask_vec[block - first_result_block]) != 0) & 1;
    currDist -= ((HN & mask_vec[block - first_result_block]) != 0) & 1;

When matching multiple longer sequences, this should give around a 4x speedup.
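
To make the "result bit in a different block per sequence" point concrete, here is a hypothetical sketch of how the per-block masks (mask_vec above) could be built for 4 sequences packed into the 64-bit lanes of an AVX2 vector; the function name result_mask_for_block and the layout are my assumptions, not rapidfuzz's actual code:

#include <immintrin.h>
#include <stdint.h>

/* For sequence i of length len[i], the distance bit lives in block
 * (len[i] - 1) / 64 at bit position (len[i] - 1) % 64, so each block
 * needs its own mask vector. */
static __m256i result_mask_for_block(const uint64_t len[4], uint64_t block)
{
    alignas(32) uint64_t lanes[4];
    for (int i = 0; i < 4; i++) {
        uint64_t result_block = (len[i] - 1) / 64;
        lanes[i] = (result_block == block) ? (uint64_t)1 << ((len[i] - 1) % 64) : 0;
    }
    return _mm256_load_si256((const __m256i *)lanes);
}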