RagnarGrootKoerkamp opened 1 year ago
I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed, similar to this PR, I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time :+1:

After this PR they have pretty much the same performance. E.g. for completely different strings of length 1M I get:
- edlib_old: 21.7s
- edlib_new: 16.6s
- rapidfuzz: 16.9s
The implementation for small sequences in edlib is far from optimal anyway, since in this case all the accounting for the band size becomes way too expensive.
Ok this will be great then!
Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use a different algorithm (like Landau-Vishkin) for smaller sequences. The idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but I haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does, but that seems like the most obvious way to go to me.
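For readers unfamiliar with it, here is a minimal sketch of the Landau-Vishkin idea (furthest-reaching points per diagonal, with an edit threshold k); this is illustrative only, not Edlib code:

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Illustrative Landau-Vishkin sketch (not Edlib code): returns the
// edit distance between a and b if it is <= k, otherwise -1. For each
// edit count e, cur[d] holds the furthest row i reachable on diagonal
// d = column - row; runs of matching characters are consumed for free.
// With this naive match extension it runs in O(k * (|a| + |b|)).
int landau_vishkin(const std::string& a, const std::string& b, int k) {
    const int n = (int)a.size(), m = (int)b.size();
    if (std::abs(m - n) > k) return -1;
    const int NEG = -(1 << 29);                   // "diagonal not reached"
    std::vector<int> cur(2 * k + 3, NEG), prev(2 * k + 3, NEG);
    auto off = [&](int d) { return d + k + 1; };  // diagonal -> array index
    auto extend = [&](int i, int d) {             // follow matching chars
        while (i < n && i + d < m && a[i] == b[i + d]) ++i;
        return i;
    };
    cur[off(0)] = extend(0, 0);
    if (cur[off(0)] == n && n == m) return 0;     // identical strings
    for (int e = 1; e <= k; ++e) {
        std::swap(prev, cur);
        for (int d = -e; d <= e; ++d) {
            int i = std::max({prev[off(d)] + 1,       // substitution
                              prev[off(d + 1)] + 1,   // deletion from a
                              prev[off(d - 1)]});     // insertion into a
            if (i < 0) { cur[off(d)] = NEG; continue; }
            i = std::min({i, n, m - d});              // stay in the matrix
            cur[off(d)] = extend(i, d);
        }
        if (cur[off(m - n)] >= n) return e;           // reached cell (n, m)
    }
    return -1;                                        // distance exceeds k
}
```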
I personally use a couple of implementations for the standard Levenshtein distance:

- if the allowed number of edits is <= 3, use a brute-force approach testing all possible edit combinations
- if the sequence length is < 64, use Hyyrö's/Myers' algorithm, but without blocks (see the sketch after this list)
- if the band size is < 64, use an implementation that calculates the bit vectors on the fly
- if both the length and the band size are larger, use the blockwise implementation
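The single-word (no blocks) variant mentioned above is roughly the following. This is an illustrative sketch of Myers' algorithm in Hyyrö's formulation, not the actual RapidFuzz code:

```cpp
#include <cstdint>
#include <string>

// Bit-parallel edit distance for pattern length 1..64 (Myers 1999, in
// Hyyrö's formulation). The whole DP column fits in one pair of words
// (Pv/Mv hold the +1/-1 vertical deltas), so each text character costs
// only a handful of word operations and no block bookkeeping.
int myers_distance_64(const std::string& pattern, const std::string& text) {
    const int m = (int)pattern.size();       // must be 1..64
    uint64_t PEq[256] = {0};                 // match bitmask per character
    for (int i = 0; i < m; ++i)
        PEq[(unsigned char)pattern[i]] |= 1ULL << i;
    uint64_t Pv = ~0ULL, Mv = 0;
    const uint64_t last = 1ULL << (m - 1);   // bit of the last pattern row
    int score = m;
    for (unsigned char c : text) {
        uint64_t Eq = PEq[c];
        uint64_t Xv = Eq | Mv;
        uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
        uint64_t Ph = Mv | ~(Xh | Pv);
        uint64_t Mh = Pv & Xh;
        if (Ph & last) ++score;              // horizontal delta in last row
        if (Mh & last) --score;
        Ph = (Ph << 1) | 1;                  // carry-in at row 0 is +1
        Mh <<= 1;
        Pv = Mh | ~(Xv | Ph);
        Mv = Ph & Xv;
    }
    return score;                            // edit distance
}
```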
I provide a way to cache the bit vectors. While the bit vectors are relatively cheap to create for long sequences, caching them can mean a significant speedup when comparing many shorter sequences.
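The caching could look roughly like this, building on the sketch above; the `CachedPattern` name is hypothetical and this is not the RapidFuzz API. The `PEq` table is built once per pattern, so each comparison pays only the O(|text|) inner loop:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of caching: precompute the 256-entry PEq match
// table once per pattern, then reuse it for every text comparison.
struct CachedPattern {
    uint64_t PEq[256] = {0};
    int m;
    explicit CachedPattern(const std::string& p) : m((int)p.size()) {
        for (int i = 0; i < m; ++i)
            PEq[(unsigned char)p[i]] |= 1ULL << i;   // built once, cached
    }
    int distance(const std::string& text) const {    // reuses cached PEq
        uint64_t Pv = ~0ULL, Mv = 0;
        const uint64_t last = 1ULL << (m - 1);
        int score = m;
        for (unsigned char c : text) {
            uint64_t Eq = PEq[c], Xv = Eq | Mv;
            uint64_t Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;
            uint64_t Ph = Mv | ~(Xh | Pv), Mh = Pv & Xh;
            score += (Ph & last) ? 1 : (Mh & last) ? -1 : 0;
            Ph = (Ph << 1) | 1; Mh <<= 1;
            Pv = Mh | ~(Xv | Ph); Mv = Ph & Xv;
        }
        return score;
    }
};
```

Usage would be `CachedPattern p("pattern"); p.distance(text1); p.distance(text2);`, amortizing the table construction over all comparisons.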
For short sequences (< 64 characters) I additionally provide a SIMD implementation using SSE2 and AVX2 to compare multiple sequences in parallel.

All of this is done inside the library, as long as the user calls it in an appropriate way. E.g. I can only use the SIMD implementation if the user calls the library with multiple strings.
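A hedged sketch of how such multi-sequence parallelism can work, assuming SSE2 and two 64-bit lanes; the function name and batching are illustrative only, and a real implementation (or an AVX2 path with four lanes) would differ:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>
#include <string>

// Illustrative sketch, not the RapidFuzz code: run the 64-bit Myers
// recurrence for two independent pattern/text pairs at once, one pair
// per 64-bit lane of an SSE2 register. Assumes pattern lengths are
// 1..64 and both texts have the same length (real code would pad or
// batch inputs by length).
void myers_distance_x2(const std::string pat[2], const std::string txt[2],
                       int out[2]) {
    uint64_t PEq[2][256] = {};
    uint64_t hibit[2];
    for (int l = 0; l < 2; ++l) {
        for (size_t i = 0; i < pat[l].size(); ++i)
            PEq[l][(unsigned char)pat[l][i]] |= 1ULL << i;
        hibit[l] = 1ULL << (pat[l].size() - 1);
        out[l] = (int)pat[l].size();
    }
    const __m128i all1 = _mm_set1_epi64x(-1), one = _mm_set1_epi64x(1);
    auto bnot = [&](__m128i x) { return _mm_xor_si128(x, all1); };
    auto lane = [](__m128i v, int l) {           // extract one 64-bit lane
        return l ? (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(v, 8))
                 : (uint64_t)_mm_cvtsi128_si64(v);
    };
    __m128i Pv = all1, Mv = _mm_setzero_si128();
    for (size_t j = 0; j < txt[0].size(); ++j) {
        // Gather this column's match masks, one per lane.
        __m128i Eq = _mm_set_epi64x((long long)PEq[1][(unsigned char)txt[1][j]],
                                    (long long)PEq[0][(unsigned char)txt[0][j]]);
        __m128i Xv = _mm_or_si128(Eq, Mv);
        // (Eq & Pv) + Pv is two independent 64-bit additions, one per lane.
        __m128i Xh = _mm_or_si128(
            _mm_xor_si128(_mm_add_epi64(_mm_and_si128(Eq, Pv), Pv), Pv), Eq);
        __m128i Ph = _mm_or_si128(Mv, bnot(_mm_or_si128(Xh, Pv)));
        __m128i Mh = _mm_and_si128(Pv, Xh);
        for (int l = 0; l < 2; ++l)              // per-lane score update
            out[l] += ((lane(Ph, l) & hibit[l]) != 0)
                    - ((lane(Mh, l) & hibit[l]) != 0);
        Ph = _mm_or_si128(_mm_slli_epi64(Ph, 1), one);  // +1 carry-in, row 0
        Mh = _mm_slli_epi64(Mh, 1);
        Pv = _mm_or_si128(Mh, bnot(_mm_or_si128(Xv, Ph)));
        Mv = _mm_and_si128(Ph, Xv);
    }
}
```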
This encodes horizontal input/output deltas using `Phin` and `Mhin` indicator words with the lowest bit set if `hin` is `+1` or `-1` respectively. This gives up to 20% speed gains on large inputs, where the bit-packed block computation takes relatively more time than the band-size accounting. Some small alignments get up to 2% slower, but I suspect this is just noise.

![image](https://user-images.githubusercontent.com/6233682/230492335-afa6ff0d-b4e2-44aa-9e7e-78851c779070.png)
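As described, the core of the change amounts to something like the following sketch (simplified names in the style of edlib's `calculateBlock`; not the actual diff):

```cpp
#include <cstdint>

// Hedged sketch of the delta encoding described above: instead of
// passing the horizontal input delta hin as an int in {-1, 0, +1}
// (which must be split with branches inside the hot loop), it is kept
// as two indicator words Phin/Mhin whose lowest bit is set if hin is
// +1 or -1 respectively, so it can be OR-ed straight into the
// bit-parallel recurrence.
struct Block { uint64_t P, M; };   // +1 / -1 vertical delta bit vectors

// Advances one 64-row block by one text column (Myers/Hyyrö style).
// Phin/Mhin come in as {0,1} and leave as the block's output deltas.
static inline void advanceBlock(Block& b, uint64_t Eq,
                                uint64_t& Phin, uint64_t& Mhin) {
    uint64_t Xv = Eq | b.M;
    Eq |= Mhin;                        // a -1 carry-in acts like a match
    uint64_t Xh = (((Eq & b.P) + b.P) ^ b.P) | Eq;
    uint64_t Ph = b.M | ~(Xh | b.P);
    uint64_t Mh = b.P & Xh;
    uint64_t PhOut = Ph >> 63;         // deltas leaving the last row,
    uint64_t MhOut = Mh >> 63;         // already in indicator form
    Ph = (Ph << 1) | Phin;             // branch-free carry-in at row 0
    Mh = (Mh << 1) | Mhin;
    b.P = Mh | ~(Xv | Ph);
    b.M = Ph & Xv;
    Phin = PhOut;                      // hand the deltas to the next block
    Mhin = MhOut;
}
```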