Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
492 stars 162 forks source link

Encode horizontal deltas using +1/-1 indicators for up to 20% speed gain #214

Open RagnarGrootKoerkamp opened 1 year ago

RagnarGrootKoerkamp commented 1 year ago

This encodes horizontal input/output deltas using Phin and Mhin indicator words with the lowest bit set if hin is +1 or -1 respectively. This gives up to 20% speed gains on large inputs where the bitpacking block computation takes relatively longer than the accounting for band-size.

Some small alignments get up to 2% slower but I suspect this is just noise. image

maxbachmann commented 1 year ago

I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed similar to this PR I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time :+1:

After this PR they are pretty much have the same performance. E.g. for completely different strings with length 1m I get:

The implementation for small sequences in edlib is far from optimal anyways, since in this case the whole accounting for the band size becomes way to expensive.

Martinsos commented 1 year ago

I was wondering why my own implementation (https://github.com/maxbachmann/RapidFuzz) was faster for very dissimilar sequences at some point. And indeed similar to this PR I store the deltas separately, so I do not have to split them. I was unaware this saved me so much time +1

After this PR they are pretty much have the same performance. E.g. for completely different strings with length 1m I get:

  • edlib_old: 21.7s
  • edlib_new: 16.6s
  • rapidfuzz: 16.9s

The implementation for small sequences in edlib is far from optimal anyways, since in this case the whole accounting for the band size becomes way to expensive.

Ok this will be great then!

Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use different algorithm (like LandauVishkin) for smaller sequences. Idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does. But that seems like the most obvious way to go to me.

maxbachmann commented 1 year ago

Yup, for smaller sequences the housekeeping becomes too much. I think the best approach is to use different algorithm (like LandauVishkin) for smaller sequences. Idea was to put this in Edlib and have Edlib choose the algorithm based on the sequence length, but haven't implemented that yet. It is a bit more work because it needs to support everything the current algorithm does. But that seems like the most obvious way to go to me.

I personally use a couple implementations for the standard Levenshtein distance:

All of this is done inside the library, as long as the user calls it in an appropriate way. E.g. I can only use the simd implementation if the user calls the library with multiple strings.