Turnerj / Quickenshtein

Making the quickest and most memory efficient implementation of Levenshtein Distance with SIMD and Threading support
MIT License

Improving the Diagonal Calculation #21

Open Turnerj opened 4 years ago

Turnerj commented 4 years ago

A few points:

Preliminary performance numbers for the single-threaded improvement (take these with a grain of salt - none of these variants calculate the distance correctly yet):

No target lookups and no shuffles/permutations aka. "Process Column" (at best - we would still need one of these calls per vector but we would avoid a lot of duplicates):

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| Quickenshtein | 23.67 ms | 0.125 ms | 0.117 ms | - | - | - | - |

No source lookups aka. "Process Row" (at best - we would still need one of these calls per vector but we would avoid a lot of duplicates):

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| Quickenshtein | 24.50 ms | 0.143 ms | 0.134 ms | - | - | - | 40 B |

No source or target lookups (INVALID):

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| Quickenshtein | 21.20 ms | 0.138 ms | 0.130 ms | - | - | - | 40 B |

For comparison, it currently takes about 26.8 ms with the regular diagonal calculation. This means that, at best, single threaded could see roughly a 12% improvement, though it is likely to be closer to 8% for the "correct" solution.
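
(As a rough sanity check against the 26.8 ms baseline: (26.8 − 23.67) / 26.8 ≈ 11.7% for the best case above, and (26.8 − 24.50) / 26.8 ≈ 8.6% for the "Process Row" variant.)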

No idea about how the threading code would perform though.

RobSchoenaker commented 3 years ago

Not sure if I am talking rubbish, but you may want to consider the single-row version: https://rosettacode.org/wiki/Levenshtein_distance#C.2B.2B - it's discussed here: https://www.youtube.com/watch?v=Cu7Tl7FGigQ
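
For reference, the single-row version being referred to keeps just one row of the matrix plus a running "diagonal" value. A minimal C# sketch of that idea (an illustration only, not Quickenshtein's actual implementation) looks roughly like this:

```csharp
using System;

public static class LevenshteinSketch
{
	// Minimal single-row Levenshtein sketch (illustration only).
	// Keeps one row of the DP matrix plus a single "diagonal" temporary.
	public static int SingleRow(string source, string target)
	{
		var row = new int[target.Length + 1];
		for (var j = 0; j <= target.Length; j++)
			row[j] = j;

		for (var i = 1; i <= source.Length; i++)
		{
			var previousDiagonal = row[0]; // D[i - 1, 0]
			row[0] = i;

			for (var j = 1; j <= target.Length; j++)
			{
				var previousAbove = row[j]; // D[i - 1, j]
				var cost = source[i - 1] == target[j - 1] ? 0 : 1;
				row[j] = Math.Min(
					Math.Min(row[j - 1] + 1, previousAbove + 1), // insertion / deletion
					previousDiagonal + cost);                    // substitution
				previousDiagonal = previousAbove;
			}
		}

		return row[target.Length];
	}
}
```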

Turnerj commented 3 years ago

Hey @RobSchoenaker - thanks for the link, however Quickenshtein already uses that optimization and a bunch more. If you're curious about some of the exotic optimizations, you can check out this video of me talking about it at dotNETConf last year (https://www.youtube.com/watch?v=JiOYajl2Mds) or any of my blog posts on the subject.

This particular issue is about a diagonal calculation approach which allows us to take advantage of SIMD (watch my video or my blog posts if you want to know a bit about SIMD). Basically, calculating diagonally removes the dependency between neighbouring cells, so multiple cells can be calculated at once. It does require about 2x the memory of the traditional single-row approach, but for long strings we absolutely crush it in speed.
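
To make the dependency point concrete, here is a scalar sketch of the anti-diagonal traversal (an illustration only, not the vectorized code Quickenshtein actually uses): every cell on the current anti-diagonal (i + j == k) depends only on cells from the two previous diagonals, so no cell on the diagonal depends on another.

```csharp
using System;

public static class DiagonalLevenshteinSketch
{
	// Scalar sketch of the anti-diagonal traversal (illustration only).
	// Cells on the same anti-diagonal are independent of each other,
	// which is what allows a SIMD version to compute many of them at once.
	public static int Calculate(string source, string target)
	{
		int n = source.Length, m = target.Length;

		// Three rolling anti-diagonals (k - 2, k - 1 and k), indexed by row i.
		var prevPrev = new int[n + 1];
		var prev = new int[n + 1];
		var curr = new int[n + 1];

		for (var k = 0; k <= n + m; k++)
		{
			var iStart = Math.Max(0, k - m);
			var iEnd = Math.Min(n, k);

			for (var i = iStart; i <= iEnd; i++)
			{
				var j = k - i;
				if (i == 0)
					curr[i] = j; // first row: D[0, j] = j
				else if (j == 0)
					curr[i] = i; // first column: D[i, 0] = i
				else
				{
					var cost = source[i - 1] == target[j - 1] ? 0 : 1;
					curr[i] = Math.Min(
						Math.Min(prev[i - 1] + 1, prev[i] + 1), // deletion / insertion (diagonal k - 1)
						prevPrev[i - 1] + cost);                // substitution (diagonal k - 2)
				}
			}

			// Rotate the diagonals for the next iteration.
			(prevPrev, prev, curr) = (prev, curr, prevPrev);
		}

		// After the final rotation, the last completed diagonal sits in "prev".
		return prev[n];
	}
}
```

A vectorized version would replace the inner loop over i with SIMD loads and min operations over the prev/prevPrev arrays - and that is also where the source/target character lookups mentioned in the benchmarks above come in, since each vector of cells needs the corresponding characters of both strings to compute the substitution cost.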

I do plan to have a few more blog posts about Levenshtein Distance in the future too as there are definitely some cool things I'd like to check out with it.

RobSchoenaker commented 3 years ago

@Turnerj thx for the quick response. I see my comment did not make any sense :-) Will check out the dotNETConf things!

Turnerj commented 3 years ago

No problem mate! Hopefully you find some of the links I've given interesting.