Entomy / LibLangly

The combined Langly runtime
https://entomy.github.io/LibLangly/
33 stars 7 forks source link

FuzzyEquals should be using a better edit-distance algorithm #153

Open Entomy opened 4 years ago

Entomy commented 4 years ago

Currently, FuzzyEquals makes use of the Levenshtein edit-distance algorithm. This counts substitutions, insertions, and deletions, but not transpositions. FuzzyEquals() should take into consideration all four edits.

Entomy commented 4 years ago

Damerau-Levenshtein edit-distance is probably the best compromise between accuracy and performance for general purpose use (other edit-distance algorithms should still be implemented). However there's two pitfalls when finding a good implementation for this. As a result, lifting an existing implementation isn't likely.

  1. The overwhelming majority of implementations actually implement Optimal String Alignment, which doesn't return the same results. OSA is not the same algorithm, although it is very similar.
  2. The few that implement DL implement it incorrectly. I've emailed the authors about this, and have received no response within a week, so I'm not particularly hopeful about that getting fixed.

So pretty par-for-the-norm with Comp. Sci. stuff.

Entomy commented 4 years ago

These may be useful as far as a good implementation goes.

Hyyrö, H. A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances. Nord. J. Comput. 2002.

Balhaf, K. Alsmirat, M. et al. Accelerating Levenshtein and Damerau edit distance algorithms using GPU with unified memory. ICICS. 2017.

Entomy commented 4 years ago

Damerau-Levenshtein request is here Stringier/Metrics#2

Entomy commented 4 years ago

I mistakenly marked this as a good issue for contributors. Implementation of edit-distance algorithms is a good first issue. However, choosing which one provides the default FuzzyEquals behavior is my decision, and will be based on analysis of several alternatives overall performance curves and capabilities.