Accelerate fuzzy string matching to handle large strings and large databases - Githubissues

BenEgeIzmirli / Whittler

MIT License

7 stars 2 forks source link

Accelerate fuzzy string matching to handle large strings and large databases #10

Open BenEgeIzmirli opened 3 years ago

BenEgeIzmirli commented 3 years ago

Currently, the fuzzy string matching is done by combining the similarity estimates from the Damerau-Levenshtein and Ratcliff-Obershelp algorithms in quadrature. This turns out to be far too slow in situations where a) the two strings being compared are both fairly large, or b) where there are a large number of results in the result set.

The Ukkonen algorithm could potentially serve as an accelerated Damerau-Levenshtein replacement. https://stackoverflow.com/questions/4057513/levenshtein-distance-algorithm-better-than-onm http://www.berghel.net/publications/asm/asm.php

BenEgeIzmirli commented 3 years ago

Just learned about this "fuzzy hashing" technique that seems promising.

https://ssdeep-project.github.io/ssdeep/index.html