Open dsamarin opened 4 years ago
The test cases seem to be from here https://gist.github.com/tomstuart/9e4fd5cd96527debf7a685d0b5399635 :)
From what I am seeing, there are other more sophisticated algorithms for fuzzy substring matching. I am curious to know whether you encountered this usecase in a real-world situation ?
Yes I stumbled upon the same test cases, however I didn't bother to read the Ruby implementation. I figure it would be better to implement it by slightly modifying your algorithm implementation.
Thanks for the article, it looks like I'm using the method described there:
Computing E(m, j) is very similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for E(m, j), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used E(i − 1,j), E(i,j − 1) or E(i − 1,j − 1) in computing E(i, j).
In the array containing the E(x, y) values, we then choose the minimal value in the last row, let it be E(x2, y2), and follow the path of computation backwards, back to the row number 0. If the field we arrived at was E(0, y1), then T[y1 + 1] ... T[y2] is a substring of T with the minimal edit distance to the pattern P.
This is actually for a real world use case. I am attempting to match large human-typed lists of products with a smaller list of brand names. I figure I would contribute back to this project with this variant instead of forking.
As for other more sophisticated algorithms, I haven't yet looked into anything like "filter-verification, hashing, Locality-sensitive hashing (LSH), Tries or other greedy and approximation algorithms". So far your slightly modified library seems to work great in my use case.
Thanks for the explanation. Feel free to send a change with tests. I think you would need to refactor out the core logic to a helper function. Please also run the benchmarks and check if there is any regression for the function call overhead.
It may become very useful to some to provide approximate substring matching. This reports the smallest edit distance of the
needle
in all possible substrings ofhaystack
. Here are some examples:All that is necessary is to initialize the row with all zeroes, run the main algorithm, and then return the smallest value in the row. I have developed some code and was wondering about your thoughts about including it (or some more optimized variant) in this package: