agnivade / levenshtein

Go implementation to calculate Levenshtein Distance.
MIT License

Support Levenshtein substring matching #12


dsamarin commented 4 years ago

It could be very useful to provide approximate substring matching: report the smallest edit distance between the needle and any substring of the haystack. Here are some examples:

tests := []struct {
    needle, haystack string
    want             int
}{
    {"Machu Picchu", "Machu Picchu", 0},
    {"Machu Picchu", "Where is Machu Picchu?", 0},
    {"Machu Picchu", "Where is Machu Pichu?", 1},
    {"Machu Picchu", "Where is Macchu Picchu?", 1},
    {"Stonehenge", "How do I get to Stone Henge?", 2},
    {"", "", 0},
    {"", "Machu Picchu", 0},
    {"u", "Machu Picchu", 0},
    {"z", "Machu Picchu", 1},
    {"Uluru", "", 5},
}

All that is necessary is to initialize the row with all zeroes (so a match can start at any position in the haystack), run the main algorithm, and then return the smallest value in the row. I have written some code and would like your thoughts on including it (or a more optimized variant) in this package:

import "unicode/utf8"

func ComputeSubstringDistance(needle, haystack string) int {
    if len(needle) == 0 {
        return 0
    }

    if len(haystack) == 0 {
        return utf8.RuneCountInString(needle)
    }

    if needle == haystack {
        return 0
    }

    // We need to convert to []rune if the strings are non-ASCII.
    // This could be avoided by using utf8.RuneCountInString
    // and then doing some juggling with rune indices.
    // The primary challenge is keeping track of the previous rune.
    // With a range loop, it's not that easy, and with a for-loop
    // we would need to track the inter-rune width using utf8.DecodeRuneInString.
    s1 := []rune(haystack)
    s2 := []rune(needle)

    lenS1 := len(s1)
    lenS2 := len(s2)

    // Init the row with all zeros: skipping a haystack prefix is free,
    // so a match may begin at any position.
    x := make([]int, lenS1+1)

    // Make a dummy bounds check to prevent the two bounds checks down below.
    // The one inside the loop is particularly costly.
    _ = x[lenS1]
    // Fill in the rest.
    for i := 1; i <= lenS2; i++ {
        prev := i
        var current int
        for j := 1; j <= lenS1; j++ {
            if s2[i-1] == s1[j-1] {
                current = x[j-1] // match
            } else {
                // Cheapest of substitution, insertion, and deletion.
                // min here is a two-int helper (Go 1.21's builtin also works).
                current = min(min(x[j-1]+1, prev+1), x[j]+1)
            }
            x[j-1] = prev
            prev = current
        }
        x[lenS1] = prev
    }

    // The answer is the smallest value in the final row, i.e. the best
    // distance over all possible end positions in the haystack.
    best := x[0]
    for i := 1; i <= lenS1; i++ {
        if x[i] < best {
            best = x[i]
        }
    }
    return best
}
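
For completeness, the table above could be wired into a standard table-driven test. A minimal sketch, assuming the function lands in this package (the test name is mine):

package levenshtein

import "testing"

func TestComputeSubstringDistance(t *testing.T) {
    tests := []struct {
        needle, haystack string
        want             int
    }{
        {"Machu Picchu", "Where is Machu Pichu?", 1},
        {"Stonehenge", "How do I get to Stone Henge?", 2},
        {"Uluru", "", 5},
        // ... plus the remaining cases from the table above.
    }
    for _, tt := range tests {
        if got := ComputeSubstringDistance(tt.needle, tt.haystack); got != tt.want {
            t.Errorf("ComputeSubstringDistance(%q, %q) = %d, want %d",
                tt.needle, tt.haystack, got, tt.want)
        }
    }
}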
agnivade commented 4 years ago

The test cases seem to be from here https://gist.github.com/tomstuart/9e4fd5cd96527debf7a685d0b5399635 :)

From what I am seeing, there are other, more sophisticated algorithms for fuzzy substring matching. I am curious whether you encountered this use case in a real-world situation?

dsamarin commented 4 years ago

Yes, I stumbled upon the same test cases; however, I didn't bother to read the Ruby implementation. I figured it would be better to implement this by slightly modifying your implementation of the algorithm.

Thanks for the article; it looks like I'm using the method described there:

Computing E(m, j) is very similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for E(m, j), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used E(i − 1, j), E(i, j − 1) or E(i − 1, j − 1) in computing E(i, j).

In the array containing the E(x, y) values, we then choose the minimal value in the last row, let it be E(x2, y2), and follow the path of computation backwards, back to the row number 0. If the field we arrived at was E(0, y1), then T[y1 + 1] ... T[y2] is a substring of T with the minimal edit distance to the pattern P.
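
For reference, here is a rough sketch of that backtracking variant in Go (the name substringMatch is mine, not part of this package; it keeps the full matrix so the matching substring's offsets can be recovered, and it assumes Go 1.21's builtin min):

// Sketch only: returns the minimal edit distance plus the start/end
// rune offsets of the best-matching substring of haystack.
func substringMatch(needle, haystack string) (dist, start, end int) {
    p := []rune(needle)
    t := []rune(haystack)
    m, n := len(p), len(t)

    // e[i][j] = minimal distance between the first i runes of the
    // needle and some substring of the haystack ending at rune j.
    // Row 0 stays all zeros: a match may begin at any offset.
    e := make([][]int, m+1)
    for i := range e {
        e[i] = make([]int, n+1)
    }
    for i := 1; i <= m; i++ {
        e[i][0] = i
        for j := 1; j <= n; j++ {
            if p[i-1] == t[j-1] {
                e[i][j] = e[i-1][j-1]
            } else {
                // Substitution, deletion, insertion (Go 1.21 builtin min).
                e[i][j] = 1 + min(e[i-1][j-1], e[i-1][j], e[i][j-1])
            }
        }
    }

    // The minimum of the last row is the distance; its column is where
    // the best-matching substring ends.
    for j := 1; j <= n; j++ {
        if e[m][j] < e[m][end] {
            end = j
        }
    }
    dist = e[m][end]

    // Follow the path of computation backwards to row 0; the column we
    // arrive at is where the substring starts.
    i, j := m, end
    for i > 0 {
        switch {
        case j > 0 && p[i-1] == t[j-1] && e[i][j] == e[i-1][j-1]:
            i, j = i-1, j-1 // match
        case j > 0 && e[i][j] == e[i-1][j-1]+1:
            i, j = i-1, j-1 // substitution
        case e[i][j] == e[i-1][j]+1:
            i-- // needle rune deleted
        default:
            j-- // haystack rune inserted
        }
    }
    start = j
    return dist, start, end // rune offsets: []rune(haystack)[start:end]
}

For the Stonehenge example above, this should report distance 2 with the offsets spanning "Stone Henge".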

This is actually for a real-world use case: I am attempting to match large human-typed lists of products against a smaller list of brand names. I figured I would contribute this variant back to the project instead of forking.

As for more sophisticated algorithms, I haven't yet looked into anything like "filter-verification, hashing, Locality-sensitive hashing (LSH), Tries or other greedy and approximation algorithms". So far your slightly modified library seems to work great for my use case.

agnivade commented 4 years ago

Thanks for the explanation. Feel free to send a change with tests. I think you would need to refactor the core loop out into a helper function shared by both entry points. Please also run the benchmarks and check whether there is any regression from the function-call overhead.
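
Something like this hypothetical benchmark (not currently in the suite) would help check that:

package levenshtein

import "testing"

var sink int

func BenchmarkComputeSubstringDistance(b *testing.B) {
    b.ReportAllocs()
    d := 0
    for i := 0; i < b.N; i++ {
        d = ComputeSubstringDistance("Stonehenge", "How do I get to Stone Henge?")
    }
    // Assign to a package-level sink so the compiler cannot
    // optimize the calls away.
    sink = d
}

Run it alongside the existing ComputeDistance benchmarks before and after the refactor (go test -bench=. -benchmem) and compare the numbers.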