EthanRutherford / fast-fuzzy

Fast fuzzy search utility
ISC License
376 stars 8 forks source link

Sellers Glitch #16

Closed ObliviousDonkey closed 2 years ago

ObliviousDonkey commented 3 years ago

search('foooooo bar foo', ['bar'])

Doesn't return anything. What's wrong with Sellers? But if I do search('bar', ['foooooo bar foo']), it works. Please fix it.

EthanRutherford commented 3 years ago

What's happening is that your search term (the first argument) is much much longer than your single search candidate (the second argument). The Sellers flavor of levenshtein distance tries to find the substring of the candidate with the closest match to the (whole) search term. This means that if the search term is significantly longer than a candidate, the score will be really low.

As you've noted, the order is important: by swapping the term and the candidate, you get virtually the opposite match score. bar is an exact substring of foooooo bar foo, so it receives a 100% match. but foooooo bar foo is not a substring of bar. it gets a 20% match, because three of the fifteen characters match. Using the default options, search will only return matches which score at least a 60% match. You can pass in 0 to get all of the results.

A non-sellers search would get a 20% match regardless of which one is the term and which is the candidate, because it's a whole string match, rather than a substring match.

There are some links included in the readme that you may click through if you'd like to learn more. https://en.wikipedia.org/wiki/Levenshtein_distance https://www.semanticscholar.org/paper/The-Theory-and-Computation-of-Evolutionary-Pattern-Sellers/0517aa6d420f66f74bd4b281e2ed0e2021f3d359?p2df

ObliviousDonkey commented 3 years ago

But is it possible to get the result I expected, by modifying the search function? And how? If term is bigger than candidate, it'll search substring candidate in term.

On Sat, Nov 20, 2021 at 10:23 AM Ethan Rutherford @.***> wrote:

What's happening is that your search term (the first argument) is much much longer than your single search candidate. The Sellers flavor of levenshtein distance tries to find the substring of the candidate with the closest match to the (whole) search term. This means that if the search term is significantly longer than a candidate, the score will be really low.

As you've noted, the order is important: by swapping the term and the candidate, you get virtually the opposite match score. bar is an exact substring of foooooo bar foo, so it receives a 100% match. but foooooo bar foo not a substring of bar. it gets a roughly 20% match, because three of the fifteen characters match. Using the default options, search will only return matches which score at least a 60% match. You can pass in 0 to get all of the results

A non-sellers search would get a 20% match regardless of which one is the term and which is the candidate, because it's a whole string match, rather than a substring match.

There are some links included in the readme that you may click through if you'd like to learn more. https://en.wikipedia.org/wiki/Levenshtein_distance

https://www.semanticscholar.org/paper/The-Theory-and-Computation-of-Evolutionary-Pattern-Sellers/0517aa6d420f66f74bd4b281e2ed0e2021f3d359?p2df

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/EthanRutherford/fast-fuzzy/issues/16#issuecomment-974591033, or unsubscribe https://github.com/notifications/unsubscribe-auth/AN6VMPH7NN3ARBVV2J6QJFTUM4PCZANCNFSM5IMYIFCQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

ObliviousDonkey commented 3 years ago

Just need to know if it's possible in a search trie or else I've to use traditional loop.

On Sat, 20 Nov 2021, 11:14 am Oblivious Donkey, @.***> wrote:

But is it possible to get the result I expected, by modifying the search function? And how? If candidate is bigger, it will swap values with term.

On Sat, Nov 20, 2021 at 10:23 AM Ethan Rutherford < @.***> wrote:

What's happening is that your search term (the first argument) is much much longer than your single search candidate. The Sellers flavor of levenshtein distance tries to find the substring of the candidate with the closest match to the (whole) search term. This means that if the search term is significantly longer than a candidate, the score will be really low.

As you've noted, the order is important: by swapping the term and the candidate, you get virtually the opposite match score. bar is an exact substring of foooooo bar foo, so it receives a 100% match. but foooooo bar foo not a substring of bar. it gets a roughly 20% match, because three of the fifteen characters match. Using the default options, search will only return matches which score at least a 60% match. You can pass in 0 to get all of the results

A non-sellers search would get a 20% match regardless of which one is the term and which is the candidate, because it's a whole string match, rather than a substring match.

There are some links included in the readme that you may click through if you'd like to learn more. https://en.wikipedia.org/wiki/Levenshtein_distance

https://www.semanticscholar.org/paper/The-Theory-and-Computation-of-Evolutionary-Pattern-Sellers/0517aa6d420f66f74bd4b281e2ed0e2021f3d359?p2df

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/EthanRutherford/fast-fuzzy/issues/16#issuecomment-974591033, or unsubscribe https://github.com/notifications/unsubscribe-auth/AN6VMPH7NN3ARBVV2J6QJFTUM4PCZANCNFSM5IMYIFCQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

EthanRutherford commented 2 years ago

Sorry for the late reply, but I'm curious what your intended use case is. The use case in mind is something like an autocomplete/autocorrect: typing "Tim" will find "Timothy" at the top of the list, but "Jim" would be next in the list, etc.

This may be an interesting new use-case to explore, but the existing behavior is working as intended, and your suggestion wouldn't totally fit in the existing use case. For instance, if a user typed "Austim" (typo), they'd probably be a bit confused if "Tim" was returned as a better match than "Austin".

So if I may ask, what did you have in mind?

ObliviousDonkey commented 2 years ago

I'm utilizing the module with nlp.js package for getting levenshtein distance between two sentences search('so stupid that you suck at maths, now stop wasting my time', ['you suck at maths', 'foo bar foo', 'fooo', 'asduijakdsasdijkdasnkjsdanijaskdsaidsjlkdaskj'])

And the score will be 1.0 for you suck at maths. Add this feature if possible. Don't make any breaking changes, add a new option in options parameter for this use-case. And I can filter the your "Austim" and "Tim" thing using match.index

On Tue, Nov 23, 2021 at 12:22 AM Ethan Rutherford @.***> wrote:

Sorry for the late reply, but I'm curious what your intended use case is. The use case in mind is something like an autocomplete/autocorrect: typing "Tim" will find "Timothy" at the top of the list, but "Jim" would be next in the list, etc.

This may be an interesting new use-case to explore, but the existing behavior is working as intended, and your suggestion wouldn't totally fit in the existing use case. For instance, if a user typed "Austim" (typo), they'd probably be a bit confused if "Tim" was returned as a better match than "Austin".

So if I may ask, what did you have in mind?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/EthanRutherford/fast-fuzzy/issues/16#issuecomment-975799119, or unsubscribe https://github.com/notifications/unsubscribe-auth/AN6VMPGVUI2HFNR4LVBN353UNKC5PANCNFSM5IMYIFCQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

EthanRutherford commented 2 years ago

I've been thinking about how I might do this, but I'm not sure if I'd be able to make it performant. The search trie is able to accelerate queries by first eliminating duplicate work done on common prefixes, and second by using the threshold to exit early when it determines that a branch of the trie cannot contain anything that scores high enough. By having the arguments be swappable during search, it foils both of those optimizations. The former, because the partial solution we reuse to avoid the duplicate work becomes invalidated if we have to swap the term and candidate. The latter, because we can no longer say "there are no matches in the rest of this branch, skip it" because, well, there might be. If the candidate can be a substring of the search term, then we more or less have to visit every candidate to know for sure. Which means, it's end up being no better than a for loop over every candidate.

I'm afraid a search trie is just not a workable optimization for what you're asking for.

EthanRutherford commented 2 years ago

You may be able to accomplish what you want with an alternative approach, though.

If what you're after is "sentence similarity", you could tokenize your sentences prior to search, and then use the number of matching words/the scores of each matching word to determine similarity:

class SentenceSearcher {
    constructor(sentences) {
        this.wordSearcher = new Searcher(sentences, {keySelector: this.tokenize, useSellers: false, returnMatchData: true});
    }
    tokenize(string) {
        // this function tokenizes your string, returning e.g. ["hello", "world"] for "hello world"
        ...
    }
    search(sentence) {
        const tokens = tokenize(sentence);
        const matchingSentences = new Map();
        for (const token of tokens) {
            for (const result of this.wordSearcher.search(token) {
                if (matchingSentences.has(result.item) {
                    matchingSentences.get(result.item).matches.push(result.score);
                } else {
                    matchingSentences.set(result.item, {sentence: result.item, matches: [result.score]});
                }
            }
        }

        return [...matchingSentences.values()].sort((a, b) => {
            // here you would compare number of matches and/or score of matches to determine which sentence is more similar
            // for example, `return a.matches.length - b.matches.length`
            ...
        });
    }
}
ObliviousDonkey commented 2 years ago

to get the expected result I've to do something like this:

searcher.search('oops', ['oops oops', 'kk'], { useSellers: true }):

oops is a substr of oops oops and it will return match { index: 0 } But there's another oops in the string. Can the trie be modified to get something like { index: 4 }? And pass an option like startFrom:

searcher.search('oops', ['oops oops', 'kk'], { useSellers: true, startFrom: 4 }):

It will search the term from index 4 in candidate? This is essential for sentence searcher

ObliviousDonkey commented 2 years ago

hlo

ObliviousDonkey commented 2 years ago

Is this possible, Rutherford?

ObliviousDonkey commented 2 years ago

Closing the issue. But I still need your reply? How can I get results after specific index using search tries?

EthanRutherford commented 2 years ago

This is where splitting the sentences into separate tokens (in this case, words) would get you what you're looking for. Splitting the string beforehand, and keeping track of the index each word appears at, and then passing in those words to a searcher, with references back to the original sentence, would allow you to get results for each instance of a match in any given string.

ObliviousDonkey commented 2 years ago
const x = ['foo bar foo'];
const y = new Searcher(x);
const str = 'bar foo';

const tokens = sentence.split(' ');
        const match = new Map();
        for (const token of tokens) {
            for (const result of this.wordSearcher.search(token) {
                console.log(result.matches)
                }
            }
        }

This'll probably log { index: 3 at first and then { index: 0 } at last. That means result.item has a foo bar. But I was searching for bar foo which is still in the string. The searcher returned the results before walking the whole string.

ObliviousDonkey commented 2 years ago

Please reply, I think it's possible to implement the startFrom option

ObliviousDonkey commented 2 years ago

This is where splitting the sentences into separate tokens (in this case, words) would get you what you're looking for. Splitting the string beforehand, and keeping track of the index each word appears at, and then passing in those words to a searcher, with references back to the original sentence, would allow you to get results for each instance of a match in any given string.

Inefficient solution and may result in bugs. Since, I expected to get the latter foo, but your searcher returned the first one to make it more "Efficient".