FuzzyCompletions support to limit number of results

subramanyam86 commented 7 years ago

It would be nice for the FuzzyCompletions api to support an additional parameter to limit number of results. This could speed up fuzzy completions on shorter queries.

hendrikmuhs commented 7 years ago

FuzzyCompletions is doing a depth-first search approach, not necessarily finding the best completions first like it is the case for prefix completion. Thats also why there is not API for that on the CPP implementation like for prefix completion. So its not as easy as it seems.

Having that said there are a couple of things possible to speed it up:

Issue #56 would give a massive speed improvement. Guesstimate, based on the impact for dumping: at least double the speed.
Use a exact first traversal approach: This is e.g. done in GetNear for geospatial lookups, it prefers the exact match before any fuzzy matches, trying to avoid edit distance penalties

But I think the most important thing to look at would be an API that calculates scores based on the combination of edit distance and dictionary weight. This function has to be given from the outside. Given such a function it may be possible to implement your original suggestion if this API can also return upper bounds for edit distances.

Last bot not least, referring to 'fuzzy completions on shorter queries': Another good strategy is to have a lookup table on the caller side limiting the max-edit distance parameter based on input length, e.g. for lengh < 5: max-editdistance = 1, length < 8: max-editdistance = 2, length > 8, max-editdistance = 3 etc (this is more or less the reverse of the API outlined above).

The current fuzzy matching implementation is just at the very beginning of what is possible.

subu-cliqz commented 7 years ago

@hendrikmuhs Thanks for the detailed explanation Hendrik! I am implementing the client side logic as you explained already. The issue #56 sounds interesting, will speak with Narek about implementing this. Thanks again!

hendrikmuhs commented 7 years ago

fixed: it is depth-first, not breadth first. Sorry, just realized, was a bit sleepy writing this.

The main reason for depth-first is the very small memorization needed when traversing the structure. Breadth-first would require lots of memory and therefore would be problematic in terms of scale.

cliqz-oss / keyvi

FuzzyCompletions support to limit number of results #212