cliqz-oss / keyvi

Keyvi - a key value index that powers Cliqz search engine. It is an in-memory FST-based data structure highly optimized for size and lookup performance.
https://cliqz.com
Apache License 2.0
179 stars 38 forks source link

FuzzyCompletions support to limit number of results #212

Open subramanyam86 opened 7 years ago

subramanyam86 commented 7 years ago

It would be nice for the FuzzyCompletions api to support an additional parameter to limit number of results. This could speed up fuzzy completions on shorter queries.

hendrikmuhs commented 7 years ago

FuzzyCompletions is doing a depth-first search approach, not necessarily finding the best completions first like it is the case for prefix completion. Thats also why there is not API for that on the CPP implementation like for prefix completion. So its not as easy as it seems.

Having that said there are a couple of things possible to speed it up:

But I think the most important thing to look at would be an API that calculates scores based on the combination of edit distance and dictionary weight. This function has to be given from the outside. Given such a function it may be possible to implement your original suggestion if this API can also return upper bounds for edit distances.

Last bot not least, referring to 'fuzzy completions on shorter queries': Another good strategy is to have a lookup table on the caller side limiting the max-edit distance parameter based on input length, e.g. for lengh < 5: max-editdistance = 1, length < 8: max-editdistance = 2, length > 8, max-editdistance = 3 etc (this is more or less the reverse of the API outlined above).

The current fuzzy matching implementation is just at the very beginning of what is possible.

subu-cliqz commented 7 years ago

@hendrikmuhs Thanks for the detailed explanation Hendrik! I am implementing the client side logic as you explained already. The issue #56 sounds interesting, will speak with Narek about implementing this. Thanks again!

hendrikmuhs commented 7 years ago

fixed: it is depth-first, not breadth first. Sorry, just realized, was a bit sleepy writing this.

The main reason for depth-first is the very small memorization needed when traversing the structure. Breadth-first would require lots of memory and therefore would be problematic in terms of scale.