EthanRutherford / fast-fuzzy

Fast fuzzy search utility
ISC License
368 stars 8 forks

Optimization tips? #29

Open ismail-doitbig opened 10 months ago

ismail-doitbig commented 10 months ago

Hi,

I have a table with 5k rows and 5-6 columns of 100-500 characters each, where I want to build some autocomplete functionality,

so when you type ra, rap, rapt, they all do a search.

How can I optimize this?

The loading time is 5 seconds, which is extremely slow.

EthanRutherford commented 10 months ago

Optimizing load times is something I should likely look into internally. The gist is that we're building an index, and it's all done upfront. At present, this loading time can be mitigated by a) initializing the searcher ahead of time, before the user focuses the search box, and b) limiting the number of candidates added to the searcher at a time and spacing out the initialization. Something like this (pseudo-code):

// yields to the event loop for the given duration
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function loadSearcher(strings, options) {
    // adjust these to suit your needs
    const batchSize = 1000;
    const sleepTimeMs = 20;

    const searcher = new Searcher([], options);
    for (let i = 0; i < strings.length; i += batchSize) {
        // add one batch of candidates at a time
        searcher.add(strings.slice(i, i + batchSize));

        // relinquish control so other main-thread work can proceed
        await sleep(sleepTimeMs);
    }

    return searcher;
}

Alternatively, one could place the searcher into a web worker to keep initialization off the main thread entirely. Depending on the use-case, it could also be useful to cache the searcher's index somewhere. It's not the prettiest, but one can access the index through searcher.trie and searcher.count, stash them somewhere (e.g. localStorage/indexedDB), and restore them later. This is one of a number of features I have in mind to improve for a V2.
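A minimal sketch of that caching idea, assuming `searcher.trie` and `searcher.count` are JSON-serializable and writable (the helper names are made up, and `storage` is anything with `getItem`/`setItem`, e.g. `localStorage`):

```javascript
// Hypothetical helpers for stashing and restoring a searcher's index.
// Assumes searcher.trie and searcher.count survive a JSON round-trip,
// which may not hold for every version of the library.
function saveIndex(storage, key, searcher) {
    storage.setItem(key, JSON.stringify({
        trie: searcher.trie,
        count: searcher.count,
    }));
}

function restoreIndex(storage, key, searcher) {
    const cached = storage.getItem(key);
    if (cached == null) {
        return false; // nothing cached yet; build the index normally
    }

    const {trie, count} = JSON.parse(cached);
    searcher.trie = trie;
    searcher.count = count;
    return true;
}
```

For larger indexes, the same shape of code works against indexedDB (without the JSON step, since it uses structured clone), just with async calls.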

> so when you type ra, rap, rapt they all do a search.

If I understand correctly, you would like to avoid doing a search on every keystroke? That can be accomplished with a debounce function (for example, lodash's debounce), which delays the search until the user stops typing.
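If you'd rather not pull in lodash, a debounce helper is only a few lines (the 200ms below is just an example delay):

```javascript
// Debounce: delays invoking `fn` until `delayMs` ms have passed since
// the most recent call; earlier pending calls are discarded.
function debounce(fn, delayMs) {
    let timer = null;
    return (...args) => {
        clearTimeout(timer);
        timer = setTimeout(() => fn(...args), delayMs);
    };
}

// usage sketch: only searches once the user pauses typing
// input.addEventListener("input", debounce((e) => runSearch(e.target.value), 200));
```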

plutonium-239 commented 1 month ago

Hi, I would also like the loading either to be faster or to be offloadable to web workers.

Can I build a Searcher in the worker, send the trie and count over to main, and then just assign them on a new Searcher object?

Edit: these properties aren't accessible.
Edit 2: they're just not exported in the TypeScript types, but they're there. Also, saving to localStorage errors out because the data is too large.

EthanRutherford commented 1 month ago

Converting the trie into a DAWG would help with size, though it would take additional time. Beyond that, using indexedDB as the storage location would give much more space: localStorage has a limit of around 5MB (and having to store objects as JSON strings doesn't help), while indexedDB can easily store many GB, depending on the browser and available disk space. It also stores objects via structured clone instead of JSON strings, so objects take less space anyway.

As far as improving load times goes, I have looked into it since the original reply here, and wasn't able to find options for improving them: the total amount of preprocessing needed to normalize strings and create the trie is about as stripped down as it can be. One thing that can be done on the consumer side is to use the option useSeparatedUnicode: true, which skips the step of combining multiple characters into grapheme clusters. If you know your data doesn't have multi-character unicode graphemes to worry about, that option saves a decent chunk of the preprocessing cost.

Due to some difficult life events, I haven't worked on v2 in a while, but there are some things I was planning to include to improve perceived performance: integrating directly with indexedDB, adjusting the default options so the more expensive processing is opt-in instead of the default, supporting cooperative scheduling and web workers, and lazy/deferred processing that does preprocessing on demand during search, spreading the cost out more evenly.

plutonium-239 commented 1 month ago

I ended up using web workers completely. The Searcher construction and, subsequently, all search calls are offloaded to the web worker, which keeps the main thread nice and responsive, and the build time in the background also goes mostly unnoticed because there is no stutter.
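For anyone trying the same thing, a sketch of the pattern, with the message handling kept as a pure function so the same logic can run inside a worker's onmessage or be tested directly (the message shapes and `makeSearchHandler` name are illustrative, not part of the library):

```javascript
// Hypothetical request/response protocol for running a searcher inside
// a web worker. `createSearcher` is injected so the handler itself
// stays independent of the worker environment.
function makeSearchHandler(createSearcher) {
    let searcher = null;
    return (msg) => {
        switch (msg.type) {
            case "init":
                // e.g. createSearcher = (c, o) => new Searcher(c, o)
                searcher = createSearcher(msg.candidates, msg.options);
                return {type: "ready"};
            case "search":
                // echo the id so the main thread can match responses
                return {type: "results", id: msg.id, results: searcher.search(msg.query)};
            default:
                return {type: "error", message: `unknown message type: ${msg.type}`};
        }
    };
}

// inside worker.js (browser), something like:
// const handle = makeSearchHandler((c, o) => new Searcher(c, o));
// self.onmessage = (e) => self.postMessage(handle(e.data));
```

The main thread then posts an "init" message once, and a "search" message per (debounced) keystroke, correlating responses by id.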

You can check out the final result here: https://personal-dictionary.pages.dev/ (the main input field is using the fuzzy search over a list of 370k words). Note that the searches are debounced to 200ms.

I would still like to use indexedDB to store the wordlist, though, so it doesn't have to be fetched every time (it's about 3MB).

In conclusion: Thanks for the library!

EthanRutherford commented 1 month ago

Nice!