GitbookIO / hunspell-spellchecker

Parse and use Hunspell dictionaries in Javascript
Apache License 2.0
77 stars 18 forks source link

Massively speed up the `correct` function. #18

Open FeepingCreature opened 5 years ago

FeepingCreature commented 5 years ago

Massively speed up the correct function.

correct is slow because it calls edit1 twice recursive. Especially for long words, this can lead to low millions of words to generate and test.

Instead: build a trie of letter histograms for every word in the dictionary. Then, add a function findSimilarWords to find words whose histogram is similar to the one provided plus some changes. Use this to find candidate words for an edit distance of 2 (histogram distance of 4; two replaces). Then do a bidirectional comparison to find the words that actually match.

Instead: build a recursive treetrie of all the words in the dictionary. Then, add a function findSimilarWords that walks the tree using a provided word as a guide, making changes during iteration, and aborting when the current set of changes leads to a dead end.

Tree initialization is lazily done once correct is called. It takes about two seconds less than a second maybe 400ms. After that, correct is effectively instant (~10ms on en_US).

Further ideas for improvements: a proper trie data structure could be used for performance. Done!

FeepingCreature commented 5 years ago

rename trie to tree. it's a tree.

FeepingCreature commented 5 years ago

On reflection, the histogram part is kind of unnecessary to the algorithm. I'll remove that probably.

FeepingCreature commented 5 years ago

Removed. It's even a lot faster to build now! (Maybe three or four times faster.)

FeepingCreature commented 5 years ago

It's a trie now.

FeepingCreature commented 5 years ago

It's another 25% faster! I think I'll stop here, the code is getting a bit ugly.

Ulrikop commented 5 years ago

great, that would fix #9