elijah-potter / harper

The Grammar Checker for Developers
https://writewithharper.com
Apache License 2.0
1.93k stars 35 forks source link

feat: FST-based Curated Dictionary Spellchecking #258

Closed grantlemons closed 2 days ago

grantlemons commented 3 weeks ago

Created a new dictionary FstDictionary that uses a finite-state transducer (FST)-based map underneath. This precomputed map is built based on the dictionary file at compile time, and allows for extremely fast edit-distance calculations used in the spellchecking logic.

FstDictionary is ideal for the curated dictionary, but is immutable, so the current FullDictionary implementation is still used for user and file dictionaries.

Using this new dictionary results in much faster spellchecking (about 10x faster when not yet cached per benchmark below). image

One downside to this PR is that it moves the hunspell parsing stuff out into a new crate harper-dictionary-parsing, which makes some of the imports a little weird. For instance, harper-core now imports Span from harper-dictionary-parsing.

~Also, there is a noted bug in the fst library with its handling of japanese characters, but for the time being this should not pose an issue as multi-language support is nowhere near being implemented. (#138, fst/#38)~

grantlemons commented 3 weeks ago

Made it more correct (better utf-8 edit-distance support using the levenshtein-automata crate mentioned in #138) and a tiny bit faster by storing the DFA builders thread-locally 💪

image

grantlemons commented 6 days ago

Update: For correctness reasons it is now only 80% faster.