LintGroup::lint() takes >1s with ~35kb of markdown

tvanderstad commented 3 weeks ago

Me again 👋

I managed to incorporate harper-core into the markdown text editor for Lockbook and included a recorded demo in a PR. It was super easy to get set up with, so, props for that.

Because Harper emphasizes speed, I implemented it so it runs synchronously on every keystroke. This ended up being a Bad Time™️ when I opened my monthly dev log, a ~35kb markdown file, which takes around ~1.2s to lint.

I can get it usable by integrating it asynchronously, but your readme makes bold performance claims and asks for performance bug reports. I also don't see why grammar checking couldn't be fast - there's a couple things we get away with doing every keystroke, including parse the markdown AST.

Here's the code from that PR that measures the performance so you know I'm doing it right:

    let start = std::time::Instant::now();
    let mut lints = linter.lint(&document);
    println!("linting took {:?}", start.elapsed()); // prints values like 1.2s for ~35kb of text

elijah-potter commented 3 weeks ago

I'm honored that you're interested in using Harper for Lockbook! I'm quite certain your slowdown is coming from our spell-checker implementation, which is currently quite slow and being rewritten by @grantlemons. Try disabling it and give it another go.

tvanderstad commented 3 weeks ago

I tried disabling spell check using the following config and the runtime came down to ~15ms 👍

    let mut linter = LintGroup::new(
        LintGroupConfig {
            spelled_numbers: Some(true),
            an_a: Some(true),
            sentence_capitalization: Some(true),
            unclosed_quotes: Some(true),
            wrong_quotes: Some(true),
            long_sentences: Some(true),
            repeated_words: Some(true),
            spaces: Some(true),
            matcher: Some(true),
            correct_number_suffix: Some(true),
            number_suffix_capitalization: Some(true),
            multiple_sequential_pronouns: Some(true),
            linking_verbs: Some(true),
            avoid_curses: Some(true),
            terminating_conjunctions: Some(true),
            ellipsis_length: Some(true),
            dot_initialisms: Some(true),
            spell_check: Some(false), // tl;dr this is the only one that's false
        },
        dict,
    );

elijah-potter commented 3 weeks ago

Fantastic. Once @grantlemons has made his spell checking changes this should be a resolved issue. We shall keep you up-to-date, but I can't imagine it will be too long.

tvanderstad commented 3 weeks ago

I noticed Grant's working on this (perhaps even as I write!) and wanted to put this out there: have you considered delegating to another library for spell checking? As I understand, Harper is a grammar checker competing with Grammarly and LanguageTool by emphasizing performance and privacy. Competing with spell checkers like Hunspell doesn't seem to be the goal because implementations like Hunspell don't have the drawbacks of API-based grammar checkers. I would specifically recommend trying Spellbook (out of the handful suggested for comparison in #215).

Here's their API, reproduced from the top of their readme:

fn main() {
    let aff = std::fs::read_to_string("en_US.aff").unwrap();
    let dic = std::fs::read_to_string("en_US.dic").unwrap();
    let dict = spellbook::Dictionary::new(&aff, &dic).unwrap();

    let word = std::env::args().nth(1).expect("expected a word to check");

    if dict.check(&word) {
        println!("{word:?} is in the dictionary.");
    } else {
        let mut suggestions = Vec::new();
        dict.suggest(&word, &mut suggestions);
        eprintln!("{word:?} is NOT in the dictionary. Did you mean {suggestions:?}?");
        std::process::exit(1);
    }
}

I POC'd a Spellbook integration for Lockbook and it linted that 35kb Markdown file in ~1ms. Perhaps that's due to memory optimizations they document. Perhaps it's due to their interface: they check single words at a time and offer suggestions separately. Because I'm writing a markdown editor, I already have my markdown parsed all the time (for rendering) and my words parsed all the time to support features like jumping the cursor to the next word boundary with alt+left and alt+right. So, while it's extra work for me to segment words and not spell check words in code blocks, it's work that can be done once in my app instead of Harper redo'ing that work for convenience. On suggestions, I don't need all the suggestions provided up front because they'll only appear in my app for one word at a time (when the user right-clicks a word with a red underline), so Spellbook's suggest-for-a-single-word API is perfect and presumably saves a lot of work.

Since Grant's rewriting Harper's spell checking, it's hard to compare, but for suggestions I noticed Grant added fst = { version = "0.4.7", features = ["levenshtein"] }. fst's author comments that "this implementation is a proof of concept," "not speedy," and "can use enormous amounts of memory." Maybe that comment is exaggerated and everything will work out great, or maybe Grant's got some trick up his sleeve, but I'll believe your performance claims when I see them :)

A big asterisk here is that Spellbook's suggestions are work-in-progress. I had to pull code from their master branch and comment out a todo!() to get it not to crash, but even so, the suggestions are good and it seems usable. Their suggester was last modified a month ago (active enough for me! 🤷‍♂️). It's detail-oriented and factors in things like keys that are adjacent on the keyboard - stuff that's presumably useful and not very fun to re-implement.

Anyway, thanks for indulging my rant. This has been my case for delegating spell checking to Spellbook. I actually got the idea from a comment by the maintainer of Spellbook himself on their repo's only open issue.

grantlemons commented 2 weeks ago

"this implementation is a proof of concept," "not speedy," and "can use enormous amounts of memory."

While you are correct that this is in the fst docs, it isn't really true for the dictionary size we're talking about, and I also pivoted to the more mature levenshtein_automata crate, which should be better, though I haven't really investigated the memory footprint much.

The feature is in pull request right now, but benchmarks show uncached spellchecking as being about 10x faster than before, though I'm not sure how it performs relative to spellbook. Feel free to try it out.

As for spellbook integration, @elijah-potter and I discussed it recently, and there was a reason it wouldn't work, though I can't remember it off of the top of my head. Maybe user or file dictionaries?

elijah-potter commented 2 weeks ago

As for spellbook integration, @elijah-potter and I discussed it recently, and there was a reason it wouldn't work, though I can't remember it off of the top of my head. Maybe user or file dictionaries?

We actually use our dictionary for things other than spellchecking, namely tagging words in sentences based on which roles they could fulfill. For example, Spellbook doesn't allow us to mark words like house as a noun. Further, it doesn't provide the information needed to mark tokens like Carl's as a genitive case proper noun in phrases such as Carl's parking spot.

We could still use Spellbook for spellchecking anyway, but it would mean bundling a duplicate dictionary with the binary, which isn't very conducive to our goal of minimizing bundle size. That doesn't mean we won't do it, but it might just be easier to get our spellchecking implementation up to snuff instead.

tvanderstad commented 2 weeks ago

Ah okay, makes sense to me! I got caught up working on another feature but I'll give it a try in a few days and let you know how it goes

elijah-potter / harper

LintGroup::lint() takes >1s with ~35kb of markdown #257