bminixhofer / nlprule

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.
Apache License 2.0

Document how to load custom rulesets #82

Open ssokolow opened 2 years ago

ssokolow commented 2 years ago

I have a project where I'd prefer not to reinvent nlprule for applying my custom grammar rules (common validly-spelled typos I see in fanfiction), but the documentation is very unclear on how to do anything with custom rules.

  1. In a PyQt application, how do I specify files by path like with the Rust API?
  2. How do I go from the raw LanguageTool XML to the .bin files?
  3. Do I need to do multiple passes with different nlprule instances if I also want to check regular grammar stuff or is there a way to merge rulesets?

bminixhofer commented 2 years ago

nlprule is currently not very easily extensible w.r.t. custom rules.

To answer (2), you can look at https://github.com/bminixhofer/nlprule/tree/main/build. The binaries are compiled from a directory containing all the needed resources. You could add your rules to the grammar.xml in this directory and compile a new binary yourself.
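As a rough illustration of the "add your rules to the grammar.xml" step, here is a minimal Python sketch that appends a custom category to a LanguageTool-style grammar file and refuses to merge if a rule id already exists. Everything in it is an assumption for illustration: the tiny `BASE_GRAMMAR` stands in for the real grammar.xml, the `FANFIC_TYPOS` category and `BARE_WITH_ME` rule are hypothetical, and `merge_category` is not part of nlprule. The real grammar.xml is much larger and may carry a DOCTYPE, entities, and comments that `xml.etree.ElementTree` does not round-trip, and the build may not support every LanguageTool rule feature, so treat this as a starting point rather than a finished tool.

```python
import xml.etree.ElementTree as ET

# Stand-in for the grammar.xml in the build directory (the real file is far larger).
BASE_GRAMMAR = """<rules lang="en">
  <category id="TYPOS" name="Possible Typo">
    <rule id="EXISTING_RULE" name="existing rule">
      <pattern><token>foo</token></pattern>
      <message>Existing message.</message>
    </rule>
  </category>
</rules>"""

# Hypothetical custom category with one rule, written in LanguageTool's rule syntax.
CUSTOM_CATEGORY = """<category id="FANFIC_TYPOS" name="Custom typos">
  <rule id="BARE_WITH_ME" name="bare with me (bear with me)">
    <pattern>
      <marker><token>bare</token></marker>
      <token>with</token>
      <token>me</token>
    </pattern>
    <message>Did you mean <suggestion>bear</suggestion>?</message>
    <example correction="bear">Please <marker>bare</marker> with me.</example>
  </rule>
</category>"""


def merge_category(grammar_xml: str, category_xml: str) -> str:
    """Append a <category> to the grammar, rejecting duplicate rule ids."""
    root = ET.fromstring(grammar_xml)
    category = ET.fromstring(category_xml)

    # Collect the ids already present so a custom rule cannot shadow one.
    existing_ids = {rule.get("id") for rule in root.iter("rule")}
    for rule in category.iter("rule"):
        if rule.get("id") in existing_ids:
            raise ValueError(f"duplicate rule id: {rule.get('id')}")

    root.append(category)
    return ET.tostring(root, encoding="unicode")


merged = merge_category(BASE_GRAMMAR, CUSTOM_CATEGORY)
rule_ids = [rule.get("id") for rule in ET.fromstring(merged).iter("rule")]
print(rule_ids)  # ['EXISTING_RULE', 'BARE_WITH_ME']
```

Keeping the custom rules in their own category file and merging them at build time means the upstream grammar.xml can be updated without redoing the edits by hand.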

ssokolow commented 2 years ago

That's fine in and of itself for now. I intended this more as a request for:

  1. A more obvious path to discover the existence of https://github.com/bminixhofer/nlprule/tree/main/build from the README or the docs.rs documentation.
  2. Clarification on whether "the nlprule binaries" refers to just the .bin files or also to the build of the Rust code used in the Python distributables and, if the latter, how to minimize the amount of needless work done when the intent is just to generate them from a replacement grammar.xml.
  3. Clear instructions on how to access the Rust version's path-taking APIs from the Python bindings, where the documentation only covers letting it do the path lookup for you.

...ideally all together in a single "How to customize the grammar rules" section.

bminixhofer commented 2 years ago

I agree, this should be improved.

To answer (2), by "the nlprule binaries" I mean only the .bin files, not the build of the Rust code. But I see now where the confusion is coming from. Maybe a different name for the LanguageTool data would be better, to avoid confusion.

ssokolow commented 2 years ago

You could just call them ".bin files". That'd work.