errata-ai / vale

A markup-aware linter for prose built with speed and extensibility in mind.
https://vale.sh
MIT License

[WIP] Multilingual, spaCy-powered NLP #356

Open jdkato opened 3 years ago

jdkato commented 3 years ago

All of the required pieces are finally in place to offer integration with spaCy:

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products.

This allows Vale to support (1) rules written for any of spaCy's supported languages and (2) highly accurate (custom-trained, even) NLP.

If implemented well, I think this has the potential to easily 2x Vale's usefulness.

Getting Started

The current version of Vale (v2.10.4) has unofficial support for this integration; the implementation details are still a WIP.

To get started, you'll need Vale (v2.10.4), Python 3.9, and Pipenv installed. Next, follow the steps below:

  1. Start the spacy-vale API locally.

  2. Create a .vale.ini file:

    StylesPath = styles
    MinAlertLevel = suggestion
    
    # This is the API started in step 1. 
    #
    # You'll need to change this value to the URL provided in uvicorn's output.
    NLPEndpoint = http://0.0.0.0:5000
    
    [*.md]
    # This is the language of the documents matched by the above glob pattern 
    # (`*.md`, in this case).
    Lang = en
    
    ...
  3. Create a style/rules (see next section).
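
As a quick sanity check on the settings above, here is a minimal Python sketch that pulls out the two values the NLP integration depends on: the `NLPEndpoint` URL and the per-glob `Lang`. This is not how Vale itself reads its config. The function name `read_nlp_settings` and the placeholder `[core]` section are assumptions made for illustration, and the sample text omits the elided `...` line.

```python
import configparser
from urllib.parse import urlsplit

# A copy of the example .vale.ini, minus comments and the elided "..." line.
SAMPLE_INI = """\
StylesPath = styles
MinAlertLevel = suggestion
NLPEndpoint = http://0.0.0.0:5000

[*.md]
Lang = en
"""


def read_nlp_settings(ini_text):
    """Return (host, port, {glob: language}) from .vale.ini-style text.

    Hypothetical helper for illustration only; not part of Vale.
    """
    # The top-level keys have no section header, which configparser
    # rejects, so a placeholder "[core]" header is prepended.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string("[core]\n" + ini_text)

    endpoint = urlsplit(parser["core"]["NLPEndpoint"])
    languages = {
        name: parser[name]["Lang"]
        for name in parser.sections()
        if name != "core" and "Lang" in parser[name]
    }
    return endpoint.hostname, endpoint.port, languages


host, port, languages = read_nlp_settings(SAMPLE_INI)
print(host, port, languages)
```

If the host and port printed here differ from what uvicorn reports when the API starts, `NLPEndpoint` is the value to change.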

Creating, testing, and debugging rules

The main entry point for NLP-based rules will be the sequence extension point. For example, here is an implementation of LanguageTool's WOULD_BE_JJ_VB rule:

    ---
    extends: sequence
    message: "The infinitive '%[4]s' after 'be' requires 'to'. Did you mean '%[2]s %[3]s *to* %[4]s'?"
    tokens:
      - tag: MD
      - pattern: be
      - tag: JJ
      - tag: VB|VBN
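
To make the rule above easier to read: the tags are Penn Treebank part-of-speech tags, where `MD` is a modal verb ("would"), `JJ` is an adjective, `VB` is a base-form verb, and `VBN` is a past participle. The Python sketch below shows roughly what matching such a token sequence involves. It is a simplified illustration, not Vale's actual implementation: the tokens are tagged by hand rather than by spaCy, every step is assumed to match exactly one adjacent token, and the names `STEPS`, `step_matches`, and `find_sequence` are made up for this example.

```python
import re
from typing import Dict, List, Optional, Tuple

# (text, Penn Treebank tag) pairs. These are tagged by hand here; in
# practice a tagger such as spaCy would produce them.
Token = Tuple[str, str]

# Mirrors the `tokens` list in the rule: each step tests either the
# token's tag or its text.
STEPS: List[Dict[str, str]] = [
    {"tag": "MD"},       # modal verb, e.g. "would"
    {"pattern": "be"},   # the literal word "be"
    {"tag": "JJ"},       # adjective, e.g. "nice"
    {"tag": "VB|VBN"},   # base-form verb or past participle
]


def step_matches(step: Dict[str, str], token: Token) -> bool:
    """Check one step against one token, treating the value as a regex."""
    text, tag = token
    if "tag" in step:
        return re.fullmatch(step["tag"], tag) is not None
    return re.fullmatch(step["pattern"], text, re.IGNORECASE) is not None


def find_sequence(tokens: List[Token],
                  steps: List[Dict[str, str]] = STEPS) -> Optional[int]:
    """Return the start index of the first run of adjacent tokens that
    satisfies every step in order, or None if there is no such run."""
    for start in range(len(tokens) - len(steps) + 1):
        window = tokens[start:start + len(steps)]
        if all(step_matches(s, t) for s, t in zip(steps, window)):
            return start
    return None


flagged = [("It", "PRP"), ("would", "MD"), ("be", "VB"),
           ("nice", "JJ"), ("have", "VB"), ("it", "PRP")]
correct = [("It", "PRP"), ("would", "MD"), ("be", "VB"),
           ("nice", "JJ"), ("to", "TO"), ("have", "VB")]

print(find_sequence(flagged))  # index where "would be nice have" starts
print(find_sequence(correct))  # no match: "to" sits between JJ and VB
```

In the flagged sentence the four matched tokens are "would", "be", "nice", and "have", which is what the `%[2]s`, `%[3]s`, and `%[4]s` placeholders in the rule's message refer to.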

To help the testing process, you can use Vale Studio's View Tags feature, which currently supports Markdown content written in Chinese, English, German, Russian, or Spanish.

[Screenshot: Screen Shot 2021-07-07 at 6.58.04 PM]

Finally, you'll be able to use these NLP-based rules with all existing integrations, such as VS Code (shown below).

[Screenshot: Screen Shot 2021-07-07 at 6.33.46 PM]

Feedback

Please report any issues you encounter: linting speed, Vale Studio usability, sequence limitations, etc.

ashemedai commented 2 years ago

@jdkato The link to spacy-vale is a 404. Did the repository get deleted? Curious about the work here since I have messed with spaCy myself before and am keen to use such features with some languages I deal with, e.g. Dutch.

jdkato commented 2 years ago

It moved to https://github.com/errata-ai/nlpapi. Nothing is really concrete yet, though.

honzajavorek commented 6 months ago

I'm completely new to vale, but it would be awesome if I could use it for Czech. I'd probably be a heavy user then. Being busy these days, I don't want to promise much, but I think I could find some time later on to try out the NLP-based rules. I'm also a Python dev, so external dependencies in Python do not scare me 😄 If I understand it correctly, this is very beta now and I'll have to write my own rules to test it out - there are no predefined rulesets I could use for Czech right now, is that correct?

Btw, the link to sequence is broken too: https://docs.errata.ai/vale/styles#sequence-v230. Not sure where to find it now; I couldn't see any docs linked from https://github.com/errata-ai/nlpapi.

I see there are tags and patterns. Where can I learn about the tags? Not sure what MD or JJ means.