errata-ai / vale

:pencil: A markup-aware linter for prose built with speed and extensibility in mind.
https://vale.sh
MIT License

Part of speech identification seems buggy #775

Open daobrien opened 6 months ago

daobrien commented 6 months ago

Check for existing issues

Environment

Fedora Linux 38; Vale version 3.0.7, installed from RPM.

Describe the bug / provide steps to reproduce it

I'm trying to write a rule to identify compound adjectives, which should be hyphenated. For example, in the phrase "the upper left corner", "upper left" should be written as "upper-left".

The rule currently appears as follows:

extends: sequence
message: "Use '%[1]s-%[2]s %[3]s', because '%[1]s-%[2]s' is an adjective."
level: error
tokens:
  - tag: JJ
    pattern: upper|lower
  - tag: JJ
    pattern: left|right
  - tag: NN

We've used several test cases and cannot get consistent results:

The status icons are in the lower left corner.

The status icons are in the upper left corner.

The status icons are in the upper right corner.

The status icons are in the lower right corner.

Vale only catches the last test case.
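To make the failure mode concrete, here is a minimal Python sketch of how a sequence-style rule could match against tagged tokens. It is not Vale's implementation: the `find_matches` helper, the full-match behavior of `pattern`, and the part-of-speech tags in the two sample inputs are all assumptions chosen for illustration. It only shows that a single unexpected tag is enough to make the whole sequence miss.

```python
import re

# Hypothetical stand-in for the rule above: (required tag, optional word pattern).
# This is an illustration, NOT how Vale evaluates sequence rules internally.
RULE = [
    ("JJ", re.compile(r"upper|lower")),
    ("JJ", re.compile(r"left|right")),
    ("NN", None),
]


def find_matches(tagged, rule=RULE):
    """Return every window of tokens where each (tag, pattern) pair matches."""
    hits = []
    for i in range(len(tagged) - len(rule) + 1):
        window = tagged[i:i + len(rule)]
        if all(
            tag == want_tag and (pat is None or pat.fullmatch(word))
            for (word, tag), (want_tag, pat) in zip(window, rule)
        ):
            hits.append([word for word, _ in window])
    return hits


# ASSUMED tagger output: both modifiers tagged as adjectives.
caught = [("the", "DT"), ("lower", "JJ"), ("right", "JJ"), ("corner", "NN")]

# ASSUMED tagger output: "left" tagged as a verb form instead of an adjective.
missed = [("the", "DT"), ("upper", "JJ"), ("left", "VBN"), ("corner", "NN")]

print(find_matches(caught))
print(find_matches(missed))
```

Under these assumed tags, the first input produces one match and the second produces none, even though the two phrases are grammatically parallel. Whether this is what actually happens for the three missed test cases depends on the tags the tagger really assigns.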

We used Vale Studio to test the parts of speech, but the results are inconsistent:

[Image: Vale Studio part-of-speech results]

This blocks further development of this rule for us. Would really appreciate any help. Thanks.

jdkato commented 6 months ago

Unfortunately, there's no straightforward solution here.

I'd argue that "buggy" is the wrong word here; the results are actually objectively good. For comparison, NLTK (a very widely used NLP library) gives exactly the same results when using its default tagger.

And when you consider the other constraints Vale has (~20MB binary, offline, no NLP installation dependencies, etc.), the results are very good.

That said, the fact that I had to write my own NLP library to even get this far is obviously not ideal. I've tried a number of ideas to incorporate third-party libraries but it complicates the installation / setup process pretty significantly.

For example, two of the best available libraries just aren't that practical for many of Vale's use cases.

I'm not sure what the solution here is yet, but it's definitely something that I've put a lot of time into trying to improve.
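For a closed set of words like the one in this rule, one possible fallback is to skip part-of-speech tags entirely and match on the literal words. The sketch below shows the idea in Python rather than as a Vale rule; the `suggest` helper and the lookahead heuristic (a following word stands in for the noun) are assumptions, not anything proposed in this thread.

```python
import re

# Hypothetical tagger-free check. The word lists come from the rule in this
# issue; the lookahead requires another word to follow, as a rough proxy for
# "modifies a noun".
COMPOUND = re.compile(r"\b(upper|lower) (left|right)(?= \w)", re.IGNORECASE)


def suggest(sentence):
    """Return the hyphenated form for each unhyphenated compound found."""
    return [f"{m.group(1)}-{m.group(2)}" for m in COMPOUND.finditer(sentence)]


# The four test cases from the report.
sentences = [
    "The status icons are in the lower left corner.",
    "The status icons are in the upper left corner.",
    "The status icons are in the upper right corner.",
    "The status icons are in the lower right corner.",
]

for s in sentences:
    print(suggest(s))
```

The trade-off is precision: without tags, a phrase such as "the lower left of the screen" would also be flagged, because the pattern cannot tell a noun from any other following word.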

daobrien commented 6 months ago

Thanks for your explanation of what's going on. Maybe s/buggy/imperfect/; obviously, getting perfect software is really hard. I can pass all this on to the team who helps me with our Vale setup, but relying on local servers is probably not something they'll get excited about.

Feel free to update the status of this issue to whatever you deem appropriate.

David