Part of speech identification seems buggy

daobrien commented 6 months ago

Check for existing issues

[X] Completed

Environment

Fedora Linux 38, installed from RPM, Vale version 3.0.7

Describe the bug / provide steps to reproduce it

Trying to write a rule to identify compound adjectives, which should be hyphenated. E.g., in the phrase "the upper left corner", "upper left" should be written as "upper-left".

The rule currently appears as follows:

extends: sequence
message: "Use '%[1]s-%[2]s %[3]s', because '%[1]s-%[2]s' is an adjective."
level: error
tokens:
  - tag: JJ
    pattern: upper|lower
  - tag: JJ
    pattern: left|right
  - tag: NN

We've used several test cases and cannot get consistent results:

The status icons are in the lower left corner.

The status icons are in the upper left corner.

The status icons are in the upper right corner.

The status icons are in the lower right corner.

Vale only catches the last test case.
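To make the failure mode concrete, here is a minimal sketch of how a tag-plus-pattern sequence rule behaves. It is not Vale's implementation: the `find_matches` helper is hypothetical, tags are compared by exact equality as a simplification, and the tagged sentences are made-up tagger output (the `VBN` tag on "left" is only an assumed example of a mis-tag, not what Vale actually reported).

```python
import re

# Mirrors the three tokens of the rule above: (required tag, optional word pattern).
RULE = [
    ("JJ", r"upper|lower"),
    ("JJ", r"left|right"),
    ("NN", None),  # any noun, no word pattern
]


def find_matches(tagged, rule=RULE):
    """Return each run of consecutive words that satisfies every rule token in order."""
    hits = []
    for start in range(len(tagged) - len(rule) + 1):
        window = tagged[start:start + len(rule)]
        if all(
            tag == want_tag
            and (pattern is None or re.fullmatch(pattern, word, re.IGNORECASE))
            for (word, tag), (want_tag, pattern) in zip(window, rule)
        ):
            hits.append(" ".join(word for word, _ in window))
    return hits


# Hypothetical tagger output, for illustration only.
tagged_ok = [("the", "DT"), ("lower", "JJ"), ("right", "JJ"), ("corner", "NN")]
tagged_miss = [("the", "DT"), ("upper", "JJ"), ("left", "VBN"), ("corner", "NN")]

print(find_matches(tagged_ok))    # ['lower right corner']
print(find_matches(tagged_miss))  # []
```

In this sketch a single word tagged as anything other than `JJ` is enough to make the whole sequence miss, which would explain why otherwise identical sentences give different results.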

We used Vale Studio to test the parts of speech, but the results are inconsistent.

This blocks further development of this rule for us. Would really appreciate any help. Thanks.

jdkato commented 6 months ago

Unfortunately, there's no straightforward solution here.

I'd argue that "buggy" is the wrong word here; the results are actually objectively good. For comparison, NLTK (a very widely used NLP library) gives exactly the same results when using its default tagger.

And when you consider the other constraints Vale has (~20MB binary, offline, no NLP installation dependencies, etc.), the results are very good.

That said, the fact that I had to write my own NLP library to even get this far is obviously not ideal. I've tried a number of ideas to incorporate third-party libraries but it complicates the installation / setup process pretty significantly.

For example, two of the best available libraries just aren't that practical for many of Vale's use cases:

CoreNLP (~482 MB download, would require a local Java server).
spaCy (~436 MB download, would require a local Python server).

I'm not sure what the solution here is yet, but it's definitely something that I've put a lot of time into trying to improve.

daobrien commented 6 months ago

Thanks for explaining what's going on. Maybe s/buggy/imperfect/, and obviously getting software perfect is really hard. I can pass all this on to the team who help me with our Vale setup, but relying on local servers is probably not something they'll get excited about.

Feel free to update the status of this to whatever you deem appropriate. David

errata-ai / vale