bjascob / LemmInflect

A python module for English lemmatization and inflection.
MIT License

Stanford Morphology class uses POS tags #14

Closed AngledLuffa closed 1 year ago

AngledLuffa commented 2 years ago

I'm not sure how you used the lemma annotator for CoreNLP to test the lemmatizer, but the Morphology class definitely does use POS tags if available:

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/process/Morphology.java

For example, the WordTag stemStatic(String word, String tag) interface

FWIW, the next version of CoreNLP will cover ADJ & ADV as well

bjascob commented 2 years ago

It's been a while since I ran these tests, but it looks like they were run against Stanford CoreNLP version 2018-10-05, aka v3.9.2. Since this is a Python library, I had to use SNLP's web interface and run CoreNLP as a server. I don't think the call you're referencing is available via HTTP requests, hence the inability to specify the POS tags.
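
For context, a sentence-level lemma request to a CoreNLP server might look roughly like the sketch below. It assumes a server on the default local port and the usual JSON output layout (`sentences` containing `tokens` with `word`, `pos`, and `lemma`); the helper names are hypothetical. The point it illustrates is that the server's own tagger assigns the POS, so the caller has no place to supply one.

```python
import json
from urllib.parse import urlencode

# Assumption: a CoreNLP server is running locally on its default port.
SERVER_URL = "http://localhost:9000/"


def build_request_url(base_url=SERVER_URL):
    """Build the URL for a sentence-level lemma request (hypothetical helper)."""
    properties = {
        "annotators": "tokenize,ssplit,pos,lemma",
        "outputFormat": "json",
    }
    # The server reads its pipeline settings from a JSON "properties" parameter.
    return base_url + "?" + urlencode({"properties": json.dumps(properties)})


def extract_lemmas(response_json):
    """Return (word, server-assigned POS, lemma) triples from a parsed JSON reply."""
    return [
        (tok["word"], tok["pos"], tok["lemma"])
        for sentence in response_json.get("sentences", [])
        for tok in sentence.get("tokens", [])
    ]
```

The raw text would be sent as the POST body to the URL from `build_request_url()`, and the parsed reply passed to `extract_lemmas()`.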

AntonOfTheWoods commented 1 year ago

@bjascob, I'm also very interested in seeing an updated comparison, particularly with the latest spaCy 3.4 and CoreNLP 4.5.1!

AngledLuffa commented 1 year ago

If you wind up rerunning it, we can make sure there's a suitable interface for the POS tag version of Morphology

bjascob commented 1 year ago

Can someone tell me how to call SNLP from Python with a word and its Penn-style POS tag and have it return the lemma? The built-in SNLP web server interface I used previously is set up to parse an entire sentence, not to take in a single word. If SNLP has the capability of doing this, it should make for a much better test. The last time I looked (probably several years ago) I didn't see a way to do this.

BTW, I won't have time to spend on this in the next month, but after that I'm up for revising the testing if someone has a way to get this info from SNLP. A code snippet of the HTTP requests to do this would be ideal. If this should be done through Stanza instead of SNLP's web interface, let me know that too. If so, sample code on how to use that lib to get the info would be helpful.

AngledLuffa commented 1 year ago

Are you starting from known POS tags, or raw text you want tagged?

bjascob commented 1 year ago

Starting from words with known POS tags and asking for the lemma. There are no sentences to parse in the Automatically Generated Inflection Database (AGID) used for testing.

AngledLuffa commented 1 year ago

Sounds good. I will add a Python - Java interface which allows for adding lemmas to tagged words.

AngledLuffa commented 1 year ago

Hmm, one thing that will be tricky will be that sometimes we distinguish between ADJ and ADV. It's relevant for "best", "worst", "better", and "worse". Also verb & noun forms may depend on the particular POS used. In general I think it will be okay, though

AngledLuffa commented 1 year ago

I added this to the dev branch of CoreNLP:

https://github.com/stanfordnlp/CoreNLP/commit/71bc95dfaf984f7056e0856414738be0706cf9e3

I added this to the dev branch of stanza:

https://github.com/stanfordnlp/stanza/pull/1144

I expect both to be released in late November or early December, hopefully. If you need it sooner, let us know. It uses xpos tags, but seeing as how you have just N, V, or A, I think you can get away with changing all nouns to NN, leaving all verbs as V, and changing all adjectives & adverbs to JJS if the word ends with "est" and JJR otherwise. This will have some weird effects on words such as "honest", though, not to mention the small handful of words which get treated differently if they are adjectives or adverbs.
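
A minimal sketch of the tag conversion suggested above, written as a plain Python function. The name `van_to_ptb` is hypothetical and not part of Stanza or CoreNLP; the rules are taken directly from the comment, including leaving verbs as `V`.

```python
def van_to_ptb(word, van_tag):
    """Map an AGID-style tag (N, V, or A) to an approximate xpos tag.

    Hypothetical helper following the rule of thumb in the comment above.
    """
    if van_tag == "N":
        return "NN"
    if van_tag == "V":
        # The comment suggests leaving verb tags unchanged.
        return "V"
    if van_tag == "A":
        # Crude suffix test: "est" is treated as superlative, everything else
        # as comparative. This mislabels words such as "honest".
        return "JJS" if word.endswith("est") else "JJR"
    raise ValueError(f"unexpected tag: {van_tag!r}")
```

For example, `van_to_ptb("biggest", "A")` gives `JJS` as intended, but `van_to_ptb("honest", "A")` also gives `JJS`, which is the weird effect mentioned above.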

bjascob commented 1 year ago

Great. If you remember, drop a quick note in here when it's released and I'll get an email update. If not, I'll likely remember to check back in a month or so anyway.

Can I assume that the def main() in morphology.py is a good example of how to use this in the Stanza library? (I haven't used Stanza before, just direct calls to the SNLP server.)

As a note to myself --> the Stanza API takes Penn Treebank style tags and the LemmInflect inflection test corpus only has VAN tags (V, A, or N). For testing, convert VAN tags to the closest PTB style tag. Consider trying all possible PTB tags for the word to verify that scores are not artificially lower due to the conversion.

AngledLuffa commented 1 year ago

Yep! That was the intent. I could also add some other interface, such as passing in tuples of (word, tag)

bjascob commented 1 year ago

I went ahead and compiled the "dev" versions of stanza 1.5 and CoreNLP 4.5.2 and re-ran tests. The results were...

Stanza version: 1.5.0
119,310 total test cases where 0 had no returns.
27.0 usecs per lemma
5,440 incorrect lemmas = 95.4% accuracy
Results by pos type
  VERB     :   2,596 / 43,171 =  94.0% accuracy
  ADJ/ADV  :     247 /  3,530 =  93.0% accuracy
  NOUN     :   2,597 / 72,609 =  96.4% accuracy
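
The percentages above follow from the incorrect/total counts on each line. A quick cross-check, using only the numbers reported here:

```python
# (incorrect, total) pairs copied from the results above.
results = {
    "VERB": (2596, 43171),
    "ADJ/ADV": (247, 3530),
    "NOUN": (2597, 72609),
}


def accuracy_pct(incorrect, total):
    """Accuracy as a percentage, given the number of incorrect lemmas."""
    return 100.0 * (1 - incorrect / total)


total_incorrect = sum(bad for bad, _ in results.values())
total_cases = sum(n for _, n in results.values())

for pos, (bad, n) in results.items():
    print(f"{pos:8s}: {accuracy_pct(bad, n):.1f}%")
print(f"overall : {accuracy_pct(total_incorrect, total_cases):.1f}%")
```

The per-category counts sum to the 5,440 incorrect lemmas and 119,310 test cases in the summary lines.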

Since the AGID only has V, A, or N for part-of-speech and Stanza wants the Penn Treebank tag, the code tries all the relevant PTB tags, creates a set of possible answers, and considers the result "passed" if the correct answer is in the set. Also note the time per lemma is for passing in the entire set at once. If I called it one at a time, it would take all day (literally).
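
The scoring rule described above can be sketched as follows. This is not the actual test script: the candidate tag lists are an assumption about which PTB tags count as "relevant", and `lemmatize` is a stand-in for whatever backend is under test. For clarity it calls the lemmatizer one pair at a time, whereas the real run passed everything in a single batch.

```python
# Assumed candidate PTB tags for each AGID tag; the actual lists used in the
# test are not given in this thread.
CANDIDATE_TAGS = {
    "V": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "N": ["NN", "NNS"],
    "A": ["JJ", "JJR", "JJS", "RB", "RBR", "RBS"],
}


def passes(word, van_tag, gold_lemma, lemmatize):
    """True if any candidate tag yields the gold lemma.

    `lemmatize(word, ptb_tag)` is a hypothetical callable returning a lemma.
    """
    candidates = {lemmatize(word, tag) for tag in CANDIDATE_TAGS[van_tag]}
    return gold_lemma in candidates


def accuracy(cases, lemmatize):
    """Fraction of (word, van_tag, gold_lemma) cases that pass."""
    if not cases:
        return 0.0
    hits = sum(passes(w, t, gold, lemmatize) for w, t, gold in cases)
    return hits / len(cases)
```

Because a case passes if any candidate tag produces the gold lemma, this measures an upper bound relative to scoring with a single converted tag.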

BTW, the numbers here are very close to what I get in LemmInflect. The AGID used here for testing is not necessarily a gold standard, and even English experts may disagree on the "correct" answer in some cases. I suspect mid-90s accuracy (aka agreement) is probably as good as it's ever going to get. I'll put a note to this effect in the README.

The interesting thing here is really how poorly NLTK and spaCy perform compared to state-of-the-art.

bjascob commented 1 year ago

@AngledLuffa I noticed that Stanza 1.5 and CoreNLP 4.5.3 were just released and thought I'd re-test them. Stanza 1.5 works fine, but CoreNLP 4.5.3 does not include the new morphology class. That's still only in the dev branch (just FYI in case this is an oversight).

Also note that the Stanza morphology code does not access Java if you only set CORENLP_HOME as per the instructions on the main page. For it to even attempt the Java call, CLASSPATH must be set. It looks to me like this is due to a check in the new Python morphology code. I can file a bug report for this if you want.

AngledLuffa commented 1 year ago

Whoops, thanks for catching that. Our distribution packaging script starts from specific main programs rather than including all of our repo, and I forgot to add that base program. I will do so now and make a new CoreNLP 4.5.4. I made a couple changes to Ssurgeon based on feedback from presenting it to people over the weekend (with more to come).

AngledLuffa commented 1 year ago

Also, the dev branch of stanza now doesn't need $CLASSPATH, since CoreNLP 4.5.4 should have it correctly included

https://github.com/stanfordnlp/stanza/commit/4dda14bd585893044708c70e30c1c3efec509863

bjascob commented 1 year ago

OK. Looks like things are working as they should, and I've updated the README to reflect CoreNLP 4.5.4. I assume this closes the issue, but feel free to add more comments if there are additional issues here.

AngledLuffa commented 1 year ago

Awesome, thanks for the update!

I'd be interested to see what we're getting wrong, especially for adj & adv. It's deterministic, so there may very well be a class of words we missed.