explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Incorrect lemmatization of some words #3665

Closed · bjascob closed this issue 5 years ago

bjascob commented 5 years ago

Issue

For some words, spaCy doesn't produce the correct lemma. Using an automated method, I found about 400 incorrect lemma forms; see mismatches.txt. This is a list of potential issues that would need to be reviewed by hand before inclusion in the exception lists.

Test Technique

I had spaCy parse the Gutenberg corpus (the copy included in NLTK) using en_core_web_sm, then tested each produced lemma against a lookup table I created from the NIH SPECIALIST Lexicon. The table maps words to a list of potential base forms (it's essentially a dictionary-based lemmatizer rather than a rules-based one). I didn't look at proper nouns or other forms that spaCy doesn't inflect (except adverbs, where 28 words were unhandled by spaCy). A minimal sketch of the comparison loop is below.
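For reference, here's a minimal sketch of that comparison loop. The name `LEMMA_TABLE` and its tiny stand-in contents are hypothetical; the real table is built from the SPECIALIST Lexicon and maps a (word, coarse POS) pair to the set of acceptable base forms.

```python
import spacy
from nltk.corpus import gutenberg  # requires nltk.download('gutenberg')

# Hypothetical stand-in for the table built from the SPECIALIST Lexicon:
# (lower-cased word, coarse POS) -> set of acceptable base forms.
LEMMA_TABLE = {
    ("went", "VERB"): {"go"},
    ("better", "ADJ"): {"good"},
}

nlp = spacy.load("en_core_web_sm")
mismatches = set()

for fileid in gutenberg.fileids():
    for doc in nlp.pipe(gutenberg.raw(fileid).split("\n")):
        for token in doc:
            # Skip proper nouns and other forms spaCy doesn't inflect.
            if token.pos_ not in ("NOUN", "VERB", "ADJ", "ADV"):
                continue
            expected = LEMMA_TABLE.get((token.text.lower(), token.pos_))
            if expected and token.lemma_.lower() not in expected:
                mismatches.add((token.text, token.pos_, token.lemma_))

for word, pos, lemma in sorted(mismatches):
    print(f"{word}/{pos}: spaCy produced '{lemma}'")
```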

The test could easily be run with another model (which might give slightly different tagging in some cases) or a different corpus (which could include more words or possibly more "modern" English). If you have opinions on either, let me know.

Environment

Proposed Fix

It's a bit of work to hand-edit the above list and add it to the code, so I wanted to check with the experts and get approval/opinions before going through with it. In addition, I'd like to verify that the inconsistencies PR referenced above will be included in the next release, and that you aren't planning any big changes to the lemmatizer. Both of these could affect the test results and the list of changes.

Here's what I propose.

Alternatively, instead of patching the holes, we could consider upgrading to a more extensive set of rules and/or moving to a dictionary-based approach (sketched below). However, either of these would require a fair bit of work and might require that lemmatizer.py handle English differently from other languages.
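To make the dictionary-based option concrete, here's a minimal sketch of the lookup-first idea: consult a dictionary/exceptions table first and fall back to suffix rules only on a miss. All names and the toy rule set here are hypothetical, not spaCy's actual lemmatizer code.

```python
def lemmatize(word, pos, table, rules):
    """Dictionary-first lemmatization with a rules fallback (toy sketch)."""
    forms = table.get((word.lower(), pos))
    if forms:
        return forms[0]  # dictionary hit: return the listed base form
    for suffix, replacement in rules.get(pos, []):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # nothing matched: return the word unchanged

# Toy data for illustration only.
table = {("went", "VERB"): ["go"]}
rules = {"VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]}

print(lemmatize("went", "VERB", table, rules))     # -> go
print(lemmatize("carries", "VERB", table, rules))  # -> carry
```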

Let me know your thoughts on this.

honnibal commented 5 years ago

Thanks for the work on this. You should probably hold off on making the corrections until at least v2.1.4 is released, as I did fix a bug in the lemmatizer that might be affecting some of these results.

The other thing happening with the lemmatizer is that we're switching over to supporting richer morphological features, which will allow us to write much better rules. This should improve accuracy significantly.

bjascob commented 5 years ago

If you're going to make lemmatizer changes, you might consider looking at LemmInflect (my code). It has a small character-based neural-net classifier that looks at the word and selects one of 34 automatically generated lemmatization rules. Since it's a neural net, it works well on out-of-vocabulary (OOV) words as well as dictionary ones. The net is small enough that it takes hardly any CPU time, and it's implemented with numpy, so no additional third-party libs are needed. A quick usage example is below.
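For a quick look, here's how the library is used. The calls below reflect LemmInflect's documented API as I understand it (getLemma, getAllLemmas, and the spaCy token extension), so double-check against the project docs.

```python
import lemminflect  # importing registers the ._.lemma() extension on spaCy tokens
import spacy
from lemminflect import getLemma, getAllLemmas

# Standalone lookups: upos is a universal POS tag (NOUN, VERB, ADJ, ADV).
print(getLemma('watches', upos='VERB'))  # -> ('watch',)
print(getAllLemmas('watches'))           # -> {'NOUN': ('watch',), 'VERB': ('watch',)}

# As a spaCy extension:
nlp = spacy.load('en_core_web_sm')
doc = nlp('I am testing this example.')
print(doc[2]._.lemma())                  # -> 'test'
```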

The module also includes code for parsing and using the NIH's SPECIALIST Lexicon, which is a great resource for morphological information on English words. It contains about 500K entries and appears to be very accurate. Check it out if you need corpus resources for this.

I'd be happy to contribute if you get to a point where you're interested.

honnibal commented 5 years ago

Really nice module, thanks!

ines commented 5 years ago

Merging this with the master issue in #2668!

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.