Incorrect lemma for "taxes"

jaleskovec commented 1 year ago

Description of issue

I'm not sure whether this is an accepted side effect of the model, so I apologize in advance if it isn't considered a bug. The lemmatizer returns "taxis" as the lemma for "taxes".

How to reproduce the behaviour

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("sentencizer")

doc = nlp('taxes are high')
sentence = list(doc.sents)[0]
words = [token.lemma_ for token in sentence]
print((sentence.text, words))

Actual output

('taxes are high', ['taxis', 'be', 'high'])

Info about spaCy

spaCy version: 3.2.4
Platform: Linux-5.10.0-15-amd64-x86_64-with-glibc2.31
Python version: 3.9.2
Pipelines: en_core_web_sm (3.2.0)

adrianeboyd commented 1 year ago

Thanks for the report! I double-checked the English lemmatizer data and the lemmatizer algorithm, and there isn't an easy fix for this with the current algorithm. It doesn't include frequency information, so it has no way to prefer "tax" over "taxis", and "taxis" is a possible lemma for "taxes".
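To illustrate why a missing frequency signal matters, here is a minimal toy sketch in plain Python. It is not spaCy's implementation or data: `TOY_EXCEPTIONS`, its contents, its ordering, and `toy_lemmatize` are all hypothetical names made up for this example. It only shows that when several candidate lemmas are listed and nothing ranks them, whichever happens to come first wins, and that narrowing the entry to a single candidate removes the ambiguity.

```python
# Toy illustration only: the table contents, their order, and this function
# are hypothetical and do not reproduce spaCy's lemmatizer or its data.
TOY_EXCEPTIONS = {"noun": {"taxes": ["taxis", "tax"]}}


def toy_lemmatize(word, pos, exceptions):
    """Return the first listed candidate lemma, or the word itself if none."""
    candidates = exceptions.get(pos, {}).get(word.lower(), [])
    # With no frequency information, the only available tie-breaker is list order.
    return candidates[0] if candidates else word


# Both "taxis" and "tax" are candidates; the first one listed is returned.
ambiguous = toy_lemmatize("taxes", "noun", TOY_EXCEPTIONS)

# Restricting the entry to one candidate makes the result unambiguous.
overridden_table = {"noun": {"taxes": ["tax"]}}
overridden = toy_lemmatize("taxes", "noun", overridden_table)

print(ambiguous, overridden)
```

The same idea, restricting the entry for "taxes" to the one lemma you want, is what editing the real lemmatizer tables achieves.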

If you'd like to exclude "taxis", you can modify the lemmatizer tables like this:

import spacy

nlp = spacy.load("en_core_web_sm")
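# Make "tax" the only candidate lemma for the noun "taxes" in the exceptions table
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")
lemma_exc["noun"]["taxes"] = ["tax"]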
doc = nlp("taxes are high")
print([t.lemma_ for t in doc])

If you save this pipeline with nlp.to_disk, the saved pipeline will include these changes to the lemmatizer tables.

jaleskovec commented 1 year ago

Thank you

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
