Closed jaleskovec closed 1 year ago
Thanks for the report! I double-checked the English lemmatizer data and the lemmatizer algorithm, and there's no easy fix for this with the current algorithm: it doesn't use frequency information, and "taxis" is a valid possible lemma for "taxes".
If you'd like to exclude "taxis", you can modify the lemmatizer tables like this:
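import spacy

nlp = spacy.load("en_core_web_sm")
# Override the noun exception entry so that "taxes" maps only to "tax"
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")["noun"]["taxes"] = ["tax"]
doc = nlp("taxes are high")
# Print the lemma of each token to confirm the change
print([t.lemma_ for t in doc])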
If you save this pipeline with nlp.to_disk, it will include these changes to the lemmatizer tables.
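For anyone wondering why an exception entry decides the result, here is a rough, self-contained sketch of a rule-plus-exceptions lemmatizer with no frequency information. It is a simplified toy model, not spaCy's actual implementation: the function name `candidate_lemmas` and the contents of `NOUN_RULES`, `NOUN_INDEX`, and `NOUN_EXC` are all made up for illustration.

```python
# Toy model only: these tables are invented and much smaller than real lemmatizer data.
NOUN_RULES = [("s", ""), ("es", ""), ("ies", "y")]  # (suffix to strip, replacement)
NOUN_INDEX = {"tax", "taxi", "taxis"}               # known noun lemmas
NOUN_EXC = {"taxes": ["taxis"]}                     # irregular forms (hypothetical entry)


def candidate_lemmas(word, rules, index, exceptions):
    """Return candidate lemmas for `word`, best guess first."""
    word = word.lower()
    forms = []
    # Apply every matching suffix rule and keep results that are known lemmas.
    for old, new in rules:
        if word.endswith(old):
            form = word[: len(word) - len(old)] + new
            if form and form in index and form not in forms:
                forms.append(form)
    # Exceptions are moved to the front. Without frequency information there is
    # nothing to tell the toy lemmatizer that "tax" is the more common reading.
    for form in reversed(exceptions.get(word, [])):
        if form in forms:
            forms.remove(form)
        forms.insert(0, form)
    # Fall back to the word itself if nothing matched.
    return forms or [word]


print(candidate_lemmas("taxes", NOUN_RULES, NOUN_INDEX, NOUN_EXC))  # ['taxis', 'tax']

# Overriding the exception entry, in the same spirit as the table edit above.
patched_exc = dict(NOUN_EXC, taxes=["tax"])
print(candidate_lemmas("taxes", NOUN_RULES, NOUN_INDEX, patched_exc))  # ['tax']
```

In this sketch the first candidate is taken as the lemma, so whichever form the exception table lists for "taxes" wins, and replacing that entry changes the result.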
Thank you
Description of issue
I'm not sure whether this is an accepted side effect of the model, so I apologize in advance if it is not considered a bug. The lemmatizer returns "taxis" as the lemma for "taxes".
How to reproduce the behaviour
Actual output
Info about spaCy