Closed: DavidRamosSal closed this issue 10 months ago.
In spaCy v2 the tokenizer and lemmatizer exceptions were both maintained as part of the tokenizer, but in v3 the tokenizer exceptions are in the tokenizer settings and the lemmatizer exceptions are in the `attribute_ruler`. If you want to see all the details, look at the rules in `nlp.tokenizer.rules` and `nlp.get_pipe("attribute_ruler").patterns`. If you'd like, you can customize any of these rules/patterns in the `en_core_web_*` pipelines as necessary for your task.
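A quick way to poke at these (a sketch, not an official recipe: a blank English pipeline already carries the tokenizer exceptions, while the `attribute_ruler` patterns only ship with trained `en_core_web_*` pipelines):

```python
import spacy

# A blank English pipeline loads the language's tokenizer exceptions,
# so no model download is needed to inspect them.
nlp = spacy.blank("en")

print(len(nlp.tokenizer.rules))          # total number of special cases
print(nlp.tokenizer.rules.get("doin'"))  # the exception entry, if present

# With a trained en_core_web_* pipeline you would additionally inspect:
#   nlp.get_pipe("attribute_ruler").patterns
```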
The existing lemmatizer exception only applies to "doin'" and not "doin", but looking again at the details, there are some mistakes in the current rules, which have the lemma as "doing" instead of "do", so we'll get that updated for the next model release (probably v3.8.0).
The existing lemmatizer exception for "doin'" is not working for me either. This must be related to #13098, because after adding

```python
nlp.get_pipe("attribute_ruler").add(patterns=[[{'LOWER': "doin'"}]], attrs={'LEMMA': 'do'}, index=0)
```

the lemmatization works as intended.
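For reference, a self-contained sketch of this workaround that runs on a blank pipeline (simplified to the bare form "doin" so no trained model is needed; with a trained `en_core_web_*` pipeline you would target "doin'" as in the snippet above):

```python
import spacy

# Sketch of the attribute_ruler workaround on a blank English pipeline.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# index=0 applies the attrs to the first (and only) token of the match.
ruler.add(patterns=[[{"LOWER": "doin"}]], attrs={"LEMMA": "do"}, index=0)

doc = nlp("What are you doin these days?")
print([(t.text, t.lemma_) for t in doc])
```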
Ah, thanks for the note about that. My intention was that `NORM` would help here with doin' vs. doin’, but the tokenizer exceptions are more complicated than I remembered. I'll just revert all the `NORM` changes back to `LOWER` for the next model release so it goes back to the v3.6.0 behavior. (If these changes are causing problems for you, it should work fine to use the v3.6.0 `en_core*` models with v3.7.0.)
Thanks for the help, the v3.6.0 `en_core*` models indeed work better for my task.
I just went ahead and published updated v3.7 `en` models, which should now be available through `spacy download`.
This issue has been automatically closed because it was answered and there was no follow-up discussion.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
It seems like some of the tokenizer exception rules for the English models are not being applied. In the example below, "yall" is correctly lemmatized as "you" + "all", but "doin" is lemmatized as "doin".
How to reproduce the behaviour
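(The original snippet was not preserved in this thread; the following is a hypothetical reconstruction of the reported behaviour, assuming a trained `en_core_web_sm` pipeline has been installed with `python -m spacy download en_core_web_sm`.)

```python
import spacy

# Hypothetical reconstruction -- the issue's original snippet was not
# preserved. Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for text in ["yall", "doin"]:
    doc = nlp(text)
    print(text, "->", [(t.text, t.lemma_) for t in doc])

# Reported: "yall" comes back as "you" + "all", while "doin" comes
# back unchanged as "doin" instead of "do".
```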
Output
Environment