explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Some tokenizer exceptions not being applied #13126

Closed DavidRamosSal closed 10 months ago

DavidRamosSal commented 11 months ago

It seems like some of the tokenizer exception rules for the English models are not being applied. In the example below, "yall" is correctly split and lemmatized as "you" + "all", but "doin" is lemmatized as "doin".

How to reproduce the behaviour

import spacy

nlp = spacy.load("en_core_web_md")

string = "How are yall doin?"

for tok in nlp(string):
    print(tok.text, tok.lemma_)

Output

How how
are be
y you
all all
doin doin
? ?

Environment

adrianeboyd commented 11 months ago

In spaCy v2 the tokenizer and lemmatizer exceptions were both maintained as part of the tokenizer, but in v3 the tokenizer exceptions live in the tokenizer settings and the lemmatizer exceptions are handled by the attribute_ruler. If you want to see all the details, look at the rules in nlp.tokenizer.rules and nlp.get_pipe("attribute_ruler").patterns. If you'd like, you can customize any of these rules/patterns in the en_core_web_* pipelines as necessary for your task.
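
For reference, a minimal inspection sketch (assuming en_core_web_md is installed, and that attribute_ruler patterns come back as dicts with "patterns"/"attrs"/"index" keys; the "doin" filter is just an illustrative way to narrow the output):

import spacy

nlp = spacy.load("en_core_web_md")

# Tokenizer exceptions: a dict mapping each exception string to its token splits.
print(nlp.tokenizer.rules.get("yall"))
print(nlp.tokenizer.rules.get("doin'"))

# Lemma overrides: patterns stored on the attribute_ruler component.
ruler = nlp.get_pipe("attribute_ruler")
for entry in ruler.patterns:
    # Each entry is a dict with "patterns", "attrs" and "index" keys.
    if "doin" in str(entry["patterns"]).lower():
        print(entry)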

The existing lemmatizer exception only applies to "doin'" and not "doin". Looking at the details again, though, there are some mistakes in the current rules, which set the lemma to "doing" instead of "do", so we'll get that updated for the next model release (probably v3.8.0).

DavidRamosSal commented 11 months ago

The existing lemmatizer exception for "doin'" is not working for me either. This must be related to #13098 because after adding

nlp.get_pipe("attribute_ruler").add(patterns = [[{'LOWER': "doin'"}]], attrs= {'LEMMA': 'do'}, index= 0)

the lemmatization works as intended.
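
For completeness, a small sketch (again assuming en_core_web_md) that applies this workaround and re-runs the original example; with the override in place, "doin'" should lemmatize to "do":

import spacy

nlp = spacy.load("en_core_web_md")

# Register the lemma override before processing any text.
nlp.get_pipe("attribute_ruler").add(patterns=[[{"LOWER": "doin'"}]], attrs={"LEMMA": "do"}, index=0)

for tok in nlp("How are yall doin'?"):
    print(tok.text, tok.lemma_)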

adrianeboyd commented 11 months ago

Ah, thanks for the note about that. My intention was that NORM would help here with doin' vs. doin’, but the tokenizer exceptions are more complicated than I remembered. I'll just revert all the NORM changes back to LOWER for the next model release so it goes back to the v3.6.0 behavior. (If these changes are causing problems for you, it should work fine to use the v3.6.0 en_core* models with v3.7.0.)
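
To illustrate the straight vs. curly apostrophe point, here is a hedged sketch comparing LOWER and NORM on the two spellings (exact output depends on the installed model's tokenizer exceptions and norm tables):

import spacy

nlp = spacy.load("en_core_web_md")

# "doin'" with a straight apostrophe vs. the same word with a curly one (U+2019).
for text in ["doin'", "doin\u2019"]:
    for tok in nlp(text):
        print(repr(tok.text), tok.lower_, tok.norm_, tok.lemma_)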

DavidRamosSal commented 11 months ago

Thanks for the help; the v3.6.0 en_core* models do indeed work better for my task.

adrianeboyd commented 10 months ago

I just went ahead and published updated v3.7 en models, which should now be available through spacy download.

github-actions[bot] commented 10 months ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 9 months ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.