lutz-100worte opened this issue 2 years ago
Thanks for the examples! You're right that this basically comes down to the training data, but we'd also like to explore combining this lemmatizer with other lemmatization approaches and lexical resources to improve cases like this.
I imagine that these examples are OOV, and I do not think the errors come from errors in the training data. The new trainable lemmatizer generalizes from examples with similar morphology; what it may need is more data, with more varied examples, so that it can generalize better.
@jmyerston - I did not mean to imply that there were errors in the training data; I don't have reason to believe that. With "better" I meant mostly more data (while possibly overrepresenting irregular cases to give the model the chance to learn them).
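For anyone who wants to experiment along those lines, here is a minimal sketch of training the edit tree lemmatizer on extra examples; the sentences, lemma annotations, and epoch count are made up for illustration:

import spacy
from spacy.training import Example

# Blank Italian pipeline with the edit tree lemmatizer
# ("trainable_lemmatizer" is the registered factory name since spaCy 3.3).
nlp = spacy.blank("it")
nlp.add_pipe("trainable_lemmatizer")

# Hypothetical training data; irregular verbs like "rimanere" could be
# deliberately overrepresented so the model learns their edit trees.
train_data = [
    ("Rimangono delle macchie", {"lemmas": ["rimanere", "di il", "macchia"]}),
    ("Rimasi con loro", {"lemmas": ["rimanere", "con", "loro"]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data
]

optimizer = nlp.initialize(lambda: examples)
for _ in range(20):  # arbitrary number of epochs for this sketch
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)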
Instead of opening a new issue, I'd rather add our comments here. We ran into the same problem with the Italian lemmatizer since spaCy v.3.3 (which introduced the EditTreeLemmatizer).
Basically, the behaviour is quite unpredictable, mainly for verbs. A few examples follow, all using the same phrase ("Rimangono delle macchie", i.e. "some stains remain") with different versions and models; I can provide many other examples:
nlp = spacy.load("it_core_news_sm") # Same results with it_core_news_md
doc = nlp("Rimangono delle macchie")
for t in doc: print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")
# Output (wrong)
Rimangono Rimangono VERB Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
delle di il ADP Definite=Def|Gender=Fem|Number=Plur|PronType=Art
macchie macchia NOUN Gender=Fem|Number=Plur
The morphological features are correct, but the lemma is not the verb's infinitive form (i.e., rimanere).
nlp = spacy.load("it_core_news_sm")
doc = nlp("Rimangono delle macchie")
for t in doc: print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")
# Output (correct)
Rimangono Rimanere VERB Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
delle di il ADP Definite=Def|Gender=Fem|Number=Plur|PronType=Art
macchie macchia NOUN Gender=Fem|Number=Plur
nlp = spacy.load("it_core_news_md")
doc = nlp("Rimangono delle macchie")
for t in doc: print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")
# Output (wrong)
Rimangono Rimangono VERB Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin
delle di il ADP Definite=Def|Gender=Fem|Number=Plur|PronType=Art
macchie macchia NOUN Gender=Fem|Number=Plur
Any suggestions? Is there any way we can help?
Thanks in advance.
The case that @vieriemiliani described is part of a bigger issue: the new EditTreeLemmatizer struggles to produce correct lemmas for capitalized words, which becomes a big problem for sentence-initial words. Before running the EditTreeLemmatizer, sentence-initial words should be lowercased (or the model retrained to be robust to capitalized words).
import spacy
nlp = spacy.load("it_core_news_sm")
texts = [
    "Rimasi con loro diversi giorni e celebrammo insieme la Commemorazione.",
    "Requisiti supplementari in materia di informazioni sul prodotto relative alle lampade a LED intese a sostituire lampade fluorescenti senza alimentatore integrato",
    "Scegli l'oggetto nella lista a cui vuoi assegnare il tasto di scelta rapida.",
]
for text in texts:
    print("-" * 3)
    for t in nlp(text):
        print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")
Output
Rimasi Rimasi VERB Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin
con con ADP
loro loro PRON Number=Plur|Person=3|PronType=Prs
diversi diverso DET Gender=Masc|Number=Plur|PronType=Ind
giorni giorno NOUN Gender=Masc|Number=Plur
e e CCONJ
celebrammo celebrare VERB Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin
insieme insieme ADV
la il DET Definite=Def|Gender=Fem|Number=Sing|PronType=Art
Commemorazione Commemorazione NOUN Gender=Fem|Number=Sing
. . PUNCT
---
Requisiti requisite NOUN Gender=Masc|Number=Plur
supplementari supplementare ADJ Number=Plur
in in ADP
materia materia NOUN Gender=Fem|Number=Sing
di di ADP
informazioni informazione NOUN Gender=Fem|Number=Plur
sul su il ADP Definite=Def|Gender=Masc|Number=Sing|PronType=Art
prodotto prodotto NOUN Gender=Masc|Number=Sing
relative relativo ADJ Gender=Fem|Number=Plur
alle a il ADP Definite=Def|Gender=Fem|Number=Plur|PronType=Art
lampade lampada NOUN Gender=Fem|Number=Plur
a a ADP
LED LED PROPN
intese intendere VERB Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part
a a ADP
sostituire sostituire VERB VerbForm=Inf
lampade lampada NOUN Gender=Fem|Number=Plur
fluorescenti fluorescente ADJ Number=Plur
senza senza ADP
alimentatore alimentatore NOUN Gender=Masc|Number=Sing
integrato integrato ADJ Gender=Masc|Number=Sing
---
Scegli Scegli VERB Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
l' il DET Definite=Def|Number=Sing|PronType=Art
oggetto oggetto NOUN Gender=Masc|Number=Sing
nella in il ADP Definite=Def|Gender=Fem|Number=Sing|PronType=Art
lista lista NOUN Gender=Fem|Number=Sing
a a ADP
cui cui PRON PronType=Rel
vuoi volere AUX Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
assegnare assegnare VERB VerbForm=Inf
il il DET Definite=Def|Gender=Masc|Number=Sing|PronType=Art
tasto tasto NOUN Gender=Masc|Number=Sing
di di ADP
scelta scelta NOUN Gender=Fem|Number=Sing
rapida rapido ADJ Gender=Fem|Number=Sing
. . PUNCT
import spacy
nlp = spacy.load("it_core_news_sm")
lowercased_text = [
    "rimasi con loro diversi giorni e celebrammo insieme la commemorazione.",
    "requisiti supplementari in materia di informazioni sul prodotto relative alle lampade a led intese a sostituire lampade fluorescenti senza alimentatore integrato",
    "scegli l'oggetto nella lista a cui vuoi assegnare il tasto di scelta rapida.",
]
for text in lowercased_text:
    print("-" * 3)
    for t in nlp(text):
        print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")
Output
---
rimasi rimarere VERB Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin
con con ADP
loro loro PRON Number=Plur|Person=3|PronType=Prs
diversi diverso DET Gender=Masc|Number=Plur|PronType=Ind
giorni giorno NOUN Gender=Masc|Number=Plur
e e CCONJ
celebrammo celebrare VERB Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin
insieme insieme ADV
la il DET Definite=Def|Gender=Fem|Number=Sing|PronType=Art
commemorazione commemorazione NOUN Gender=Fem|Number=Sing
. . PUNCT
---
requisiti requisito NOUN Gender=Masc|Number=Plur
supplementari supplementare ADJ Number=Plur
in in ADP
materia materia NOUN Gender=Fem|Number=Sing
di di ADP
informazioni informazione NOUN Gender=Fem|Number=Plur
sul su il ADP Definite=Def|Gender=Masc|Number=Sing|PronType=Art
prodotto prodotto NOUN Gender=Masc|Number=Sing
relative relativo ADJ Gender=Fem|Number=Plur
alle a il ADP Definite=Def|Gender=Fem|Number=Plur|PronType=Art
lampade lampada NOUN Gender=Fem|Number=Plur
a a ADP
led Led NOUN Gender=Fem|Number=Plur
intese intesa NOUN Gender=Fem|Number=Plur
a a ADP
sostituire sostituire VERB VerbForm=Inf
lampade lampada NOUN Gender=Fem|Number=Plur
fluorescenti fluorescente ADJ Number=Plur
senza senza ADP
alimentatore alimentatore NOUN Gender=Masc|Number=Sing
integrato integrato ADJ Gender=Masc|Number=Sing
---
scegli scegliere VERB Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
l' il DET Definite=Def|Number=Sing|PronType=Art
oggetto oggetto NOUN Gender=Masc|Number=Sing
nella in il ADP Definite=Def|Gender=Fem|Number=Sing|PronType=Art
lista lista NOUN Gender=Fem|Number=Sing
a a ADP
cui cui PRON PronType=Rel
vuoi volere AUX Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
assegnare assegnare VERB VerbForm=Inf
il il DET Definite=Def|Gender=Masc|Number=Sing|PronType=Art
tasto tasto NOUN Gender=Masc|Number=Sing
di di ADP
scelta scelta NOUN Gender=Fem|Number=Sing
rapida rapido ADJ Gender=Fem|Number=Sing
. . PUNCT
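Building on the comparison above, a minimal sketch of the lowercasing workaround could look like this; it naively assumes the lowercased text tokenizes exactly like the original, which is not guaranteed, and the helper name is my own:

import spacy

nlp = spacy.load("it_core_news_sm")

def lemmas_with_lowercase_fallback(text):
    # Lemmatize the text as-is and in lowercased form.
    doc = nlp(text)
    doc_lower = nlp(text.lower())
    # Naive assumption: both variants produce the same number of tokens.
    if len(doc) != len(doc_lower):
        return [t.lemma_ for t in doc]
    lemmas = []
    for t, t_lower in zip(doc, doc_lower):
        # If the model just echoed a capitalized token back as its own lemma,
        # prefer the lemma predicted for the lowercased counterpart.
        if t.lemma_ == t.text and t.text[:1].isupper():
            lemmas.append(t_lower.lemma_)
        else:
            lemmas.append(t.lemma_)
    return lemmas

print(lemmas_with_lowercase_fallback("Rimasi con loro diversi giorni."))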
I was curious about how spaCy and some existing models would perform on the sentences given in the issue description, so below is a short script that finds the lemma for each sentence using various models. Perhaps this will be useful:
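The script itself is not reproduced here; a minimal reconstruction, assuming it pairs each sentence with a target word and prints one markdown table row per sentence (the pairs, models list, and formatting are my guesses from the table below), might look like:

import spacy

models = ["de_core_news_sm", "de_dep_news_trf", "de_core_news_md", "de_core_news_lg"]
# Hypothetical (sentence, target word) pairs from the issue description.
cases = [
    ("Wir Königinnen dürfen nicht nach unsen Herzen wählen ...", "Königinnen"),
    ("Leise , du störst mich .", "störst"),
]

nlps = {name: spacy.load(name) for name in models}
print("sentence |", " | ".join(models), "|")
print("---|" * (len(models) + 1))
for sentence, word in cases:
    row = []
    for name in models:
        doc = nlps[name](sentence)
        # Take the lemma of the first token matching the target word.
        row.append(next((t.lemma_ for t in doc if t.text == word), "?"))
    print(sentence, "|", " | ".join(row), "|")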
Output:
sentence | de_core_news_sm | de_dep_news_trf | de_core_news_md | de_core_news_lg |
---|---|---|---|---|
Wir Königinnen dürfen nicht nach unsen Herzen wählen ... | Königin | Königin | Königinne | Königinne |
Du kannst froh sein , wenn du nicht Bartgesicht Kennedy verlierst ! | Kannst | kannen | kannst | kannst |
Leise , du störst mich . | störst | stören | störsen | störn |
Du sorgst dich um mich ? | sorgen | sorgen | sorgstn | sorgen |
Du überzeugst uns durch deine analytischen und konzeptionellen Fähigkeiten | überzeugen | überzeugen | überzeugstn | überzeugsen |
Weiterhin erfüllst Du folgende Anforderungen : | erfüllst | erfüllen | erfüllen | erfüllsen |
Du stärkst Selbstorganisation und Eigenverantwortlichkeit deines Teams . | stärken | stärken | stärkst | stärksen |
Er zitterte vor Sorge . | zitteren | zittern | zitteren | zitteren |
Entschuldigung , dass ich Sie solange aufhalte , aber ... | aufhalt | aufhalten | aufhalen | aufhalten |
Der Gärtner , den sie hatten , verstünde nichts . | verstünde | verstehen | verstünden | verstünden |
So etwas lächerliches zu erfinden , ich schäme mich für Sie . | lächerlich | lächerliche | lächerlicher | lächerlicher |
Du kümmerst Dich nach Absprache mit um unsere Social Media Tools . | kümmerst | kümmeren | kümmerst | kümmeren |
Und du in meinen Träumen . | Träum | Traum | Traum | Traum |
Aber ich merke nichts davon , dass du mit mir ausgehst . | ausgehst | ausgehen | ausgehst | ausgehen |
macOS Ventura, Python 3.11.3, spacy==3.6.1
For some context, here was the master issue for problems with the lookup-based German lemmatizer: https://github.com/explosion/spaCy/issues/2486. And here was the announcement that German, among other languages, would be prioritized for improving lemmatization: https://github.com/explosion/spaCy/issues/2668. Since version 3.3.0, the default lemmatizer for German is an edit tree lemmatizer, and lemma accuracy went from 73.43% in the medium pipeline in version 3.0.0 to 97.71% in 3.3.0, which is, of course, amazing.
To start documenting its shortcomings, I list some error cases. Of course, this new lemmatizer is trainable, and I could just go ahead and train a new version myself, but I don't intend to do that. I hope someone else can benefit from looking into these specific errors; they may reveal fixable patterns that can be addressed with a "better" training data set.
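For anyone who wants to collect such cases systematically, here is a minimal sketch that audits a pipeline against hand-picked expected lemmas; the model choice, the test sentence, and the expected values are my own assumptions, not part of the original report:

import spacy

# Any of the German pipelines from the table above would work here.
nlp = spacy.load("de_core_news_md")

# Hypothetical (token text -> expected lemma) pairs to audit.
expected = {
    "Königinnen": "Königin",
    "störst": "stören",
    "erfüllst": "erfüllen",
}

doc = nlp("Du störst mich und erfüllst die Anforderungen der Königinnen .")
for t in doc:
    gold = expected.get(t.text)
    if gold is not None and t.lemma_ != gold:
        print(f"MISMATCH: {t.text!r} -> {t.lemma_!r} (expected {gold!r})")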