explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Problems and errors in new German lemmatizer (since 3.3.0) #10953

Open lutz-100worte opened 2 years ago

lutz-100worte commented 2 years ago

For some context, here is the master issue for problems with the lookup-based lemmatizer for German: https://github.com/explosion/spaCy/issues/2486 And here is the announcement that German, among other languages, would be prioritized for lemmatization improvements: https://github.com/explosion/spaCy/issues/2668 Since version 3.3.0, the default lemmatizer for German has been an edit tree lemmatizer. Accuracy went from 73.43% in the medium pipeline in version 3.0.0 to 97.71% in 3.3.0, which is, of course, amazing.

To start documenting its shortcomings, I list some error cases. Of course, this new lemmatizer is trainable and I could just go ahead and train a new version myself. I don't intend to do that. I hope someone else can benefit from looking into these specific errors, and they may reveal fixable patterns that can be addressed with a "better" training data set.

| sentence | predicted lemma |
| --- | --- |
| Wir Königinnen dürfen nicht nach unsen Herzen wählen ... | königinn |
| Du kannst froh sein , wenn du nicht Bartgesicht Kennedy verlierst ! | kannsen |
| Leise , du störst mich . | störstn |
| Du sorgst dich um mich ? | sorgstn |
| Du überzeugst uns durch deine analytischen und konzeptionellen Fähigkeiten | überzeugstn |
| Weiterhin erfüllst Du folgende Anforderungen : | erfüllsen |
| Du stärkst Selbstorganisation und Eigenverantwortlichkeit deines Teams . | stäreksten (!) |
| Er zitterte vor Sorge . | zitteren |
| Entschuldigung , dass ich Sie solange aufhalte , aber ... | aufhaln (!) |
| Der Gärtner , den sie hatten , verstünde nichts . | verstünden |
| So etwas lächerliches zu erfinden , ich schäme mich für Sie . | lächerlicher |
| Du kümmerst Dich nach Absprache mit um unsere Social Media Tools . | kümmeren |
| Und du in meinen Träumen . | träum |
| Aber ich merke nichts davon , dass du mit mir ausgehst . | aushsen (!!) |

adrianeboyd commented 2 years ago

Thanks for the examples! You're right that this basically comes down to the training data, but we'd also like to explore combining this lemmatizer with other lemmatization approaches and lexical resources to improve cases like this.

jmyerston commented 2 years ago

I imagine that these examples are OOV (out-of-vocabulary), and I do not think the errors come from errors in the training data. The new trainable lemmatizer generalizes from training examples with similar morphology. What it may need is more data, with more examples, so that it can generalize better.
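To illustrate how a rule learned from similar-looking forms can misfire on an unseen word, here is a deliberately simplified sketch. It is not spaCy's edit tree algorithm: `apply_suffix_rule` is a hypothetical helper, and the single "strip -te, add -en" rule is an assumption chosen because it reproduces the `zitteren` error from the issue description.

```python
def apply_suffix_rule(word: str, strip: str, add: str) -> str:
    """Toy lemmatization rule: replace a trailing `strip` with `add`.

    Hypothetical helper for illustration only; spaCy's trainable
    lemmatizer predicts edit trees, which are richer than one suffix swap.
    """
    if strip and word.endswith(strip):
        return word[: len(word) - len(strip)] + add
    return word


# A rule that is right for many regular past-tense forms ...
for form in ["machte", "sagte"]:
    print(form, "->", apply_suffix_rule(form, "te", "en"))

# ... but yields a non-word for "zitterte" (the correct lemma is "zittern").
print("zitterte", "->", apply_suffix_rule("zitterte", "te", "en"))
```

More training examples of verbs in -ern/-eln would, under this reading, give the model a chance to prefer a different rule for such stems.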

lutz-100worte commented 2 years ago

@jmyerston - I did not mean to imply that there were errors in the training data; I don't have reason to believe that. With "better" I meant mostly more data (while possibly overrepresenting irregular cases to give the model the chance to learn them).

vieriemiliani commented 2 years ago

Instead of opening a new issue, I'd rather add our comments here. We have run into the same problem with the Italian lemmatizer since spaCy v3.3, which introduced the EditTreeLemmatizer.

Basically, the behaviour is quite unpredictable, mainly for verbs. A few examples follow, all using the same phrase ("Rimangono delle macchie", i.e. "some stains remain") with different versions and models; I can provide many more.

spaCy v.3.3.0

import spacy
nlp = spacy.load("it_core_news_sm")   # Same results with it_core_news_md
doc = nlp("Rimangono delle macchie")
for t in doc: print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")

# Output:
Rimangono                Rimangono                VERB     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
delle                    di il                    ADP      Definite=Def|Gender=Fem|Number=Plur|PronType=Art
macchie                  macchia                  NOUN     Gender=Fem|Number=Plur

Morphological data are correct but the lemma is not the verb in its infinitive form (i.e., Rimanere).

spaCy v3.4.0

nlp = spacy.load("it_core_news_sm") 
doc = nlp("Rimangono delle macchie")
for t in doc: print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")

# Output (correct)
Rimangono                Rimanere                 VERB     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
delle                    di il                    ADP      Definite=Def|Gender=Fem|Number=Plur|PronType=Art
macchie                  macchia                  NOUN     Gender=Fem|Number=Plur

nlp = spacy.load("it_core_news_md")
doc = nlp("Rimangono delle macchie")
for t in doc: print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")

# Output (wrong)
Rimangono                Rimangono                VERB     Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin
delle                    di il                    ADP      Definite=Def|Gender=Fem|Number=Plur|PronType=Art
macchie                  macchia                  NOUN     Gender=Fem|Number=Plur

Any suggestions? Is there any way we can help?

Thanks in advance.

aflueckiger commented 2 years ago

The case that @vieriemiliani described is part of a bigger issue. The new EditTreeLemmatizer struggles to produce correct lemmas when words are capitalized, which is a big problem for sentence-initial words. Before running the EditTreeLemmatizer, sentence-initial words should be lowercased (or the model should be retrained to handle capitalized words robustly).
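The lowercasing workaround could be sketched as a small preprocessing step. `lowercase_sentence_initial` is a hypothetical helper, not part of spaCy; it assumes the input is a single sentence and it cannot distinguish a sentence-initial proper noun from an ordinary word.

```python
def lowercase_sentence_initial(text: str) -> str:
    """Lowercase the first word of `text` unless it looks like an acronym.

    Hypothetical preprocessing helper, not part of spaCy. It assumes `text`
    is a single sentence and cannot tell proper nouns from ordinary words.
    """
    stripped = text.lstrip()
    if not stripped:
        return text
    first_word = stripped.split(maxsplit=1)[0]
    # Leave all-caps tokens such as "LED" alone.
    if len(first_word) > 1 and first_word.isupper():
        return text
    offset = len(text) - len(stripped)
    return text[:offset] + stripped[0].lower() + stripped[1:]


print(lowercase_sentence_initial("Scegli l'oggetto nella lista."))
print(lowercase_sentence_initial("LED intese a sostituire lampade"))
```

The result can then be passed to `nlp(...)` as in the snippets in this thread. Touching only the first word is meant to avoid the side effects visible in the fully lowercased run below, where `led` and `intese` receive different tags than in the original text.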

spaCy v3.4.0 - regular text

import spacy
nlp = spacy.load("it_core_news_sm")

texts = [
    "Rimasi con loro diversi giorni e celebrammo insieme la Commemorazione.",
    "Requisiti supplementari in materia di informazioni sul prodotto relative alle lampade a LED intese a sostituire lampade fluorescenti senza alimentatore integrato",
    "Scegli l'oggetto nella lista a cui vuoi assegnare il tasto di scelta rapida.",
]

for text in texts:
    print("-"*3)
    for t in nlp(text): 
        print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")

Output

Rimasi                   Rimasi                   VERB     Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin
con                      con                      ADP                      
loro                     loro                     PRON     Number=Plur|Person=3|PronType=Prs
diversi                  diverso                  DET      Gender=Masc|Number=Plur|PronType=Ind
giorni                   giorno                   NOUN     Gender=Masc|Number=Plur
e                        e                        CCONJ                    
celebrammo               celebrare                VERB     Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin
insieme                  insieme                  ADV                      
la                       il                       DET      Definite=Def|Gender=Fem|Number=Sing|PronType=Art
Commemorazione           Commemorazione           NOUN     Gender=Fem|Number=Sing
.                        .                        PUNCT                    
---
Requisiti                requisite                NOUN     Gender=Masc|Number=Plur
supplementari            supplementare            ADJ      Number=Plur     
in                       in                       ADP                      
materia                  materia                  NOUN     Gender=Fem|Number=Sing
di                       di                       ADP                      
informazioni             informazione             NOUN     Gender=Fem|Number=Plur
sul                      su il                    ADP      Definite=Def|Gender=Masc|Number=Sing|PronType=Art
prodotto                 prodotto                 NOUN     Gender=Masc|Number=Sing
relative                 relativo                 ADJ      Gender=Fem|Number=Plur
alle                     a il                     ADP      Definite=Def|Gender=Fem|Number=Plur|PronType=Art
lampade                  lampada                  NOUN     Gender=Fem|Number=Plur
a                        a                        ADP                      
LED                      LED                      PROPN                    
intese                   intendere                VERB     Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part
a                        a                        ADP                      
sostituire               sostituire               VERB     VerbForm=Inf    
lampade                  lampada                  NOUN     Gender=Fem|Number=Plur
fluorescenti             fluorescente             ADJ      Number=Plur     
senza                    senza                    ADP                      
alimentatore             alimentatore             NOUN     Gender=Masc|Number=Sing
integrato                integrato                ADJ      Gender=Masc|Number=Sing
---
Scegli                   Scegli                   VERB     Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
l'                       il                       DET      Definite=Def|Number=Sing|PronType=Art
oggetto                  oggetto                  NOUN     Gender=Masc|Number=Sing
nella                    in il                    ADP      Definite=Def|Gender=Fem|Number=Sing|PronType=Art
lista                    lista                    NOUN     Gender=Fem|Number=Sing
a                        a                        ADP                      
cui                      cui                      PRON     PronType=Rel    
vuoi                     volere                   AUX      Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
assegnare                assegnare                VERB     VerbForm=Inf    
il                       il                       DET      Definite=Def|Gender=Masc|Number=Sing|PronType=Art
tasto                    tasto                    NOUN     Gender=Masc|Number=Sing
di                       di                       ADP                      
scelta                   scelta                   NOUN     Gender=Fem|Number=Sing
rapida                   rapido                   ADJ      Gender=Fem|Number=Sing
.                        .                        PUNCT  

spaCy v3.4.0 - lowercased text

import spacy
nlp = spacy.load("it_core_news_sm")

lowercased_text =  [
    "rimasi con loro diversi giorni e celebrammo insieme la commemorazione.",
    "requisiti supplementari in materia di informazioni sul prodotto relative alle lampade a led intese a sostituire lampade fluorescenti senza alimentatore integrato",
    "scegli l'oggetto nella lista a cui vuoi assegnare il tasto di scelta rapida.",
]

for text in lowercased_text:
    print("-"*3)
    for t in nlp(text): 
        print(f"{t.text:24}", f"{t.lemma_:24}", f"{t.pos_:8}", f"{str(t.morph):16}")

Output

---
rimasi                   rimarere                 VERB     Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin
con                      con                      ADP                      
loro                     loro                     PRON     Number=Plur|Person=3|PronType=Prs
diversi                  diverso                  DET      Gender=Masc|Number=Plur|PronType=Ind
giorni                   giorno                   NOUN     Gender=Masc|Number=Plur
e                        e                        CCONJ                    
celebrammo               celebrare                VERB     Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin
insieme                  insieme                  ADV                      
la                       il                       DET      Definite=Def|Gender=Fem|Number=Sing|PronType=Art
commemorazione           commemorazione           NOUN     Gender=Fem|Number=Sing
.                        .                        PUNCT                    
---
requisiti                requisito                NOUN     Gender=Masc|Number=Plur
supplementari            supplementare            ADJ      Number=Plur     
in                       in                       ADP                      
materia                  materia                  NOUN     Gender=Fem|Number=Sing
di                       di                       ADP                      
informazioni             informazione             NOUN     Gender=Fem|Number=Plur
sul                      su il                    ADP      Definite=Def|Gender=Masc|Number=Sing|PronType=Art
prodotto                 prodotto                 NOUN     Gender=Masc|Number=Sing
relative                 relativo                 ADJ      Gender=Fem|Number=Plur
alle                     a il                     ADP      Definite=Def|Gender=Fem|Number=Plur|PronType=Art
lampade                  lampada                  NOUN     Gender=Fem|Number=Plur
a                        a                        ADP                      
led                      Led                      NOUN     Gender=Fem|Number=Plur
intese                   intesa                   NOUN     Gender=Fem|Number=Plur
a                        a                        ADP                      
sostituire               sostituire               VERB     VerbForm=Inf    
lampade                  lampada                  NOUN     Gender=Fem|Number=Plur
fluorescenti             fluorescente             ADJ      Number=Plur     
senza                    senza                    ADP                      
alimentatore             alimentatore             NOUN     Gender=Masc|Number=Sing
integrato                integrato                ADJ      Gender=Masc|Number=Sing
---
scegli                   scegliere                VERB     Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
l'                       il                       DET      Definite=Def|Number=Sing|PronType=Art
oggetto                  oggetto                  NOUN     Gender=Masc|Number=Sing
nella                    in il                    ADP      Definite=Def|Gender=Fem|Number=Sing|PronType=Art
lista                    lista                    NOUN     Gender=Fem|Number=Sing
a                        a                        ADP                      
cui                      cui                      PRON     PronType=Rel    
vuoi                     volere                   AUX      Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin
assegnare                assegnare                VERB     VerbForm=Inf    
il                       il                       DET      Definite=Def|Gender=Masc|Number=Sing|PronType=Art
tasto                    tasto                    NOUN     Gender=Masc|Number=Sing
di                       di                       ADP                      
scelta                   scelta                   NOUN     Gender=Fem|Number=Sing
rapida                   rapido                   ADJ      Gender=Fem|Number=Sing
.                        .                        PUNCT                    
jzohrab commented 1 year ago

I was curious how spaCy's existing German models would perform on the sentences given in the issue description, so below is a short script that prints the lemma of the marked word in each sentence for several models. Perhaps this will be useful:

Python script

```
import re

import spacy

models = [
    "de_core_news_sm",
    "de_dep_news_trf",
    "de_core_news_md",
    "de_core_news_lg",
]

# The word whose lemma we want is wrapped in ** markers.
sentences = [
    "Wir **Königinnen** dürfen nicht nach unsen Herzen wählen ...",
    "Du **kannst** froh sein , wenn du nicht Bartgesicht Kennedy verlierst !",
    "Leise , du **störst** mich .",
    "Du **sorgst** dich um mich ?",
    "Du **überzeugst** uns durch deine analytischen und konzeptionellen Fähigkeiten",
    "Weiterhin **erfüllst** Du folgende Anforderungen :",
    "Du **stärkst** Selbstorganisation und Eigenverantwortlichkeit deines Teams .",
    "Er **zitterte** vor Sorge .",
    "Entschuldigung , dass ich Sie solange **aufhalte** , aber ...",
    "Der Gärtner , den sie hatten , **verstünde** nichts .",
    "So etwas **lächerliches** zu erfinden , ich schäme mich für Sie .",
    "Du **kümmerst** Dich nach Absprache mit um unsere Social Media Tools .",
    "Und du in meinen **Träumen** .",
    "Aber ich merke nichts davon , dass du mit mir **ausgehst** .",
]

# Markdown table header: one column for the sentence, one per model.
print("| " + " | ".join(["sentence", *models]) + " |")
print("|" + " --- |" * (len(models) + 1))

# Load each pipeline once.
nlps = {m: spacy.load(m) for m in models}


def print_line(sentence):
    # Extract the text between the first pair of ** markers.
    word = re.search(r"\*\*(.+?)\*\*", sentence).group(1)
    newsentence = sentence.replace("**", "")
    lems = []
    for m in models:
        found = [t.lemma_ for t in nlps[m](newsentence) if t.text == word]
        # Fall back to "?" if no token matches the marked word exactly.
        lems.append(found[0] if found else "?")
    print(f"| {sentence} | " + " | ".join(lems) + " |")


for s in sentences:
    print_line(s)
```

Output:

| sentence | de_core_news_sm | de_dep_news_trf | de_core_news_md | de_core_news_lg |
| --- | --- | --- | --- | --- |
| Wir Königinnen dürfen nicht nach unsen Herzen wählen ... | Königin | Königin | Königinne | Königinne |
| Du kannst froh sein , wenn du nicht Bartgesicht Kennedy verlierst ! | Kannst | kannen | kannst | kannst |
| Leise , du störst mich . | störst | stören | störsen | störn |
| Du sorgst dich um mich ? | sorgen | sorgen | sorgstn | sorgen |
| Du überzeugst uns durch deine analytischen und konzeptionellen Fähigkeiten | überzeugen | überzeugen | überzeugstn | überzeugsen |
| Weiterhin erfüllst Du folgende Anforderungen : | erfüllst | erfüllen | erfüllen | erfüllsen |
| Du stärkst Selbstorganisation und Eigenverantwortlichkeit deines Teams . | stärken | stärken | stärkst | stärksen |
| Er zitterte vor Sorge . | zitteren | zittern | zitteren | zitteren |
| Entschuldigung , dass ich Sie solange aufhalte , aber ... | aufhalt | aufhalten | aufhalen | aufhalten |
| Der Gärtner , den sie hatten , verstünde nichts . | verstünde | verstehen | verstünden | verstünden |
| So etwas lächerliches zu erfinden , ich schäme mich für Sie . | lächerlich | lächerliche | lächerlicher | lächerlicher |
| Du kümmerst Dich nach Absprache mit um unsere Social Media Tools . | kümmerst | kümmeren | kümmerst | kümmeren |
| Und du in meinen Träumen . | Träum | Traum | Traum | Traum |
| Aber ich merke nichts davon , dass du mit mir ausgehst . | ausgehst | ausgehen | ausgehst | ausgehen |

Environment: macOS Ventura, Python 3.11.3, spacy==3.6.1
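To compare the models at a glance, the output table can be scored against reference lemmas. This is only a sketch: the `gold` list is my own assumption about the expected lemmas (it does not come from the thread), the predictions are copied by hand from the table, and exact string match is a strict criterion that also counts capitalization differences as errors.

```python
# Reference lemmas: an assumption added for this sketch, not from the thread.
gold = [
    "Königin", "können", "stören", "sorgen", "überzeugen", "erfüllen",
    "stärken", "zittern", "aufhalten", "verstehen", "lächerlich",
    "kümmern", "Traum", "ausgehen",
]

# Predicted lemmas copied column by column from the output table.
predicted = {
    "de_core_news_sm": [
        "Königin", "Kannst", "störst", "sorgen", "überzeugen", "erfüllst",
        "stärken", "zitteren", "aufhalt", "verstünde", "lächerlich",
        "kümmerst", "Träum", "ausgehst",
    ],
    "de_dep_news_trf": [
        "Königin", "kannen", "stören", "sorgen", "überzeugen", "erfüllen",
        "stärken", "zittern", "aufhalten", "verstehen", "lächerliche",
        "kümmeren", "Traum", "ausgehen",
    ],
    "de_core_news_md": [
        "Königinne", "kannst", "störsen", "sorgstn", "überzeugstn", "erfüllen",
        "stärkst", "zitteren", "aufhalen", "verstünden", "lächerlicher",
        "kümmerst", "Traum", "ausgehst",
    ],
    "de_core_news_lg": [
        "Königinne", "kannst", "störn", "sorgen", "überzeugsen", "erfüllsen",
        "stärksen", "zitteren", "aufhalten", "verstünden", "lächerlicher",
        "kümmeren", "Traum", "ausgehen",
    ],
}


def count_correct(lemmas, reference):
    """Count exact (case-sensitive) matches between two aligned lists."""
    return sum(p == g for p, g in zip(lemmas, reference))


scores = {model: count_correct(lemmas, gold) for model, lemmas in predicted.items()}
for model, n in scores.items():
    print(f"{model}: {n}/{len(gold)} correct")
```

With only 14 hand-picked sentences, these counts say little about overall accuracy; they only summarize this particular set of known-hard cases.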