explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Problems and errors in German lemmatizer #2486

Closed · disimone closed this issue 6 years ago

disimone commented 6 years ago

How to reproduce the behaviour

import spacy
nlp = spacy.load('de')
test = nlp.tokenizer('die Versicherungen') # The insuranceS
for t in test:
    print(t,t.lemma_)
[output] die der 
[output] Versicherungen Versicherung

test = nlp.tokenizer('Die Versicherungen') # The insuranceS
for t in test:
    print(t,t.lemma_)
[output] Die Die
[output] Versicherungen Versicherung

test = nlp.tokenizer('die versicherungen') # The insuranceS
for t in test:
    print(t,t.lemma_)
[output] die der
[output] versicherungen versicherungen


Hi all,

I hope the code snippet exemplifies the problem clearly enough.

Basically, I fail to see how the German lemmatization should be used.

Nouns are only lemmatized if they are capitalized, and all other text elements are only lemmatized if they are lower case. So turning all words to lower() means throwing away all noun lemmas. Trusting the input to have proper capitalization means losing every case where a non-noun is at the beginning of a sentence (and hence not lower case).
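
The best workaround I can come up with is to query both casings at lookup time. A minimal sketch (assuming the table is importable as spacy.lang.de.lemmatizer.LOOKUP, which is how spaCy v2.0 ships it):

from spacy.lang.de.lemmatizer import LOOKUP  # lookup dict shipped with spaCy v2.0

def lemma_any_case(word):
    # Try the surface form first, then the other casings, before giving up.
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in LOOKUP:
            return LOOKUP[candidate]
    return word

print(lemma_any_case("Die"))             # falls back to "die" -> "der"
print(lemma_any_case("versicherungen"))  # tries "Versicherungen" -> "Versicherung"

But that blurs noun/verb pairs like "leben" (to live) vs. "Leben" (the life), so it doesn't feel like a real solution.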

How do people actually use this in a real use-case?

Thanks for your help,

Andrea.

DuyguA commented 6 years ago

Hello all,

I checked the German lemmatizer file, https://raw.githubusercontent.com/explosion/spaCy/master/spacy/lang/de/lemmatizer.py

Here, the string "versicherungen" does not exist as a key, hence it goes unrecognized.
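
To illustrate (assuming you import the table as the plain dict LOOKUP from that file, as in spaCy v2.0):

from spacy.lang.de.lemmatizer import LOOKUP  # the lookup dict from the file linked above

print("Versicherungen" in LOOKUP)  # True  -> lemmatized to "Versicherung"
print("versicherungen" in LOOKUP)  # False -> the token falls through unchanged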

The thing with German text is that almost all NLP applications depend on capitalization being written correctly (for some words, getting uppercase/lowercase wrong can mess up statistical algorithms).

What about this idea, @ines: can we add the lower-case versions of "safe" nouns to the lemmatizer file as well?

Cheers, Duygu.

disimone commented 6 years ago

Hi @DuyguA,

indeed, my main problem is not with Versicherungen (which exists in the lemmatizer lookup table with the correct capitalization), but the fact that "Die" is not recognized, while "die" is. In general, every time a verb/adjective/pronoun/article is at the beginning of a sentence, it will not be recognized, because the lemmatizer only knows it in lower-case.

And of course if I lower() everything, I lose all the nouns, as you pointed out. The same if I lower() only the first word of a sentence, since from time to time nouns will be there too...

It seems to me like the only correct solution compatible with the current lookup-based approach would be to add all verbs/pronouns/articles/adjectives to the lookup both with and without capitalization, and leave the nouns only with capitalization. Basically: all words in the present lookup that are not capitalized must be duplicated in their capitalized version; the corresponding lemmas can stay lower-case. Those that are already capitalized stay as they are. One may have to take care of words that are both verbs and nouns depending on the capitalization ("leben" to live, "Leben" the life). Of course this would increase the size of the lookup, but better a larger lemmatizer that one can use than a smaller unusable one :)
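
A rough sketch of that duplication (hedged: it assumes the table is the plain dict LOOKUP from spacy/lang/de/lemmatizer.py, as in spaCy v2.0):

from spacy.lang.de.lemmatizer import LOOKUP  # plain dict shipped with spaCy v2.0

extended = dict(LOOKUP)
for word, lemma in LOOKUP.items():
    if word.islower():
        # Add a capitalized duplicate, but never overwrite an existing entry,
        # so noun entries like "Leben" keep their own lemma.
        extended.setdefault(word.capitalize(), lemma)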

Jm2c,

Andrea.

DuyguA commented 6 years ago

Oh sorry, I understood it all wrong.

"ein", die", "der", "das".... are always articles with or without capitalization. I think their uppercase forms can be added to the lemmatizer even without messing with the nouns and making the lemmatizer file bigger.

ines commented 6 years ago

What about this idea, @ines: can we add the lower-case versions of "safe" nouns to the lemmatizer file as well?

Yes, this should be no problem, so if you want to submit a PR, that would be cool 😃

The lookup lemmatizers aren't great, and we're hoping that we'll be able to replace them with a rule-based lemmatizer like the English one soon. There have been a few other issues in that area as well (mostly with the rule-based lemmatizer and especially with German), so I'm worried that there might even be a subtle bug somewhere (see #2368 for example) 😩 So yeah, we can't wait to give the lemmatizers an overhaul.

DuyguA commented 6 years ago

I don't know why the German lemmatizer has this many issues. Actually, a word + POS tag combo lookup correctly covers almost all words.

Why not, I can prepare a new file, but I'm afraid of making the lemmatizer file too big 😶

Coming to the design issues, what about allowing function calls as well as lookups? Some languages such as Turkish and Finnish can't be covered by a lookup table anyway.
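
Just to sketch what I mean (all names here are hypothetical, this is not spaCy's actual API): a per-language lemmatizer could consult the lookup table first and fall back to a callable:

def make_lemmatizer(lookup, rules_fn=None):
    # lookup: dict mapping surface form -> lemma
    # rules_fn: optional callable fallback, e.g. suffix rules for Turkish or Finnish
    def lemmatize(word, pos=None):
        if word in lookup:
            return lookup[word]
        if rules_fn is not None:
            return rules_fn(word, pos)
        return word
    return lemmatize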

DuyguA commented 6 years ago

Btw @disimone, if you need an analyzer + lemmatizer, you can check out DEMorphy. If you happen to need a list of German words with possible analyses, you can also check out german-morph-dicts. You can always mail me if you need German resources in general. My company is happy to support German language processing and help text miners.

Jean-Zombie commented 6 years ago

Hey there. I hooked the TreeTagger into the pipeline to shorten the waiting time until spaCy's German lemmatizer catches up ;-). This is how:

import spacy
import treetaggerwrapper  # the TreeTagger binary needs to be installed as well!
from spacy.tokens import Token  # needed for Token.set_extension below

nlp = spacy.load("de_core_news_sm")
# We set a custom attribute on spaCy's Token class to store the lemma in later on.
# The attribute can be accessed by calling '._.lemma' on a token.
Token.set_extension("lemma", default="", force=True)

# Instantiate the tagger once, rather than once per document.
tagger = treetaggerwrapper.TreeTagger(TAGLANG="de")

# our 'custom' lemmatizer to be added to spaCy's pipeline
def lemmatizer(doc):
    for token in doc:
        try:
            # TreeTagger returns "word\tPOS\tlemma"; keep the lemma column
            # on the custom extension.
            token._.lemma = tagger.tag_text(token.text)[0].split("\t")[2]
        except Exception:
            pass
    return doc

We only need to add our custom lemmatizer to spaCy's pipeline now:

nlp.add_pipe(lemmatizer)

Et voilà:

doc = nlp("Die Versicherungen sind zu teuer.")
for w in doc:
    print(w._.lemma)
die
Versicherung
sein
zu
teuer
.

ines commented 6 years ago

Making this the master issue for everything related to the German lemmatizer, so copying over the other comments and test cases. We're currently planning out various improvements to the rule-based lemmatizer, and strategies to replace the lookup tables with rules wherever possible.

#2368

doc = nlp(u'Ich sehe Bäume')
for token in doc:
    print(token.text, token.lemma, token.lemma_, token.pos_)
    print("has_vector:", token.has_vector)

doc = nlp("Diese Auskünfte muss ich dir nicht geben.")
[token.lemma_ for token in doc]
# ['Diese', 'Auskunft', 'muss', 'ich', 'sich', 'nicht', 'geben', '.']

#2120

The German lemmatizer currently only uses a lookup table – that's fine for some cases, but obviously not as good as a solution that takes part-of-speech tags into account.

You might want to check out #2079, which discusses a solution for implementing a custom lemmatizer in French – either based on spaCy's English lemmatization rules, or by implementing a third-party library via a custom pipeline component.

One quick note on the expected lemmatization / tokenization:

=> unter, der, Tisch, spinnen, klapperdürr, Holzwurm

spaCy's German tokenization rules currently don't split contractions like "unterm". One reason is that spaCy will never modify the original ORTH value of the tokens – so "unterm" would have to become ["unter", "m"], where the token "m" will have the NORM "dem". Those single-letter tokens can easily lead to confusion, which is why we've opted to not produce them for now. But if your treebank or expected tokenization requires contractions to be split, you can easily add your own special case rules:

import spacy
from spacy.symbols import ORTH, NORM, LEMMA

nlp = spacy.load('de')

special_case = [{ORTH: 'unter'}, {ORTH: 'm', NORM: 'dem', LEMMA: 'der'}]
nlp.tokenizer.add_special_case('unterm', special_case)
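
With that special case registered, a quick check would look like this (expected behaviour under these rules, not verified output):

doc = nlp("Er liegt unterm Tisch.")
print([t.text for t in doc])   # expected: ['Er', 'liegt', 'unter', 'm', 'Tisch', '.']
print([t.norm_ for t in doc])  # the 'm' token should carry the NORM 'dem'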

We don't have an immediate plan or timeline yet, but we'd definitely love to move from lookup lemmatization to rule-based or statistical lemmatization in the future. (Shipping the tables with spaCy really adds a lot of bloat and it comes with all kinds of other problems.)

Laubeee commented 6 years ago

Not sure if this is already covered by other issues, but here is something I found with 2.0.11: some kinds of words that look similar to verb forms get lemmatized to the verb infinitive.

import spacy
nlp = spacy.load('de')
[(t.text, t.pos_, t.lemma_) for t in nlp("Ein Beispiel in Form von Code.")]
Out[4]: 
[('Ein', 'DET', 'Ein'),
 ('Beispiel', 'NOUN', 'Beispiel'),
 ('in', 'ADP', 'in'),
 ('Form', 'NOUN', 'formen'),
 ('von', 'ADP', 'von'),
 ('Code', 'NOUN', 'Code'),
 ('.', 'PUNCT', '.')]

"Form" seems to be lemmatized thinking it is a verb resulting in "formen" although its correctly tagged as noun (and also "form" is not a correct conjugation of the verb "formen"

The same happens for nouns like "Wert" and "Empfang", and it seems other words are affected too, like the adjective "näher" (comparative of "nah"):

[(t.text, t.pos_, t.lemma_) for t in nlp("Um die Problematik näher zu erläutern")]
Out[5]: 
[('Um', 'SCONJ', 'Um'),
 ('die', 'DET', 'der'),
 ('Problematik', 'NOUN', 'Problematik'),
 ('näher', 'ADJ', 'nähern'),
 ('zu', 'PART', 'zu'),
 ('erläutern', 'VERB', 'erläutern')]

DuyguA commented 6 years ago

Your observation is correct: tokens are looked up by the token text alone, not by the (token, POS tag) pair.

Laubeee commented 6 years ago

Really? It does take the parameter... https://spacy.io/api/lemmatizer#call

Also, then again, I don't think there is a correct use case for lemmatizing "form" to "formen", since the term "form" can never be a variant of the verb "formen". See http://konjugator.reverso.net/konjugation-deutsch-verb-formen.html

So I guess that means the lookup entry "form -> formen" is wrong and should be deleted?

DuyguA commented 6 years ago

Here is the lemmatizer file, see the keys:

https://github.com/explosion/spaCy/blob/master/spacy/lang/de/lemmatizer.py

Of course "Form->formen" is not possible in general. Keep in mind that most regular verb forms are generated automatically. I can delete this entry.

I'm interested in a general revision for the German lemmatizer. Do you also have other issues with the lemmatizer?

Laubeee commented 6 years ago

Well, as already stated, it happens with various words. If you don't want to remove all the cases by hand, maybe the generation routine needs to be adjusted?

For a general revision, making the lookup POS-sensitive (e.g. using the POS tag as a prefix on all keys, or at least on those POS where it makes sense) would obviously resolve this problem quite effectively.
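
A minimal sketch of a POS-sensitive lookup (the key scheme here is made up, not spaCy's actual format):

# Hypothetical POS-aware table: keys are (lower-cased form, coarse POS tag).
POS_LOOKUP = {
    ("form", "NOUN"): "Form",
    ("form", "VERB"): "formen",
    ("näher", "ADJ"): "nah",
}

def lemmatize(word, pos):
    # Fall back to the surface form if the (word, POS) pair is unknown.
    return POS_LOOKUP.get((word.lower(), pos), word)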

Other than the issues already stated in this thread, it seems to work quite OK. The TreeTagger obviously performs better on words that are not in the lookup file, so for now I suppose I'll use a combination of both ;) For that matter (adding a backup lemmatizer), it would be cool to have a way to tell whether the lookup was successful or not.
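
A crude way to check for a lookup miss (hedged: it just tests membership in the lookup dict directly, assuming spacy.lang.de.lemmatizer.LOOKUP as in spaCy v2.0):

from spacy.lang.de.lemmatizer import LOOKUP  # lookup dict shipped with spaCy v2.0

def lemma_with_backup(token, backup_fn):
    # Trust spaCy's lookup lemma if the surface form is in the table,
    # otherwise fall back to an external lemmatizer (e.g. the TreeTagger wrapper above).
    if token.text in LOOKUP:
        return token.lemma_
    return backup_fn(token.text)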

DuyguA commented 6 years ago

I think lookup with POS tag will solve the majority of the issues.

Btw, if you want to experiment with my lemmatizer design, here it is:

https://github.com/DuyguA/DEMorphy

You can find the list of accompanying morphological dictionaries in the repo as well.

In case you need German language resources, you can always contact me and my colleagues at Parlamind. We're more than happy to help.

ines commented 6 years ago

Merging this with #2668!

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.