Lemmatizer issue for German words

explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

https://spacy.io

MIT License

Lemmatizer issue for German words #2368

Closed ctrado18 closed 6 years ago

ctrado18 commented 6 years ago

I am confused about the lemmatizer. Take the sentence "Ich sehe Bäume" (I see trees):

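import spacy

# Load the small German model (it must be installed separately)
nlp = spacy.load('de_core_news_sm')
doc = nlp(u'Ich sehe Bäume')

for token in doc:
    # token.lemma is the integer hash; token.lemma_ is the readable string
    print(token.text, token.lemma, token.lemma_, token.pos_)
    print("has_vector:", token.has_vector)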

token.lemma_ is just Bäume. I thought it would be lemmatized to the singular form Baum (tree)?

ines commented 6 years ago

Yes, Baum would definitely be correct here. The German lemmatizer only uses lookup tables (and no rule-based process like the English one). This has some limitations – I've written a bit more about this in my comment on this thread.

Another problem is that spaCy will always decide on one lemma (and won't just give you a bunch of options to choose from). This is convenient, but it also means that the one pick has to be correct. That said, there have definitely been some suspicious reports about lemmatization performance that might indicate a bug.
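To illustrate the two limitations described above, here is a minimal sketch of a pure lookup lemmatizer. The table `LOOKUP` and the function `lookup_lemma` are hypothetical and hand-written for this example; they are not spaCy's actual data or API. The sketch only shows how a missing entry would leave a word unchanged, and it is not a diagnosis of why "Bäume" is returned as-is.

```python
# Hypothetical, hand-written table for illustration only.
# It stores exactly one lemma per surface form.
LOOKUP = {
    "Auskünfte": "Auskunft",
    "sehe": "sehen",
}


def lookup_lemma(word, table=LOOKUP):
    """Return the single lemma stored for `word`, or `word` itself if absent."""
    return table.get(word, word)


lemmas = [lookup_lemma(w) for w in ["Ich", "sehe", "Bäume"]]
print(lemmas)  # ['Ich', 'sehen', 'Bäume']
```

Because there is no rule-based fallback in this sketch, any form missing from the table ("Bäume" here) comes back unchanged, and a form with several possible lemmas can only ever map to the one that was stored.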

In the meantime, you might want to check out the spacy-iwnlp extension by @Liebeck and see how it performs on your use case!

fotisj commented 6 years ago

I agree with ines: the German lemmatizer gives strange results on trivial texts, for example:

doc = nlp("Diese Auskünfte muss ich dir nicht geben.")
[token.lemma_ for token in doc]

results in:

['Diese', 'Auskunft', 'muss', 'ich', 'sich', 'nicht', 'geben', '.']

The lemma for 'muss' should be 'müssen', and 'dir' should not become 'sich'. The same task done by TreeTagger, which usually works OK, looks like this:

['dies', 'Auskunft', 'müssen', 'ich', 'du', 'nicht', 'geben']

So it seems that lemmatization for German in spaCy is not really usable at the moment.

I had a look at spacy-iwnlp. Strangely, it only handles parts of the sentence, so its output for the sentence above looks like this:

[None, ['Auskunft'], ['müssen'], None, None, None, ['geben'], None]

Liebeck commented 6 years ago

@fotisj Yes, spacy-iwnlp only handles nouns, verbs, and adjectives. Pronouns are currently not included since they are more difficult to parse and they are mostly filtered out by stopword lists anyway.

If you want to use lemmas with IWNLP, you need to select one of the lemmas predicted by IWNLP or select the raw text (or its lowercased version) if the lemma is None.
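The selection step described above could be sketched roughly as follows. This works on plain lists shaped like the output quoted earlier in the thread rather than on spaCy tokens; `pick_lemma` is a hypothetical helper, and taking the first candidate is an arbitrary choice made for the example, not a recommendation from IWNLP.

```python
def pick_lemma(text, candidates, lowercase_fallback=False):
    """Choose one lemma from a list of candidates, or fall back to the raw text.

    `candidates` is either None or a list of lemma strings.
    """
    if candidates:
        # Arbitrary choice for this sketch: take the first predicted lemma.
        return candidates[0]
    # No prediction: use the raw text, optionally lowercased.
    return text.lower() if lowercase_fallback else text


tokens = ["Diese", "Auskünfte", "muss", "ich", "dir", "nicht", "geben", "."]
iwnlp_output = [None, ["Auskunft"], ["müssen"], None, None, None, ["geben"], None]

lemmas = [pick_lemma(t, c) for t, c in zip(tokens, iwnlp_output)]
print(lemmas)
# ['Diese', 'Auskunft', 'müssen', 'ich', 'dir', 'nicht', 'geben', '.']
```

Passing `lowercase_fallback=True` would turn the unhandled 'Diese' into 'diese' while leaving the predicted lemmas untouched.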

ines commented 6 years ago

Merging this with the master issue in #2486!
