Closed ctrado18 closed 6 years ago
Yes, 'Baum' would definitely be correct here. The German lemmatizer only uses lookup tables (and no rule-based process like the English one). This has some limitations – I've written a bit more about this in my comment on this thread.
Another problem is that spaCy will always decide on one lemma (and won't just give you a bunch of options to choose from). This is convenient – but it also means that the one pick has to be correct. That said, there have definitely been some suspicious reports around the lemmatization performance that might indicate a bug.
In the meantime, you might want to check out the spacy-iwnlp extension by @Liebeck and see how it performs on your use case!
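To make the lookup-table limitation concrete, here is a minimal toy sketch in plain Python. It is not spaCy's implementation: the `LOOKUP` table and the `lookup_lemma` helper are hypothetical, and the fall-back to the surface form for unknown words is an assumption about how a generic lookup lemmatizer might behave.

```python
# Toy lookup-table lemmatizer (hypothetical; not spaCy's actual code).
# Each surface form maps to exactly one lemma, so there is no way to
# offer alternatives or to use context to disambiguate.
LOOKUP = {
    "Bäume": "Baum",
    "Auskünfte": "Auskunft",
    "muss": "müssen",
}


def lookup_lemma(form, table=LOOKUP):
    """Return the single stored lemma, or the form itself if it is missing."""
    # Assumption: unknown forms are returned unchanged.
    return table.get(form, form)


lemmas = [lookup_lemma(word) for word in ["Ich", "sehe", "Bäume"]]
print(lemmas)
```

In this sketch 'sehe' stays 'sehe' simply because the table has no entry for it, which is the kind of gap a purely table-driven approach cannot recover from.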
I agree with ines: the German lemmatizer gives strange results on trivial texts, for example:
doc = nlp("Diese Auskünfte muss ich dir nicht geben.")
[token.lemma_ for token in doc]
results in:
['Diese', 'Auskunft', 'muss', 'ich', 'sich', 'nicht', 'geben', '.']
The lemma for 'muss' should be 'müssen'. The same task done by TreeTagger, which usually works OK, looks like this:
['dies', 'Auskunft', 'müssen', 'ich', 'du', 'nicht', 'geben']
So it seems that lemmatization for German in spaCy is not really usable at the moment.
I had a look at spacy-iwnlp. Strangely, it only handles parts of the sentence, so its output for the sentence above looks like this:
[None, ['Auskunft'], ['müssen'], None, None, None, ['geben'], None]
@fotisj Yes, spacy-iwnlp only handles nouns, verbs, and adjectives. Pronouns are currently not included since they are more difficult to parse and they are mostly filtered out by stopword lists anyway.
If you want to use lemmas with IWNLP, you need to select one of the lemmas predicted by IWNLP or select the raw text (or its lowercased version) if the lemma is None.
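The selection step described above could look roughly like the following sketch. It works on plain lists copied from the output earlier in this thread rather than on real spaCy tokens. The `pick_lemma` helper is hypothetical, and taking the first candidate is an arbitrary assumption, not a recommendation from IWNLP.

```python
def pick_lemma(text, candidates, lowercase=False):
    """Choose one lemma from IWNLP-style candidates, falling back to the raw text."""
    if candidates:
        # Assumption: simply take the first predicted lemma.
        return candidates[0]
    # No prediction (None): use the raw text, optionally lowercased.
    return text.lower() if lowercase else text


tokens = ["Diese", "Auskünfte", "muss", "ich", "dir", "nicht", "geben", "."]
iwnlp_lemmas = [None, ["Auskunft"], ["müssen"], None, None, None, ["geben"], None]

lemmas = [pick_lemma(t, c) for t, c in zip(tokens, iwnlp_lemmas)]
print(lemmas)
```

Passing `lowercase=True` would turn 'Diese' into 'diese' for tokens without a prediction, which may be preferable if the lemmas feed a case-insensitive bag-of-words model.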
Merging this with the master issue in #2486!
I am confused about the lemmatizer. For the sentence 'Ich sehe Bäume' (I see trees), token.lemma is just 'Bäume'. I thought it would be lemmatized to the singular form 'Baum' (tree)?