explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

German adjectives ending on `-e` are not lemmatized using the lookup lemmatizer #4622

Open SuzanaK opened 4 years ago

SuzanaK commented 4 years ago

How to reproduce the behaviour

import spacy
nlp = spacy.load('de')
s1 = 'Der schöne Garten'
doc = nlp(s1)
[(t, t.lemma_) for t in doc]
>> [(Der, 'der'), (schöne, 'schöne'), (Garten, 'Garten')]

s2 = 'Ein schöner Garten'
doc = nlp(s2)
[(t, t.lemma_) for t in doc]
>> [(Ein, 'Ein'), (schöner, 'schön'), (Garten, 'Garten')]

Reason

As far as I can see, all forms of German adjectives ending in `-e` in spacy-lookups-data/spacy_lookups_data/data/de_lemma_lookup.json are capitalized, e.g.:

"Dekorative": "dekorativ",
"Weiße": "Weiß",
"Schöne": "Schönes",

adrianeboyd commented 4 years ago

The lookup tables, while sometimes better than nothing, are pretty terrible. They don't take any context into account and are very unpredictable / brittle. Many adjectives ending in -e are there, so it's all kind of strange. I'd recommend an alternate lemmatizer for German for now, see #2668 for some suggestions.

SuzanaK commented 4 years ago

Hi @adrianeboyd, I started running some tests today for a rule-based lemmatizer and would like to propose a PR soon. Will we still maintain the lookup table afterwards? Does the lookup table take precedence over the rule-based lemmatizer? Or would all words that are already covered by a rule be removed from the lookup table to make it smaller?

adrianeboyd commented 4 years ago

A PR for this would be great! You might want to get in touch with Guadalupe Romero (@guadi1994), who has started working on this for Spanish and German.

The rule-based lemmatizer requires tags from the tagger, so the lookup table is used as a backup to use when no tags are available. The rules should have precedence over the table and I think that if there are rules, the lookup table is not used at all, but I might be mistaken.

Since it's used as a backup, it would probably make sense to fix some of the really weird closed-class errors in the table, like "er" -> "ich". (We do have plans to add statistical models for morphology and lemmatization, which could hopefully replace all of this, but it's all still in progress.)

SuzanaK commented 4 years ago

You are right, the lookup is ignored as soon as there are rules. That means I can't start with a few rules and enhance them gradually; I have to develop the rules and add all exceptions (and especially for the nouns, there will be many) to the exceptions list up front. I'd also have to write an extra lemmatizing method, because the standard lemmatize method would change all nouns to lower case, which won't work for German. I won't be able to do that in the near future, but I'll try to fix the worst errors in the lookup table.
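
To make the constraint concrete, here is a rough sketch of a rules-plus-exceptions lemmatizer that keeps noun capitalization. Everything in it is hypothetical: the `EXCEPTIONS` and `SUFFIX_RULES` data, the `rule_lemma` helper, and the minimum stem length are illustrations only and do not reflect spaCy's rule format or the rules proposed in this thread.

```python
# Hypothetical data for illustration; not spaCy's actual rules or exceptions.
EXCEPTIONS = {
    "ADJ": {"hohe": "hoch", "hoher": "hoch"},
    "NOUN": {"Gärten": "Garten"},
}
SUFFIX_RULES = {
    # Suffixes to strip, tried in order; the first match wins.
    "ADJ": ["er", "es", "en", "em", "e"],
    "NOUN": [],
}


def rule_lemma(token: str, pos: str) -> str:
    """Check exceptions first, then strip the first matching suffix, else return the form."""
    # Nouns keep their capitalization; other parts of speech are lowercased before matching.
    form = token if pos == "NOUN" else token.lower()
    exceptions = EXCEPTIONS.get(pos, {})
    if form in exceptions:
        return exceptions[form]
    for suffix in SUFFIX_RULES.get(pos, []):
        # Require a stem of at least three characters to limit over-stripping.
        if form.endswith(suffix) and len(form) - len(suffix) >= 3:
            return form[: len(form) - len(suffix)]
    return form


print(rule_lemma("schöne", "ADJ"))
print(rule_lemma("Garten", "NOUN"))
```

With these toy rules, an adjective like `teuer` would be stripped to `teu`, which is the kind of case that has to go into the exceptions list when there is no lookup table to fall back on.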

EBoiSha commented 2 years ago

If someone can let me know the following:

Is this still an issue?
Where is the file referenced in the initial comment?

Then I'd like to take care of this issue.

Best regards

polm commented 2 years ago

I don't think this has been addressed yet. The data is in this repo if you want to have a look at it.

polm commented 2 years ago

Let me also link in this more recent issue about German lemmas: https://github.com/explosion/spaCy/issues/9799

EBoiSha commented 2 years ago

Okay, I can't find the problem described in this thread, at least. The file has also been updated since this issue was opened.

Is there any way to confirm whether this issue is still current? It appears that it can be closed, but I cannot tell for sure.

EBoiSha commented 2 years ago

Okay, I think that given https://explosion.ai/blog/edit-tree-lemmatizer we could close this issue, or at least that additional work would not make much sense if lookup tables can be avoided.

adrianeboyd commented 2 years ago

Yes, we're hoping to be able to include the edit tree lemmatizer in an upcoming release (probably v3.3). There are still cases where a lookup table can make sense, so we don't necessarily want to abandon all related issues. For most users, additional work on the lookup table wouldn't make sense right now.

SuzanaK commented 2 years ago

Sorry for my late reply. I did not continue working on the rule-based lemmatizer for German because I was informed that ML lemmatizers are coming soon. If anybody is interested, here are the rules, but a lot of exceptions are still missing:

https://github.com/SuzanaK/spacy-lookups-data/commit/0ee4083a1609f1dd96ee41907c1d398c09dd52f3

jzohrab commented 11 months ago

Testing with the latest spaCy release in a new venv, this may have been fixed:

Setup:

python3 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download de_core_news_sm

Test:

import spacy
nlp = spacy.load('de_core_news_sm')

def print_toks(sentence):
    print(f"\n{sentence}:")
    doc = nlp(sentence)
    print([(t, t.lemma_) for t in doc])

print_toks('Der schöne Garten')
print_toks('Ein schöner Garten')

Gives

Der schöne Garten:
[(Der, 'der'), (schöne, 'schön'), (Garten, 'Garten')]

Ein schöner Garten:
[(Ein, 'ein'), (schöner, 'schön'), (Garten, 'Garten')]

spaCy version: spacy==3.6.1
Platform: Apple M2 Pro, Ventura 13.4.1 (22F82)
Python version: 3.11.3
Models: de_core_news_sm

jzohrab commented 11 months ago

If anyone wants to test this out with other sentences, a better script is included in issue 10953, or you can drop the sentences here, marking the words you want to check with "*" before and after, e.g. "Der *schöne* Garten". 👋

adrianeboyd commented 11 months ago

spaCy v3.3+ switches a number of languages to the trainable edit tree lemmatizer, so the default lemmatizer output will differ from what was discussed in the original post.

In general, some forms will be better than the lookup lemmatizer (probably most adjectives) and some will be worse (2nd person verbs that are rare in the training data). You may need to evaluate both for your task to see which is more suitable, or still consider third-party lemmatizers.
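
One rough way to compare lemmatizers on your own data is to score each against a small set of gold lemmas. This is only a sketch: `lemma_accuracy`, the `GOLD` pairs, and the two stand-in lemmatizers are hypothetical and are not part of spaCy.

```python
from typing import Callable, Iterable, Tuple


def lemma_accuracy(lemmatize: Callable[[str], str], gold: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (token, gold lemma) pairs for which the lemmatizer output matches."""
    pairs = list(gold)
    if not pairs:
        return 0.0
    correct = sum(1 for token, lemma in pairs if lemmatize(token) == lemma)
    return correct / len(pairs)


# Hypothetical gold pairs and stand-in lemmatizers, for illustration only.
GOLD = [("schöne", "schön"), ("schöner", "schön"), ("Garten", "Garten")]


def identity(token: str) -> str:
    """Stand-in lemmatizer that returns the token unchanged."""
    return token


def strip_final_e(token: str) -> str:
    """Stand-in lemmatizer that only removes a trailing -e."""
    return token[:-1] if token.endswith("e") else token


print(lemma_accuracy(identity, GOLD))
print(lemma_accuracy(strip_final_e, GOLD))
```

The stand-ins work on single tokens for simplicity. The trainable lemmatizer depends on context, so for a real comparison you would run each pipeline over full sentences and compare `t.lemma_` per token against your gold annotations.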

The German lookup tables in spacy-lookups-data haven't been improved (they're still kind of terrible), but to clarify this issue I'll update the issue title.