explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Lemmatizer in French not getting the right lemma for some Verbs. #7320

Open ioExpander opened 3 years ago

ioExpander commented 3 years ago

Hi. Here is an issue I'm getting with some French pipelines (fr_core_news_lg or fr_dep_news_trf). The lemmatizer works in some cases but returns the wrong lemma in others. So far I've only been able to reproduce the issue with verbs from the same group (the 'first group', ending in "er"), though not all of them are affected, as you can see in example 2. The verbs are detected properly, even with the right tense, but in a lot of cases the lemma is missing the trailing "r".

A quick lookup against a verb dictionary could work around the issue, but I would rather help fix the root cause here :)

Thank you.

How to reproduce the behaviour

import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])

# 1
doc = nlp("le chat dort dans son lit")
print(*[t.lemma_ for t in doc])  # Correct
# Output: le chat dormir dans son lit

# 2
doc = nlp("le chat mange des souris")
print(*[t.lemma_ for t in doc])  # Correct
# Output: le chat manger un souris

# 3
doc = nlp("le chat monte les escaliers")
print(*[t.lemma_ for t in doc])  # Incorrect
# Output: le chat monte le escalier
# Should be: le chat monter le escalier

# 4
doc = nlp("le chat saute haut")
print(*[t.lemma_ for t in doc])  # Incorrect
# Output: le chat saute haut
# Should be: le chat sauter haut

adrianeboyd commented 3 years ago

Hi, it does look like there might be a rule for e -> er that's missing from the French lemmatizer rules:

https://github.com/explosion/spacy-lookups-data/blob/544a965501f06f55349e7402e80d6a49bc4cb3cd/spacy_lookups_data/data/fr_lemma_rules.json#L79-L125

My French is not that great, so I'm not sure whether this might cause problems for other verbs in some way, but you can add a rule to try it out like this:

import spacy

nlp = spacy.load("fr_core_news_sm")
# Append an e -> er suffix rule to the verb rules of the lemmatizer.
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]
assert [t.lemma_ for t in nlp("le chat monte les escaliers")] == ['le', 'chat', 'monter', 'le', 'escalier']

The lemmatizer depends on the POS annotation, so you still might see lemma errors that are caused by morphologizer errors rather than lemmatizer problems.
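
For readers unfamiliar with how a suffix rule of this kind behaves, here is a minimal sketch in plain Python. It is a toy model, not spaCy's actual implementation: `apply_suffix_rules` is a hypothetical helper, and the real lemmatizer also consults other lookup tables that are left out here.

```python
from typing import List, Sequence, Tuple


def apply_suffix_rules(word: str, rules: Sequence[Tuple[str, str]]) -> List[str]:
    """Return every candidate produced by rewriting a matching suffix.

    Toy model only: hypothetical helper, not part of spaCy.
    """
    word = word.lower()
    candidates = []
    for old, new in rules:
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            if candidate and candidate not in candidates:
                candidates.append(candidate)
    # Fall back to the surface form when no rule matched.
    return candidates or [word]


rules = [("e", "er")]  # the rule suggested above
print(apply_suffix_rules("monte", rules))  # ['monter']
print(apply_suffix_rules("dort", rules))   # ['dort'] (no rule matches)
```

Because the rule looks only at the suffix, it fires on any verb form ending in -e, whatever its tense or mood.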

ioExpander commented 3 years ago

Hi. Thank you for the feedback. I ran some tests using the additional lemma rule that you suggested, and it does seem to solve the issue in my examples. I'm trying to figure out whether this rule can be generalized, or whether there are exceptions: French verbs ending in -e whose infinitive does not end in -er. I'm also wondering why I did not hit the missing rule in example 2: mange -> "manger" as a verb.

I also found a strange issue when running the tests, as if the lemma inference were cached between sentences. If the verb "monte" first gets the (incorrect) lemma "monte" and I then add the lemma rule [['e', 'er']], the next sentence is still lemmatized as "monte" instead of "monter". I will try to investigate this one too, hopefully later today.

adrianeboyd commented 3 years ago

There is a lemmatizer cache that would cause this behavior. You can clear it (just by hand: nlp.get_pipe("lemmatizer").cache = {}) or save and reload the pipeline.
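
To illustrate why a rule added after the first call has no effect until the cache is cleared, here is a toy memoizing lemmatizer in plain Python. The class name, the cache key and the rule handling are all assumptions made for illustration; only the idea of a dict-like `cache` attribute that can be reset comes from the comment above.

```python
class ToyLemmatizer:
    """Hypothetical stand-in for a rule-based lemmatizer with a result cache."""

    def __init__(self):
        self.rules = []  # list of (old_suffix, new_suffix) pairs
        self.cache = {}  # word -> lemma, filled on first lookup

    def lemmatize(self, word):
        if word in self.cache:  # a cached answer bypasses the rules
            return self.cache[word]
        lemma = word
        for old, new in self.rules:
            if word.endswith(old):
                lemma = word[: len(word) - len(old)] + new
                break
        self.cache[word] = lemma
        return lemma


lem = ToyLemmatizer()
first = lem.lemmatize("monte")   # 'monte', computed before the rule exists
lem.rules.append(("e", "er"))
stale = lem.lemmatize("monte")   # still 'monte', served from the cache
lem.cache = {}                   # clear the cache by hand
fresh = lem.lemmatize("monte")   # 'monter', the new rule is now applied
print(first, stale, fresh)
```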

ioExpander commented 3 years ago

oh. cool. Thanks ! Will stop looking at that second one then !

ioExpander commented 3 years ago

Hi again. I ran a few tests and could not find a single broken verb lemma after adding the lemma rule. I'm not an expert, but I am a native French speaker, so here are a few examples. The additional rule fixes lemma inference for very basic verbs like "to jump" or "to climb", so I would add it to the code base. It is a single-line change, but I can open a PR if you want.

import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]

def test_verb(sentence, correct_value):
    # Clear the lemmatizer cache so earlier results are not reused.
    nlp.get_pipe("lemmatizer").cache = {}
    doc = nlp(sentence)
    # In every test sentence the verb is the third token.
    assert doc[2].pos_ == "VERB"
    assert doc[2].lemma_ == correct_value
    print(f"OK - {correct_value}")


test_cases = [
    # [sentence, correct_value],
    ["le chat mange du pain", "manger"],
    ["le chat dormait dans son lit", "dormir"],
    ["le chat saute haut", "sauter"],
    ["la souris marche sur les cailloux", "marcher"],
    ["la souris ouvre les portes", "ouvrir"],
    ["la souris offre des cadeaux", "offrir"],
    ["le chat regarde le chien", "regarder"],
    ["le chat monte les escaliers", "monter"],
]

for t in test_cases:
    test_verb(*t)

adrianeboyd commented 3 years ago

Sure, if you'd like to open a PR, please go ahead! We mainly test the lookup lemmatizers in the tests in that repo because we don't want to have to download pretrained pipelines for the test suite. You can construct docs and test by hand, but it's a bit of a pain. We could potentially add more lemmatizer tests to the spacy-models repo, which is what we use to test newly trained models before releasing.

e-nesse commented 3 years ago

Having tried this solution (nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]), I can report that it does not produce (only) the desired outcome. While it does fix many lemmatization errors for conjugated -ER verb forms, it also introduces errors in the lemmatization of infinitive forms. Infinitives like "résoudre", "prendre", or "réduire" are assigned lemmas of "résoudrer", "prendrer", and "réduirer", respectively. Perhaps there's a way to account for verbs already in the infinitive form? I am not sure exactly how the lemmatization rules work, unfortunately - sorry! - but there might also be potential problems with subjunctive forms ending in -e (qu'on prenne, que je vienne, ...) being given non-existent lemmas (prenne --> prenner) if the verb form doesn't factor into the rule somehow.

ioExpander commented 3 years ago

Hi. Yes, I encountered the same issue a few days ago, which is why I have put off sending this PR to investigate further... The fix with the new lemma rule is really useful, but it does break more complex sentences like the one in the example below. Note that the verb is correctly recognized in the morph output as being in the infinitive form (VerbForm=Inf). So skipping the lemmatizer rules for verbs in the infinitive could be a way to go. I could not find an easy way to do it (besides doing it manually outside of spaCy).

import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]

doc = nlp("Je souhaite descendre dans la cave")
print([[t.lemma_, t.pos_, t.morph] for t in doc])

# => [['je', 'PRON', Number=Sing|Person=1], ['souhaiter', 'VERB', Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin], ['descendrer', 'VERB', VerbForm=Inf], ['dans', 'ADP', ], ['le', 'DET', Definite=Def|Gender=Fem|Number=Sing|PronType=Art], ['cave', 'NOUN', Gender=Fem|Number=Sing]]

adrianeboyd commented 3 years ago

The rule-based lemmatizer does have a mechanism for checking for forms like infinitives that are already lemmas and don't need to be processed further. There's not currently a check for French, but you can see what it looks like for English here:

https://github.com/explosion/spaCy/blob/ed561cf428494c2b7a6790cd4b91b5326102b59d/spacy/lang/en/lemmatizer.py#L5-L40

All you would need to do is add a similar is_base_form method to FrenchLemmatizer. As long as the tagger/morphologizer is correct (which is a bit of a caveat for most of the rules, of course), you could then skip infinitives with is_base_form, and the new rule would only apply to finite verbs.
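
As a rough sketch of that idea, the check and its effect can be modelled in plain Python. This is not the actual spaCy code: in spaCy, is_base_form would be a method on FrenchLemmatizer that receives a Token, whereas the functions below take a part-of-speech string and a dict of morphological features, and `lemmatize_verb` is a hypothetical helper that only knows the e -> er rule.

```python
from typing import Dict


def is_base_form(pos: str, morph: Dict[str, str]) -> bool:
    """True when the form is already a lemma; here only verb infinitives."""
    return pos.lower() == "verb" and morph.get("VerbForm") == "Inf"


def lemmatize_verb(word: str, pos: str, morph: Dict[str, str]) -> str:
    """Apply the e -> er rule unless the form is already a base form."""
    word = word.lower()
    if is_base_form(pos, morph):
        return word
    if pos.lower() == "verb" and word.endswith("e"):
        return word[:-1] + "er"
    return word


# Feature values taken from the output earlier in this thread.
infinitive = lemmatize_verb("descendre", "VERB", {"VerbForm": "Inf"})
finite = lemmatize_verb("souhaite", "VERB", {"Mood": "Ind", "VerbForm": "Fin"})
print(infinitive, finite)  # descendre souhaiter
```

In this toy model, finite forms such as the subjunctive "prenne" would still be rewritten, so the check only covers the infinitive case, and it is only as reliable as the morphological features it is given.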