explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Lemmatization issues [Italian][Spanish][French] #12954

Open · databill86 opened this issue 1 year ago

databill86 commented 1 year ago

Hello,

As a follow-up to #11298 and #11347, I would like to report some lemmatization problems with the spaCy 3.6 models for Italian, Spanish, and French. We did not have these issues with version 3.2.

How to reproduce the behaviour

Here are some examples:

| Language | Text | Returned lemma | Expected lemma |
| --- | --- | --- | --- |
| it | efficiente e cortesissima. | corteso | cortese |
| it | Voglio **disabbonarmi | Voglio | volere |
| it | Voglio disabbonaremi | disabbonare | disabbonare |
| it | Bella | Bella | bello |
| it | Perde il colore | Perdere | perdere |
| it | Filtrare | Filtrare | filtrare |
| it | Non si restringono al lavaggio | restringono | restringere |
| it | Cassiera gentile | Cassiera | cassiera |
| it | Trovo sempre un sacco | Trovo | trovare |
| it | prodotto ottimo | produrre | prodotto |
| it | Buongiorno Ho ricevuto un set di calzini diverso da quello da me selezionato nell'ordine Grazie [name] | divergere | diverso |
| it | buono bebe | Bebe | bébé |
| it | Richiedo la fatturazione elettronica | Richiedo | richiedere |
| it | Cercavo una felpina | Cercavo | cercare |
| it | Cercavo una felpina | felpino | felpina |
| it | Manca lo short blu del set codice 461680 | Manca | mancare |
| it | Soddisfatta ma nonostante 2 lavaggi perde ancora pelucchi | soddisfattare | soddisfatto |
| it | Ben fatto ma troppo grande | Ben | bene |
| it | Rapiditá nella consegna | Rapidità | rapidità |
| it | (allego screenshot) | allego | allegare |
| it | Prezzi competitivi Spedizione nei tempi previsti Acquistate per il black friday... | Spedizione | spedizione |
| it | Ottima merce | Ottimo | ottimo |
| it | quando mi rimborserete | rimborsareta | rimborsare |
| it | non carica la pagina | carico | caricare |
| it | È possibile avere un contatto o un riferimento del corriere? | Corriere | corriere |
| it | Quando potrò effettuare il mio acquisto? | potrò | potere |
| it | Quando lo consegnerete? | consegnareta | consegnare |
| it | Ok...casomai rifaccio l'ordine | rifacciareta | rifare |
| it | Compro sempre ordinando on-line e ritirando in negozio, | ritirira | ritirare |
| it | Un po’ corte di manica | corte | corto |
| it | Buonasera, mi è arrivato il pacco contenente tutto tranne il jeans blu con codice di vendita [number] | contenente | contenere |
| it | che ora chiudete | chiudetere | chiudere |
| it | non riesco a tracciarlo | tracciare lo | tracciare |
| es | Problema con reparto | Problema | problema |
| es | L'app esta caída no puedes realizar la compra | L'app | app |
| es | Fallo en el uso de la aplicación | Fallo | fallo |
| es | Contenta | contentar | contento |
| es | solicito BAJA de la suscripción a las newsletters | solicito | solicitar |
| es | "Desde hace 17 años compro en Kiabi" | compro | comprar |
| es | Correcto | Correcto | correcto |
| es | na cola se clientes en espera | clientser | cliente |
| es | Hola a todos: que horarios tenéis | tenéis | tener |
| es | Rapidez en pedidos | rapidez | Rapidez |
| es | Mala prenda | mala | mal / malo |
| es | Estupendo | Estupendo | estupendo |
| es | no me lo han enviado pero si cobrado. | cobrado | cobrar |
| fr | Bonjour, Aurez-vous la parure | Aurez | avoir |
| fr | via le formulaire sur Internet | Internet | internet |
| fr | Jolie modèle | Jolie | joli |

I guess the issue with tokens at the beginning of sentences (which are wrongly tagged as PROPN) has already been mentioned many times.
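
For example, the sentence-initial problem is easy to reproduce. A minimal sketch, assuming the it_core_news_md pipeline (the exact tag can vary by model version):

```python
import spacy

nlp = spacy.load("it_core_news_md")
doc = nlp("Bella")
# Sentence-initial capitalized tokens are often tagged PROPN, so the
# lemmatizer leaves them unchanged ("Bella" instead of "bello").
print(doc[0].pos_, doc[0].lemma_)
```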

Your Environment

adrianeboyd commented 1 year ago

Thanks for the examples, they'll be helpful when looking at how to improve the lemmatizers in the future!

mathieuchateau commented 1 year ago

Also, in French, "domicile" becomes "domicil", which is not correct, but "domiciles" (plural) correctly becomes "domicile".

A sanity check could be added: double lemmatization should not change the result.


```python
import spacy

NLP_FR = spacy.load("fr_core_news_md")

# "domicile" (singular) should stay "domicile"
print(NLP_FR("domicile")[0].lemma_)

# "domiciles" (plural) should become "domicile"
print(NLP_FR("domiciles")[0].lemma_)

# Lemmatizing the result again should not change it
print(NLP_FR(NLP_FR(NLP_FR("domiciles")[0].lemma_)[0].lemma_)[0].lemma_)
```

This is with version 3.7.2.
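
That check is easy to automate. A minimal sketch (the word list is illustrative):

```python
import spacy

nlp = spacy.load("fr_core_news_md")

def is_idempotent(word: str) -> bool:
    """True if lemmatizing the word's lemma gives the same lemma back."""
    first = nlp(word)[0].lemma_
    second = nlp(first)[0].lemma_
    return first == second

for w in ["domicile", "domiciles", "maison"]:
    print(w, is_idempotent(w))
```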

adrianeboyd commented 1 year ago

The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. In these pipelines, the tags come from the morphologizer component.

Here it looks like it's tagging "domicile" as ADJ, so incorrect rules are applied.

The statistical components like the tagger and morphologizer aren't really intended for processing individual words out of context. Even just a smidgen of context improves the results:

```python
import spacy

nlp = spacy.load("fr_core_news_md")
assert nlp("le domicile")[1].lemma_ == "domicile"
```

If you want to double-check that the rules are working as intended (since sometimes it may be a problem with the rules or exceptions and not the POS tag), you can test just the lemmatizer component by providing the POS tag by hand:

```python
import spacy

nlp = spacy.load("fr_core_news_md")
doc = nlp.make_doc("domicile")  # just tokenization, no pipeline components
doc[0].pos_ = "NOUN"
assert nlp.get_pipe("lemmatizer")(doc)[0].lemma_ == "domicile"
```

mathieuchateau commented 1 year ago

> The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. [...]

Thanks for the help. It also explains why lemmatization does not work when disabling other components: there is no error, but it does nothing. To save resources, since I only need the lemmatizer, I tried this:

```python
import spacy

NLP_FR = spacy.load(
    "fr_core_news_md",
    disable=["morphologizer", "parser", "senter", "ner", "attribute_ruler"],
)
```

There is no error when calling .lemma_, but it does nothing. It would be better to throw an error if possible.
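
For what it's worth, a sketch of a trimmed pipeline that should keep the rule-based lemmatizer working: tok2vec and morphologizer have to stay enabled, since they assign the POS tags the lemmatizer rules depend on, but the heavier components can be dropped.

```python
import spacy

# `exclude` drops components entirely instead of just disabling them.
# tok2vec and morphologizer stay: they assign token.pos, which the
# rule-based lemmatizer needs to pick its rules.
nlp = spacy.load("fr_core_news_md", exclude=["parser", "ner"])
print(nlp("les domiciles")[1].lemma_)  # expected: 'domicile'
```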

In this example, the word "domicil" does not exist at all in the French dictionary.

Based on the data I have to handle, there are around 1,185 items where double lemmatization provides different (better) results.

Version 3.2.0 kept "domicile" unchanged when it was submitted for lemmatization.

Would it be possible to train it with this project: http://www.lexique.org/? Their data is very good for French; they just don't provide a model with vectors.

adrianeboyd commented 1 year ago

We wouldn't use the Lexique data in our pipelines due to the non-commercial clause in its CC BY-NC license, but if the license works for your use case and you'd prefer to use it, it's pretty easy to create a lookup table that you can use with the lookup mode of the built-in spaCy Lemmatizer.

We have an example of a French lookup lemmatizer table here:

https://github.com/explosion/spacy-lookups-data/blob/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data/fr_lemma_lookup.json
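
For illustration, a minimal sketch of wiring a custom table into the lookup mode (the table contents below are placeholders; in practice you would build the dict from the Lexique data):

```python
import spacy
from spacy.lookups import Lookups

# Drop the default rule-based lemmatizer and add one in lookup mode.
nlp = spacy.load("fr_core_news_md", exclude=["lemmatizer"])
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Placeholder table: build this mapping from your lexicon instead.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"domiciles": "domicile", "aurez": "avoir"})
lemmatizer.initialize(lookups=lookups)

print(nlp("domiciles")[0].lemma_)  # 'domicile'
```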

mathieuchateau commented 1 year ago

@adrianeboyd OK, thanks for the explanation.

I guess spaCy removes the last character when it encounters an unknown word to lemmatize. From my point of view, this mostly hurts:

NLP_FR("xxxxx")[0].lemma_
Out[10]: 'xxxx'
NLP_FR(NLP_FR("xxxxx")[0].lemma_)[0].lemma_
Out[11]: 'xxx'
adrianeboyd commented 1 year ago

Overall it sounds like a lookup lemmatizer, which doesn't depend on context, might be a better fit for these kinds of examples. You can see how to switch from the rule-based lemmatizer to the lookup lemmatizer here: https://spacy.io/models#design-modify

You can also provide your own lookup table instead of using the default one from spacy-lookups-data.
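
A minimal sketch of that switch, following the linked docs (the spacy-lookups-data package must be installed to provide the default French table):

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.load("fr_core_news_md")
nlp.remove_pipe("lemmatizer")
# Lookup mode needs no POS tags; initialize() loads the default tables.
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
```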


> I guess spaCy removes the last character when it encounters an unknown word to lemmatize. From my point of view, this mostly hurts.

This is not what is going on. Not that there can't be problems with the lemmatizer rules and tables, but I'd be very surprised if simply removing any final character were one of the existing rules for any of the rule-based lemmatizers provided in the trained pipelines.

You can take a closer look at the rules for French, which are here under fr_lemma_* (all the files except the _lookup.json one are used by the rule-based lemmatizer):

https://github.com/explosion/spacy-lookups-data/tree/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data

along with the language-specific lemmatizer implementation under spacy/lang/fr/lemmatizer.py.

These are suffix rewrite rules, and I think this is the rule that it's applying for the final x in nouns:

https://github.com/explosion/spacy-lookups-data/blob/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data/fr_lemma_rules.json#L62
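
As an illustration of how such a rule turns "xxxxx" into "xxxx", here is a toy re-implementation of a suffix rewrite. This is not spaCy's actual code (which also consults exception and index tables), and the rule list is a made-up sample in the same [suffix, replacement] format as fr_lemma_rules.json:

```python
def apply_suffix_rules(word: str, rules: list[list[str]]) -> str:
    """Toy suffix rewrite: replace the first matching suffix."""
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word

# Made-up sample rules in [suffix, replacement] format.
noun_rules = [["aux", "al"], ["x", ""], ["s", ""]]

print(apply_suffix_rules("chevaux", noun_rules))  # 'cheval'
print(apply_suffix_rules("xxxxx", noun_rules))    # 'xxxx' (final 'x' stripped)
```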

mathieuchateau commented 1 year ago

@adrianeboyd Thanks. Using rules returns the same results as before with 3.2, which are much better in our case. Also, with rules, the "x" in "xxxxx" is no longer removed.