explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Lemmatization issues [Italian][Spanish][French] #12954

Open · databill86 opened this issue 1 year ago

databill86 commented 1 year ago

Hello,

As a follow-up to #11298 and #11347, I would like to report some lemmatization problems with the spaCy 3.6 models for Italian, Spanish, and French. We did not have these issues with version 3.2.

How to reproduce the behaviour

Here are some examples:

| Language | Text | Returned lemma | Expected lemma |
| --- | --- | --- | --- |
| it | efficiente e cortesissima. | corteso | cortese |
| it | Voglio **disabbonarmi | Voglio | volere |
| it | Voglio disabbonaremi | disabbonare | disabbonare |
| it | Bella | Bella | bello |
| it | Perde il colore | Perdere | perdere |
| it | Filtrare | Filtrare | filtrare |
| it | Non si restringono al lavaggio | restringono | restringere |
| it | Cassiera gentile | Cassiera | cassiera |
| it | Trovo sempre un sacco | Trovo | trovare |
| it | prodotto ottimo | produrre | prodotto |
| it | Buongiorno Ho ricevuto un set di calzini diverso da quello da me selezionato nell'ordine Grazie [name] | divergere | diverso |
| it | buono bebe | Bebe | bébé |
| it | Richiedo la fatturazione elettronica | Richiedo | richiedere |
| it | Cercavo una felpina | Cercavo | cercare |
| it | Cercavo una felpina | felpino | felpina |
| it | Manca lo short blu del set codice 461680 | Manca | mancare |
| it | Soddisfatta ma nonostante 2 lavaggi perde ancora pelucchi | soddisfattare | soddisfatto |
| it | Ben fatto ma troppo grande | Ben | bene |
| it | Rapiditá nella consegna | Rapidità | rapidità |
| it | (allego screenshot) | allego | allegare |
| it | Prezzi competitivi Spedizione nei tempi previsti Acquistate per il black friday... | Spedizione | spedizione |
| it | Ottima merce | Ottimo | ottimo |
| it | quando mi rimborserete | rimborsareta | rimborsare |
| it | non carica la pagina | carico | caricare |
| it | È possibile avere un contatto o un riferimento del corriere? | Corriere | corriere |
| it | Quando potrò effettuare il mio acquisto? | potrò | potere |
| it | Quando lo consegnerete? | consegnareta | consegnare |
| it | Ok...casomai rifaccio l'ordine | rifacciareta | rifare |
| it | Compro sempre ordinando on-line e ritirando in negozio, | ritirira | ritirare |
| it | Un po’ corte di manica | corte | corto |
| it | Buonasera, mi è arrivato il pacco contenente tutto tranne il jeans blu con codice di vendita [number] | contenente | contenere |
| it | che ora chiudete | chiudetere | chiudere |
| it | non riesco a tracciarlo | tracciare lo | tracciare |
| es | Problema con reparto | Problema | problema |
| es | L'app esta caída no puedes realizar la compra | L'app | app |
| es | Fallo en el uso de la aplicación | Fallo | fallo |
| es | Contenta | contentar | contento |
| es | solicito BAJA de la suscripción a las newsletters | solicito | solicitar |
| es | "Desde hace 17 años compro en Kiabi" | compro | comprar |
| es | Correcto | Correcto | correcto |
| es | na cola se clientes en espera | clientser | cliente |
| es | Hola a todos: que horarios tenéis | tenéis | tener |
| es | Rapidez en pedidos | rapidez | Rapidez |
| es | Mala prenda | mala | mal / malo |
| es | Estupendo | Estupendo | estupendo |
| es | no me lo han enviado pero si cobrado. | cobrado | cobrar |
| fr | Bonjour, Aurez-vous la parure | Aurez | avoir |
| fr | via le formulaire sur Internet | Internet | internet |
| fr | Jolie modèle | Jolie | joli |

I guess the issue with tokens at the beginning of sentences (which are wrongly tagged as PROPN) has already been mentioned many times.
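
For example, the sentence-initial problem is easy to reproduce. A minimal sketch, assuming the it_core_news_md pipeline (the exact tag can vary by model version):

```python
import spacy

nlp = spacy.load("it_core_news_md")
doc = nlp("Bella")
# Sentence-initial capitalized tokens are often tagged PROPN, so the
# lemmatizer leaves them unchanged ("Bella" instead of "bello").
print(doc[0].pos_, doc[0].lemma_)
```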

Your Environment

adrianeboyd commented 1 year ago

Thanks for the examples, they'll be helpful when looking at how to improve the lemmatizers in the future!

mathieuchateau commented 1 year ago

Also, in French, "domicile" becomes "domicil", which is not correct, but "domiciles" (plural) correctly becomes "domicile".

A sanity check could be added: double lemmatization should not change the result.


```python
import spacy

NLP_FR = spacy.load("fr_core_news_md")

# "domicile" (singular) should stay "domicile"
print(NLP_FR("domicile")[0].lemma_)

# "domiciles" (plural) should become "domicile"
print(NLP_FR("domiciles")[0].lemma_)

# Lemmatizing the result again should not change it
print(NLP_FR(NLP_FR(NLP_FR("domiciles")[0].lemma_)[0].lemma_)[0].lemma_)
```

This is with version 3.7.2.
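
That check is easy to automate. A minimal sketch (the word list is illustrative):

```python
import spacy

nlp = spacy.load("fr_core_news_md")

def is_idempotent(word: str) -> bool:
    """True if lemmatizing the word's lemma gives the same lemma back."""
    first = nlp(word)[0].lemma_
    second = nlp(first)[0].lemma_
    return first == second

for w in ["domicile", "domiciles", "maison"]:
    print(w, is_idempotent(w))
```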

adrianeboyd commented 1 year ago

The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. In these pipelines, the tags come from the morphologizer component.

Here it looks like it's tagging "domicile" as ADJ, so incorrect rules are applied.

The statistical components like the tagger and morphologizer aren't really intended for processing individual words out of context. Even just a smidgen of context improves the results:

```python
import spacy

nlp = spacy.load("fr_core_news_md")
assert nlp("le domicile")[1].lemma_ == "domicile"
```

If you want to double-check that the rules are working as intended (since sometimes it may be a problem with the rules or exceptions and not the POS tag), you can test just the lemmatizer component by providing the POS tag by hand:

```python
import spacy

nlp = spacy.load("fr_core_news_md")
doc = nlp.make_doc("domicile")  # just tokenization, no pipeline components
doc[0].pos_ = "NOUN"
assert nlp.get_pipe("lemmatizer")(doc)[0].lemma_ == "domicile"
```

mathieuchateau commented 1 year ago

> The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. [...]

Thanks for the help. It also explains why lemmatization does not work when disabling other components: there is no error, but it does nothing. To save resources, since I only need the lemmatizer, I tried this:

```python
import spacy

NLP_FR = spacy.load(
    "fr_core_news_md",
    disable=["morphologizer", "parser", "senter", "ner", "attribute_ruler"],
)
```

There is no error when calling .lemma_, but it does nothing. It would be better to throw an error if possible.
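
For what it's worth, a sketch of a trimmed pipeline that should keep the rule-based lemmatizer working: tok2vec and morphologizer have to stay enabled, since they assign the POS tags the lemmatizer rules depend on, but the heavier components can be dropped.

```python
import spacy

# `exclude` drops components entirely instead of just disabling them.
# tok2vec and morphologizer stay: they assign token.pos, which the
# rule-based lemmatizer needs to pick its rules.
nlp = spacy.load("fr_core_news_md", exclude=["parser", "ner"])
print(nlp("les domiciles")[1].lemma_)  # expected: 'domicile'
```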

In this example, the word "domicil" does not exist at all in the French dictionary.

Based on the data I have to handle, there are around 1,185 items where double lemmatization provides different (better) results.

Version 3.2.0 kept "domicile" unchanged when it was submitted for lemmatization.

Would it be possible to train it with this project: http://www.lexique.org/? Their data is very good for French; they just don't provide a model with vectors.

adrianeboyd commented 1 year ago

We wouldn't use the Lexique data in our pipelines due to the non-commercial clause in its CC BY-NC license, but if the license works for your use case and you'd prefer to use it, it's pretty easy to create a lookup table that you can use with the lookup mode of the built-in spaCy Lemmatizer.

We have an example of a French lookup lemmatizer table here:

https://github.com/explosion/spacy-lookups-data/blob/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data/fr_lemma_lookup.json
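
For illustration, a minimal sketch of wiring a custom table into the lookup mode (the table contents below are placeholders; in practice you would build the dict from the Lexique data):

```python
import spacy
from spacy.lookups import Lookups

# Drop the default rule-based lemmatizer and add one in lookup mode.
nlp = spacy.load("fr_core_news_md", exclude=["lemmatizer"])
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Placeholder table: build this mapping from your lexicon instead.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"domiciles": "domicile", "aurez": "avoir"})
lemmatizer.initialize(lookups=lookups)

print(nlp("domiciles")[0].lemma_)  # 'domicile'
```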

mathieuchateau commented 1 year ago

@adrianeboyd OK, thanks for the explanation.

I guess spaCy removes the last character when it encounters an unknown word to lemmatize. From my point of view, this mostly hurts:

NLP_FR("xxxxx")[0].lemma_
Out[10]: 'xxxx'
NLP_FR(NLP_FR("xxxxx")[0].lemma_)[0].lemma_
Out[11]: 'xxx'
adrianeboyd commented 1 year ago

Overall it sounds like a lookup lemmatizer, which doesn't depend on context, might be a better fit for these kinds of examples. You can see how to switch from the rule-based lemmatizer to the lookup lemmatizer here: https://spacy.io/models#design-modify

You can also provide your own lookup table instead of using the default one from spacy-lookups-data.
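
A minimal sketch of that switch, following the linked docs (the spacy-lookups-data package must be installed to provide the default French table):

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.load("fr_core_news_md")
nlp.remove_pipe("lemmatizer")
# Lookup mode needs no POS tags; initialize() loads the default tables.
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
```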


> I guess spaCy removes the last character when it encounters an unknown word to lemmatize. From my point of view, this mostly hurts.

This is not what is going on. Not that there can't be problems with the lemmatizer rules and tables, but I'd be very surprised if simply removing any final character were one of the existing rules for any of the rule-based lemmatizers provided in the trained pipelines.

You can take a closer look at the rules for French, which are here under fr_lemma_* (all the files except the _lookup.json one are used by the rule-based lemmatizer):

https://github.com/explosion/spacy-lookups-data/tree/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data

along with the language-specific lemmatizer implementation under spacy/lang/fr/lemmatizer.py.

These are suffix rewrite rules, and I think this is the rule that it's applying for the final x in nouns:

https://github.com/explosion/spacy-lookups-data/blob/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data/fr_lemma_rules.json#L62
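
As an illustration of how such a rule turns "xxxxx" into "xxxx", here is a toy re-implementation of a suffix rewrite. This is not spaCy's actual code (which also consults exception and index tables), and the rule list is a made-up sample in the same [suffix, replacement] format as fr_lemma_rules.json:

```python
def apply_suffix_rules(word: str, rules: list[list[str]]) -> str:
    """Toy suffix rewrite: replace the first matching suffix."""
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word

# Made-up sample rules in [suffix, replacement] format.
noun_rules = [["aux", "al"], ["x", ""], ["s", ""]]

print(apply_suffix_rules("chevaux", noun_rules))  # 'cheval'
print(apply_suffix_rules("xxxxx", noun_rules))    # 'xxxx' (final 'x' stripped)
```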

mathieuchateau commented 1 year ago

@adrianeboyd Thanks. Using rules returns the same results as before with 3.2, which are much better in our case. Also, with rules, the "x" in "xxxxx" is no longer removed.