databill86 opened this issue 1 year ago
Thanks for the examples, they'll be helpful when looking at how to improve the lemmatizers in the future!
Also, in French, "domicile" becomes "domicil", which is not correct, but "domiciles" (plural) correctly becomes "domicile".
A sanity check could be added: double lemmatization should not change the result.
import spacy

NLP_FR = spacy.load("fr_core_news_md")

print("domicile (singular) should stay domicile (singular)")
print(NLP_FR("domicile")[0].lemma_)
print("domiciles (plural) should become domicile (singular)")
print(NLP_FR("domiciles")[0].lemma_)
print("Doing a double lemmatization should not change the result")
print(NLP_FR(NLP_FR(NLP_FR("domiciles")[0].lemma_)[0].lemma_)[0].lemma_)
In version 3.7.2
The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. In these pipelines, the tags come from the morphologizer component.
Here it looks like it's tagging "domicile" as ADJ, so incorrect rules are applied.
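A quick way to confirm what the lemmatizer sees is to inspect the token's POS tag directly (a minimal check; the exact tag you get depends on the model version):

import spacy

nlp = spacy.load("fr_core_news_md")
doc = nlp("domicile")
# The morphologizer's POS tag determines which lemmatizer rules apply
print(doc[0].pos_, doc[0].lemma_)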
The statistical components like the tagger and morphologizer aren't really intended for processing individual words out of context. Even just a smidgen of context improves the results:
import spacy
nlp = spacy.load("fr_core_news_md")
assert nlp("le domicile")[1].lemma_ == "domicile"
If you want to double-check that the rules are working as intended (since sometimes it may be a problem with the rules or exceptions and not the POS tag), you can test just the lemmatizer component by providing the POS tag by hand:
import spacy
nlp = spacy.load("fr_core_news_md")
doc = nlp.make_doc("domicile") # just tokenization, no pipeline components
doc[0].pos_ = "NOUN"
assert nlp.get_pipe("lemmatizer")(doc)[0].lemma_ == "domicile"
Thanks for helping. It also explains why lemmatization does not work when disabling other modules: no error is raised, but nothing happens. To save resources, since I only need the lemmatizer, I tried this:
NLP_FR = spacy.load("fr_core_news_md", disable=["morphologizer", "parser", "senter", "ner", "attribute_ruler"])
No error is raised when calling .lemma_, but it does nothing. It would be better to throw an error if possible.
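A sketch of a lighter setup that should still produce lemmas (assuming spaCy v3.4+ for the enable argument, and the component names of fr_core_news_md): since the rule-based lemmatizer needs the POS tags from the morphologizer, tok2vec and morphologizer have to stay enabled:

import spacy

# Keep only the components the rule-based lemmatizer depends on:
# tok2vec feeds the morphologizer, which provides the POS tags.
nlp = spacy.load(
    "fr_core_news_md",
    enable=["tok2vec", "morphologizer", "lemmatizer"],
)
print(nlp("domiciles")[0].lemma_)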
In this example, the word "domicil" does not exist at all in the French dictionary.
Based on the data I have to handle, there are around 1185 items where double lemmatization provides different (better) results.
Version 3.2.0 kept "domicile" correct when it was submitted for lemmatization.
Would it be possible to train it with this project: http://www.lexique.org/? They are very good for French; they just don't provide a model with vectors.
We wouldn't use the lexique data in our pipelines due to the non-commercial clause in the CC BY-NC license, but if the license works for your use case and you'd prefer to use it, it's pretty easy to create a lookup table that you can use with the lookup mode of the built-in spacy Lemmatizer.
We have an example of a French lookup lemmatizer table here:
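To give a sense of the format, a lookup table is just a direct mapping from surface forms to lemmas; the entries below are made up for illustration rather than taken from any real table:

# A lookup table maps each inflected form directly to its lemma.
# Illustrative entries only, not from an actual resource.
lemma_lookup = {
    "domiciles": "domicile",
    "maisons": "maison",
}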
@adrianeboyd ok, thanks for the explanation.
I guess spacy removes the last character when it encounters an unknown word to lemmatize. It mostly hurts, from my point of view:
NLP_FR("xxxxx")[0].lemma_
Out[10]: 'xxxx'
NLP_FR(NLP_FR("xxxxx")[0].lemma_)[0].lemma_
Out[11]: 'xxx'
Overall it sounds like a lookup lemmatizer, which doesn't depend on context, might be a better fit for these kinds of examples. You can see how to switch from the rule-based lemmatizer to the lookup lemmatizer: https://spacy.io/models#design-modify
You can also provide your own lookup table instead of using the default one from spacy-lookups-data.
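A sketch of that setup, with an illustrative one-entry table (Lemmatizer.initialize accepting a lookups argument is part of the v3 API):

import spacy
from spacy.lookups import Lookups

nlp = spacy.load("fr_core_news_md", exclude=["lemmatizer"])
# Add a lookup-mode lemmatizer in place of the rule-based one
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Provide a custom table instead of the default from spacy-lookups-data
lookups = Lookups()
lookups.add_table("lemma_lookup", {"domiciles": "domicile"})  # illustrative entry
lemmatizer.initialize(lookups=lookups)

print(nlp("domiciles")[0].lemma_)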
I guess spacy removes the last character when it encounters an unknown word to lemmatize. It mostly hurts, from my point of view
This is not what is going on. Not that there can't be problems with the lemmatizer rules and tables, but I'd be very surprised if simply removing any final character were one of the existing rules for any of the rule-based lemmatizers provided in the trained pipelines.
You can take a closer look at the rules for French, which are here under fr_lemma_* (all the files except the _lookup.json one are used by the rule-based lemmatizer), along with the language-specific lemmatizer implementation under spacy/lang/fr/lemmatizer.py.
These are suffix rewrite rules, and I think this is the rule that it's applying for the final x in nouns:
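For illustration, the suffix rules are stored as [old_suffix, new_suffix] pairs grouped by POS, so a noun rule that strips a final "x" would look roughly like this (an illustrative reconstruction of the format, not the exact file contents):

# Shape of the fr_lemma_rules data: {pos: [[old_suffix, new_suffix], ...]}
# A rule like ["x", ""] rewrites "xxxxx" -> "xxxx" for tokens tagged as nouns.
rules = {
    "noun": [
        ["x", ""],  # illustrative reconstruction
    ],
}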
@adrianeboyd thanks, using rules returns the same results as before with 3.2, which are much better in our case. Also, using rules, the "x" in "xxxxx" is no longer removed.
Hello,
As a follow-up on #11298 #11347, I would like to report some lemmatization problems with the spaCy 3.6 models for Italian, Spanish, and French. We did not have these issues with version 3.2.
How to reproduce the behaviour
Here are some examples:
I guess the issue with tokens at the beginning of sentences (because they are wrongly detected as PROPN) has already been mentioned many times.
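For instance, a quick way to see this effect (a sketch; whether the first token actually comes out as PROPN depends on the model and the sentence):

import spacy

nlp = spacy.load("fr_core_news_md")
# A capitalized sentence-initial token may be tagged PROPN, in which case
# the lemmatizer leaves it unchanged instead of reducing the plural noun.
doc = nlp("Maisons et jardins sont vendus.")
print(doc[0].pos_, doc[0].lemma_)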
Your Environment