luigibrancati closed 2 years ago
Hey @luigibrancati,
Thanks for the detailed report! Just to make sure, the correct output would have been rimanere and not rimare, right?
Hi @kadarakos,
You're welcome! Yes, rimane is the 3rd person singular of the present indicative of the verb rimanere.
I don't know how much this helps, but I've noticed that TreeTagger, another lemmatizer I've tried, gives the same error!
I investigated the issue a bit. The Italian model seems to run the part-of-speech based lookup lemmatizer:
In [1]: import spacy
In [2]: nlp_acc = spacy.load("it_core_news_lg")
In [3]: lemmatizer = nlp_acc.get_pipe('lemmatizer')
In [4]: lemmatizer.mode
Out[4]: 'pos_lookup'
The part-of-speech tagger correctly classifies "rimane" as a VERB:
In [9]: [t.pos_ for t in nlp_acc("Il galleggiante rimane a galla")]
Out[9]: ['DET', 'PROPN', 'VERB', 'ADP', 'NOUN']
so let's check out the corresponding lookup table:
In [14]: lookup_table = lemmatizer.lookups.get_table('lemma_lookup_verb')
In [15]: lookup_table.get('rimango')
Out[15]: 'rimanere'
In [16]: lookup_table.get('rimani')
Out[16]: 'rimanere'
In [17]: lookup_table.get('rimane')
Out[17]: 'rimare'
In [18]: lookup_table.get('rimaniamo')
Out[18]: 'rimanere'
In [19]: lookup_table.get('rimanete')
Out[19]: 'rimanere'
In [20]: lookup_table.get('rimangono')
Out[20]: 'rimanere'
To me it seems like the issue is with the lookup table.
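A quick way to spot entries like this one is to check that all forms of a single paradigm map to the same lemma. The sketch below is only an illustration: it uses a plain dict holding the six entries printed above as a stand-in for the real table, and find_outliers is a hypothetical helper, not part of spaCy.

```python
from collections import Counter

# Hypothetical stand-in for the 'lemma_lookup_verb' table: a plain dict with
# the six entries printed above (not the real spaCy Table object).
verb_table = {
    "rimango": "rimanere",
    "rimani": "rimanere",
    "rimane": "rimare",
    "rimaniamo": "rimanere",
    "rimanete": "rimanere",
    "rimangono": "rimanere",
}


def find_outliers(table, forms):
    """Return {form: lemma} for forms whose lemma differs from the majority lemma."""
    known = [form for form in forms if form in table]
    if not known:
        return {}
    # The lemma shared by most forms is assumed to be the correct one.
    majority, _ = Counter(table[form] for form in known).most_common(1)[0]
    return {form: table[form] for form in known if table[form] != majority}


# Present indicative forms of rimanere, as queried in the session above.
present_indicative = ["rimango", "rimani", "rimane", "rimaniamo", "rimanete", "rimangono"]
outliers = find_outliers(verb_table, present_indicative)
print(outliers)  # {'rimane': 'rimare'}
```

The majority vote is a heuristic: it only works when most forms of the paradigm are already mapped correctly, which is the case here.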
@kadarakos I agree with you, the issue seems to be with the lookup table itself. I didn't encounter any other error like this one, but now I know where to look!
Hey @luigibrancati,
Thanks again for your report! The rimane --> rimanere fix in the lemmatizer lookup is merged: https://github.com/explosion/spacy-lookups-data. Please let us know if you find anything fishy again!
How to reproduce the behaviour
For the code
The output is
The pipeline treats rimane as a conjugated form of the verb rimare, which is wrong both in this context and grammatically (rimare never takes the form rimane).
This error also happens with the light pipeline it_core_news_sm.
Your Environment