explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Inaccurate lemmatization in italian #10359

Closed luigibrancati closed 2 years ago

luigibrancati commented 2 years ago

How to reproduce the behaviour

For the code

import spacy

nlp_acc = spacy.load("it_core_news_lg")
print([t.lemma_ for t in nlp_acc("Il galleggiante rimane a galla")])
# EN: the buoy stays afloat

The output is

['il', 'galleggiante', 'rimare', 'a', 'galla']
# EN: the buoy rhymes afloat

The pipeline treats rimane as a conjugated form of the verb rimare. That is wrong both in context and grammatically: rimare never takes the form rimane.

The same error occurs with the small pipeline it_core_news_sm.


kadarakos commented 2 years ago

Hey @luigibrancati,

Thanks for the detailed report! Just to make sure: the correct output would have been rimanere, not rimare, right?

luigibrancati commented 2 years ago

Hi @kadarakos,

You're welcome! Yes, rimane is the third-person singular present indicative of the verb rimanere.

luigibrancati commented 2 years ago

I don't know how much this helps, but I've noticed that TreeTagger, another lemmatizer I've tried, gives the same error!

kadarakos commented 2 years ago

I investigated the issue a bit. The Italian model seems to run the part-of-speech-based lookup lemmatizer:

In [1]: import spacy

In [2]: nlp_acc = spacy.load("it_core_news_lg")

In [3]: lemmatizer = nlp_acc.get_pipe('lemmatizer')

In [4]: lemmatizer.mode
Out[4]: 'pos_lookup'
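
As a rough illustration of what a POS-based lookup does, here is a minimal sketch in plain Python. It is not spaCy's implementation: the `TABLES` dict, the `pos_lookup_lemma` function, and the fallback to the lowercased form are all assumptions made for illustration. Only the three verb entries are taken from the lookup table inspected later in this thread.

```python
# Toy model of a POS-based lookup lemmatizer (illustrative only, not spaCy's code).
# One table per part-of-speech tag; the verb entries mirror values seen in this thread.
TABLES = {
    "VERB": {"rimango": "rimanere", "rimani": "rimanere", "rimane": "rimare"},
    "NOUN": {},
}


def pos_lookup_lemma(form: str, pos: str) -> str:
    """Return the lemma for `form` from the table selected by `pos`.

    Assumption: unknown forms fall back to the lowercased surface form.
    """
    table = TABLES.get(pos, {})
    key = form.lower()
    return table.get(key, key)


print(pos_lookup_lemma("rimane", "VERB"))  # rimare
print(pos_lookup_lemma("galla", "NOUN"))   # galla
```

In a scheme like this, a correct POS tag only selects the table. It cannot compensate for a wrong entry inside that table.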

The part-of-speech tagger correctly classifies "rimane" as a VERB:

In [9]: [t.pos_ for t in nlp_acc("Il galleggiante rimane a galla")]
Out[9]: ['DET', 'PROPN', 'VERB', 'ADP', 'NOUN']

so let's check out the corresponding lookup table:

In [14]: lookup_table = lemmatizer.lookups.get_table('lemma_lookup_verb')

In [15]: lookup_table.get('rimango')
Out[15]: 'rimanere'

In [16]: lookup_table.get('rimani')
Out[16]: 'rimanere'

In [17]: lookup_table.get('rimane')
Out[17]: 'rimare'

In [18]: lookup_table.get('rimaniamo')
Out[18]: 'rimanere'

In [19]: lookup_table.get('rimanete')
Out[19]: 'rimanere'

In [20]: lookup_table.get('rimangono')
Out[20]: 'rimanere'

To me it seems like the issue is with the lookup table.
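
The manual check above can be turned into a small consistency test: within one verb's present-tense paradigm, every form should map to the same lemma. The sketch below uses a plain dict holding the six values printed in the session above. The `paradigm_outliers` helper is hypothetical, and taking the most common lemma as the expected one is a heuristic assumption, not something spaCy provides.

```python
from collections import Counter

# Form -> lemma pairs copied from the IPython session above.
observed = {
    "rimango": "rimanere",
    "rimani": "rimanere",
    "rimane": "rimare",
    "rimaniamo": "rimanere",
    "rimanete": "rimanere",
    "rimangono": "rimanere",
}


def paradigm_outliers(entries: dict) -> dict:
    """Return the forms whose lemma differs from the paradigm's most common lemma."""
    majority_lemma, _ = Counter(entries.values()).most_common(1)[0]
    return {form: lemma for form, lemma in entries.items() if lemma != majority_lemma}


print(paradigm_outliers(observed))  # {'rimane': 'rimare'}
```

A check like this only flags candidates. A flagged form still needs a look from someone who knows the language.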

luigibrancati commented 2 years ago

@kadarakos I agree with you, the issue seems to be with the lookup table itself. I didn't encounter any other error like this one, but now I know where to look!

kadarakos commented 2 years ago

Hey @luigibrancati,

Thanks again for your report! The fix mapping rimane to rimanere in the lemmatizer lookup table has been merged into spacy-lookups-data: https://github.com/explosion/spacy-lookups-data. Please let us know if you find anything fishy again!
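
For installations that cannot pick up the updated data yet, one possible stopgap is to overwrite the entry in memory after loading the pipeline. The sketch below is an assumption-laden illustration: `patch_lemma` is a hypothetical helper, and a plain dict stands in for the real table. It assumes the object returned by `lemmatizer.lookups.get_table('lemma_lookup_verb')` supports dict-style `get` and item assignment, which should be verified against the installed spaCy version.

```python
from collections.abc import MutableMapping
from typing import Optional


def patch_lemma(table: MutableMapping, form: str, lemma: str) -> Optional[str]:
    """Set table[form] = lemma and return the previous lemma (None if absent)."""
    previous = table.get(form)
    table[form] = lemma
    return previous


# Stand-in for the verb lookup table; the faulty entry is the one from this thread.
verb_table = {"rimane": "rimare"}
old = patch_lemma(verb_table, "rimane", "rimanere")
print(old, "->", verb_table["rimane"])  # rimare -> rimanere
```

A patch like this lives only in the running process and has to be reapplied every time the pipeline is loaded.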
