explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Inaccurate lemmas and POS tags for Greek ('el' model) #4108

Closed petasis closed 5 years ago

petasis commented 5 years ago

Hi all, I am having problems with the model for the Greek language, mainly with part-of-speech tags (the failure rate on verbs is quite high) and lemmas. For example:

  1. 'Έχεις αδέρφια;' - Here the verb is tagged as noun.
    • Spacy: "Έχεις|έχει|NOUN αδέρφια|αδέρφι|NOUN ;|;|PUNCT"
    • Ellogon: "Έχεις|έχω|VERB αδέρφια|αδέρφι|NOUN ;|;|PUNCT"
  2. "Τι μέγεθος παπουτσιών φοράς;" - Here the lemma of the 3rd word is not a Greek word, and the verb is tagged as a noun.
    • Spacy: "Τι|τι|PRON μέγεθος|μέγεθος|NOUN παπουτσιών|παπουτσί|NOUN φοράς|φορά|NOUN ;|;|PUNCT"
    • Ellogon: "Τι|τι|PRON μέγεθος|μέγεθος|NOUN παπουτσιών|παπούτσι|NOUN φοράς|φορώ|VERB ;|;|PUNCT"
  3. "Τι μέγεθος παπούτσια φοράς;" - Here the lemma of the 3rd word is correct, but the verb is tagged as a noun.
    • Spacy: "Τι|τι|PRON μέγεθος|μέγεθος|NOUN παπούτσια|παπούτσι|NOUN φοράς|φορά|NOUN ;|;|PUNCT"
    • Ellogon: "Τι|τι|PRON μέγεθος|μέγεθος|NOUN παπούτσια|παπούτσι|NOUN φοράς|φορά-φορώ-φοράδα|VERB ;|;|PUNCT"
  4. "Πώς είναι το να είσαι υπολογιστής;" - Here the pronoun is tagged as verb, and the first verb as AUX.
    • Spacy: "Πώς|πώς|VERB είναι|είναι|AUX το|το|DET να|να|PART είσαι|είμαι|VERB υπολογιστής|υπολογιστής|NOUN ;|;|PUNCT"
    • Ellogon: "Πώς|πώς|ADV είναι|είμαι|VERB το|ο|DET να|να|ADP είσαι|είμαι|VERB υπολογιστής|υπολογιστής|NOUN ;|;|PUNCT"
  5. "Ποια είναι η αγαπημένη σου γλώσσα;" - Here again the VERB/AUX mismatch, and the participle is tagged as a verb. Also, the lemmas from Ellogon are more accurate.
    • Spacy: "Ποια|ποια|PRON είναι|είναι|AUX η|η|DET αγαπημένη|αγαπημένη|VERB σου|σου|PRON γλώσσα|γλώσσα|NOUN ;|;|PUNCT"
    • Ellogon: "Ποια|ποιος|PRON είναι|είμαι|VERB η|ο|DET αγαπημένη|αγαπώ|PART σου|εγώ|PRON γλώσσα|γλώσσα|NOUN ;|;|PUNCT"
  6. "Συμφωνείς με τον όρο Βόρεια Μακεδονία για τους βόρειους γείτονές μας;" - Here the verb is tagged as ADJ, and again the lemmas from Ellogon are better.
    • Spacy: "Συμφωνείς|συμφωνείς|ADJ με|με|ADP τον|τον|DET όρο|όρο|NOUN Βόρεια|βόρειος|ADJ Μακεδονία|μακεδονία|PROPN για|για|ADP τους|τους|DET βόρειους|βόρειους|ADJ γείτονές|γείτονέ|NOUN μας|μας|PRON ;|;|PUNCT"
    • Ellogon: "Συμφωνείς|συμφωνώ|VERB με|με|ADP τον|ο|DET όρο|όρος|NOUN Βόρεια|βόρειος|ADJ Μακεδονία|Μακεδονία|PROPN για|για|ADP τους|ο|DET βόρειους|βόρειος|ADJ γείτονές|γείτονας|NOUN μας|μου|PRON ;|;|PUNCT"
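
The side-by-side comparison above can also be done mechanically. The following is a minimal sketch, assuming each tagger's output is a space-separated string of `token|lemma|POS` triples as in the examples. The helper names `parse_triples` and `diff_taggers` are made up for illustration and are not part of spaCy or Ellogon.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    token: str
    lemma: str
    pos: str


def parse_triples(line: str) -> List[Triple]:
    """Parse 'token|lemma|POS token|lemma|POS ...' into Triple records."""
    triples = []
    for item in line.split():
        # rsplit from the right so a token that itself contains '|' stays intact
        token, lemma, pos = item.rsplit("|", 2)
        triples.append(Triple(token, lemma, pos))
    return triples


def diff_taggers(a: str, b: str) -> List[Tuple[str, str, str, str]]:
    """Return (token, field, value_a, value_b) for every disagreement."""
    ta, tb = parse_triples(a), parse_triples(b)
    if [t.token for t in ta] != [t.token for t in tb]:
        raise ValueError("tokenizations differ; cannot align the two outputs")
    differences = []
    for x, y in zip(ta, tb):
        if x.lemma != y.lemma:
            differences.append((x.token, "lemma", x.lemma, y.lemma))
        if x.pos != y.pos:
            differences.append((x.token, "pos", x.pos, y.pos))
    return differences


# Outputs copied from example 1 above.
spacy_out = "Έχεις|έχει|NOUN αδέρφια|αδέρφι|NOUN ;|;|PUNCT"
ellogon_out = "Έχεις|έχω|VERB αδέρφια|αδέρφι|NOUN ;|;|PUNCT"
for row in diff_taggers(spacy_out, ellogon_out):
    print(row)
```

Running this over a larger set of sentences would give a rough count of how often the two tools disagree on lemmas versus tags, which helps separate tagger errors from lemmatizer errors.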

Do you know if a better model for Greek will be released soon? In the meantime, is it possible to replace the part-of-speech tagger and lemmatiser of the 'el' model with others that I have access to?

giannisdaras commented 5 years ago

Hi! Thanks for pointing this out.

First of all, an error in the PoS tagger propagates to a lemmatization error, because lemmatization uses the PoS tag in order to apply rules and find the correct lemma for each word. Therefore, I suspect that the main issue here is the incorrect PoS tags.
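
To illustrate how a wrong tag can lead to a wrong lemma, here is a toy sketch of a rule-based lemmatizer keyed on the PoS tag. The two suffix rules are invented for this illustration and are not spaCy's actual Greek rules; `toy_lemmatize` and `TOY_RULES` are hypothetical names.

```python
import unicodedata

# Hypothetical suffix rules for illustration only; NOT spaCy's real Greek rules.
TOY_RULES = {
    "VERB": [("εις", "ω")],  # 2nd person singular -> 1st person singular
    "NOUN": [("ς", "")],     # strip a final sigma
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Apply the first matching suffix rule registered for the given PoS tag."""
    form = unicodedata.normalize("NFC", word).lower()
    for suffix, replacement in TOY_RULES.get(pos, []):
        if form.endswith(suffix):
            return form[: len(form) - len(suffix)] + replacement
    return form  # no rule matched: fall back to the lowercased form


print(toy_lemmatize("Έχεις", "VERB"))
print(toy_lemmatize("Έχεις", "NOUN"))
```

With these toy rules the same surface form yields "έχω" under the VERB tag and "έχει" under the NOUN tag, which mirrors the pattern in the first example of the report.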

For the PoS tagger errors, the situation is as follows: the PoS tagger gets 95% accuracy on the dev set of the treebank it is trained on, which is the Universal Dependencies conversion of the Greek Dependency Treebank (v2.2). However, if you inspect this treebank a bit, you will notice that the language used there is quite different: questions are rare, and there is no dialogue, just more or less declarative sentences that state facts about the world or support opinions.

One interesting thing to notice is that if you convert your questions to declarative sentences, spaCy produces correct results. For example:

  • "Εγώ έχω αδέρφια" gives: ['PRON', 'VERB', 'NOUN', 'PUNCT']
  • "Συμφωνώ με τον όρο Βόρεια Μακεδονία για τους βόρειους γείτονες μας" gives: ['VERB', 'ADP', 'DET', 'NOUN', 'ADJ', 'PROPN', 'ADP', 'DET', 'ADJ', 'NOUN', 'PRON']

You could reproduce this behavior for all your examples.

So, in general, I would say that this is obviously bad, but we have to remember two things: (i) you are asking the model to predict on a different type of sentence than the ones it was trained on, and (ii) you can always fine-tune the model with some extra annotations on your data to get the desired behavior.

If you want to use a function from some other library, you could always create a new component and add this to your nlp pipeline, as described here.
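
A rough sketch of such a component follows. In spaCy 2.x a pipeline component is a callable that receives a `Doc` and returns it, so the class below overwrites `token.lemma_` and `token.pos_` with the results of an external tool. The `ExternalTagOverride` class, the `fake_tagger` function, and the tagger signature (a list of token texts in, one `(lemma, pos)` pair per token out) are assumptions made for this example. A stand-in document is used so the snippet runs without a model installed.

```python
from types import SimpleNamespace
from typing import Callable, List, Tuple

# Hypothetical signature: takes token texts, returns one (lemma, pos) per token.
ExternalTagger = Callable[[List[str]], List[Tuple[str, str]]]


class ExternalTagOverride:
    """Overwrite token.lemma_ and token.pos_ using an external tagger."""

    name = "external_tag_override"  # name under which the pipe would be listed

    def __init__(self, tagger: ExternalTagger):
        self.tagger = tagger

    def __call__(self, doc):
        tokens = list(doc)
        results = self.tagger([t.text for t in tokens])
        if len(results) != len(tokens):
            raise ValueError("external tagger must return one result per token")
        for token, (lemma, pos) in zip(tokens, results):
            token.lemma_ = lemma
            token.pos_ = pos
        return doc


def fake_tagger(words: List[str]) -> List[Tuple[str, str]]:
    """Stand-in tagger with canned answers taken from example 1 of the report."""
    table = {
        "Έχεις": ("έχω", "VERB"),
        "αδέρφια": ("αδέρφι", "NOUN"),
        ";": (";", "PUNCT"),
    }
    return [table[w] for w in words]


# Stand-in for a spaCy Doc: a list of objects with text, lemma_ and pos_.
doc = [SimpleNamespace(text=w, lemma_="", pos_="") for w in ["Έχεις", "αδέρφια", ";"]]
doc = ExternalTagOverride(fake_tagger)(doc)
print([(t.text, t.lemma_, t.pos_) for t in doc])

# With spaCy 2.x and the model installed, registration could look roughly like:
#   nlp = spacy.load("el")
#   nlp.add_pipe(ExternalTagOverride(my_tagger), after="tagger")
```

Passing spaCy's own token texts to the external tool keeps the two tokenizations aligned. If the external tool tokenizes on its own, the results would need to be aligned with the `Doc` before being written back.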

ines commented 5 years ago

Merging this with #3052 🙂

petasis commented 5 years ago

Hi again,

I have installed spaCy on a new machine (running the same OS as the previous one), and I am getting different lemmas on the two machines. How can I debug this?

Machine A:

pip3 freeze|grep spacy
spacy==2.1.8
python3 -m spacy download el --user
Requirement already satisfied: el_core_news_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/el_core_news_sm-2.1.0/el_core_news_sm-2.1.0.tar.gz#egg=el_core_news_sm==2.1.0 in ./.local/lib/python3.7/site-packages (2.1.0)
? Download and installation successful
You can now load the model via spacy.load('el_core_news_sm')
? Linking successful
/home/petasis/.local/lib/python3.7/site-packages/el_core_news_sm --> /home/petasis/.local/lib/python3.7/site-packages/spacy/data/el
You can now load the model via spacy.load('el')

Machine B:

pip3 freeze|grep spacy
spacy==2.1.8
python3 -m spacy download el --user
Requirement already satisfied: el_core_news_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/el_core_news_sm-2.1.0/el_core_news_sm-2.1.0.tar.gz#egg=el_core_news_sm==2.1.0 in /home/pepper/.local/lib/python3.7/site-packages (2.1.0)
\u2714 Download and installation successful
You can now load the model via spacy.load('el_core_news_sm')
\u2714 Linking successful
/home/pepper/.local/lib/python3.7/site-packages/el_core_news_sm --> /home/pepper/.local/lib/python3.7/site-packages/spacy/data/el
You can now load the model via spacy.load('el')

But machine A, for example, returns "γνώρισο" for "γνώρισα", while machine B returns "γνώρισας". Most of the lemmas are the same, but there are some cases in which different runs produce different results. How can I debug this?
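
One possible first step, offered as a sketch rather than an established spaCy procedure, is to check whether the two installations contain identical files and run in the same environment. The logs above print the check mark as `?` on one machine and as `\u2714` on the other, which suggests the terminal or locale settings differ, although that may be unrelated to the lemmas. The helpers `fingerprint_dir`, `diff_fingerprints`, and `environment_summary` are hypothetical names, and the file names in the demo are made up.

```python
import hashlib
import os
import platform
import sys
import tempfile


def fingerprint_dir(root: str) -> dict:
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip bytecode caches, which can legitimately differ between machines.
        dirnames[:] = [d for d in dirnames if d != "__pycache__"]
        for fname in filenames:
            full = os.path.join(dirpath, fname)
            with open(full, "rb") as fh:
                rel = os.path.relpath(full, root)
                digests[rel] = hashlib.sha256(fh.read()).hexdigest()
    return digests


def diff_fingerprints(a: dict, b: dict) -> list:
    """Return the sorted paths that are missing on one side or differ in content."""
    return sorted(p for p in set(a) | set(b) if a.get(p) != b.get(p))


def environment_summary() -> dict:
    """Collect interpreter and locale details worth comparing across machines."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


# Demo on two temporary directories; the file names here are made up.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    for root, payload in ((dir_a, b"rules-v1"), (dir_b, b"rules-v2")):
        with open(os.path.join(root, "lemma_rules.json"), "wb") as fh:
            fh.write(payload)
        with open(os.path.join(root, "meta.json"), "wb") as fh:
            fh.write(b"{}")
    changed = diff_fingerprints(fingerprint_dir(dir_a), fingerprint_dir(dir_b))

print(changed)
print(environment_summary())
```

On each machine, `fingerprint_dir` could be pointed at the installed model directory shown in the logs (for example the `el_core_news_sm` folder under `site-packages`), with the result saved to a file and the two files compared. If the fingerprints match, the difference more likely comes from the environment or from dependency versions than from the model data.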

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.