CopticScriptorium / tagger-part-of-speech

Part of speech tagger for Sahidic Coptic
http://coptic.pacific.edu

lemmatization of Greek loan words #5

Closed ctschroeder closed 8 years ago

ctschroeder commented 8 years ago

I'm noticing some inconsistencies, such as here with the verb metanoei. What is the best way to edit the lemmatizer?

ctschroeder commented 8 years ago

(adding an issue, because we may want to say something in the Lemmatization guidelines about i/ei in addition to fixing specific words)

ctschroeder commented 8 years ago

See also lemma=/ⲟⲩⲟⲓ/|/ⲟⲩⲟⲉⲓ/

amir-zeldes commented 8 years ago

Thanks for catching these! This is actually a normalization issue rather than a lemmatization issue. Even if we know that the standard lemma is metanoei, we'd also like the norm to read the same. For an unknown form, the lemmatizer generally guesses that the lemma is the same as the norm, so as long as the norm reads metanoi it will keep postulating a lemma metanoi next to metanoei. What we'd really like is for the normalizer to catch this.

I've updated these two items in the normalizer here: CopticScriptorium/normalizer@40eab033a3625ea8bf8a8ac40549cde8de2e1953

My sense is that the desirable forms are:

amir-zeldes commented 8 years ago

Oh, and the place to update is therefore the normalizer's norm_table.tab
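
To illustrate the interaction described in this thread, here is a minimal sketch, not the actual normalizer or lemmatizer code. It assumes that norm_table.tab holds one tab-separated `variant<TAB>norm` pair per line (the real file layout may differ), and the names `load_norm_table`, `normalize`, `lemmatize`, `analyze`, and `LEMMA_LEXICON` are all hypothetical. It shows why a missing table entry leaks the variant spelling through to the lemma, and why adding the entry fixes both norm and lemma at once.

```python
# Hypothetical stand-in for the contents of norm_table.tab.
# ASSUMPTION: one "variant<TAB>normalized form" pair per line.
NORM_TABLE_TEXT = "metanoi\tmetanoei\n"

# Hypothetical lemma lexicon mapping a normalized form to its lemma.
LEMMA_LEXICON = {"metanoei": "metanoei"}


def load_norm_table(text):
    """Parse tab-separated variant/norm pairs into a dict."""
    table = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:  # skip blank or malformed lines
            table[fields[0]] = fields[1]
    return table


def normalize(form, norm_table):
    """Return the table's norm for a listed variant, else the form unchanged."""
    return norm_table.get(form, form)


def lemmatize(norm, lexicon):
    """Look the norm up; for an unknown norm, guess that lemma == norm."""
    return lexicon.get(norm, norm)


def analyze(form, norm_table, lexicon):
    """Run normalization, then lemmatization, and return (norm, lemma)."""
    norm = normalize(form, norm_table)
    return norm, lemmatize(norm, lexicon)


# Without a table entry, the variant spelling survives as both norm and lemma.
print(analyze("metanoi", {}, LEMMA_LEXICON))  # ('metanoi', 'metanoi')

# With the entry in place, norm and lemma both come out in the standard form.
table = load_norm_table(NORM_TABLE_TEXT)
print(analyze("metanoi", table, LEMMA_LEXICON))  # ('metanoei', 'metanoei')
```

In this sketch the fix is a single added line in the table, and the lemma lexicon is left untouched, which matches the point that the place to update is the normalizer rather than the lemmatizer.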