CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

lemmatization #10

Open ctschroeder opened 7 years ago

ctschroeder commented 7 years ago

All corpora need to be checked for the lemmatization of ⲩⲛⲟⲩ; the lemma should be ⲟⲩⲛⲟⲩ. See this ANNIS search.

Corpora should also be checked for ⲡⲱⲛⲅ; it should lemmatize as ⲡⲱⲛⲕ (ⲡⲱⲛⲅ is a known variant; not sure whether it should be normalized).

(The lemmatizer should also be checked.)

ctschroeder commented 7 years ago

Adjust in the lemmatizer: ⲙⲟⲓϩⲉ/ⲙⲟⲉⲓϩⲉ is lemmatized inconsistently (it should probably be normalized/lemmatized as ⲙⲟⲉⲓϩⲉ, as in Crum): https://corpling.uis.georgetown.edu/annis/?id=2d63101d-4e33-48e5-9f54-d9c2a4c900e4

Also, another normalization/lemmatization issue: we are normalizing and lemmatizing ϩⲟⲉⲓⲧⲉ to itself and ϩⲟⲓⲧⲉ to itself. (The dictionary also lists it as ϩⲟ(ⲉ)ⲓⲧⲉ, which of course links to nothing in ANNIS.) https://corpling.uis.georgetown.edu/annis/?id=40917f1a-f549-43b8-b633-d35f704533c0 . We should at least change the lemmas to ϩⲟⲉⲓⲧⲉ.
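A rough sketch of how the lemma changes proposed in this thread so far could be checked mechanically. Everything here is an assumption for illustration: the `PREFERRED_LEMMA` table, the `find_lemma_issues` helper, and the `(norm, lemma)` token pairs are hypothetical and do not reflect the actual corpus format or the lemmatizer's code. The table only covers lemmas; whether the norm forms should also change is left open, as discussed above.

```python
# Hypothetical table: variant lemma -> preferred lemma, taken from this thread.
# It says nothing about normalization, only about the lemma layer.
PREFERRED_LEMMA = {
    "ⲩⲛⲟⲩ": "ⲟⲩⲛⲟⲩ",
    "ⲡⲱⲛⲅ": "ⲡⲱⲛⲕ",
    "ⲙⲟⲓϩⲉ": "ⲙⲟⲉⲓϩⲉ",
    "ϩⲟⲓⲧⲉ": "ϩⲟⲉⲓⲧⲉ",
}


def find_lemma_issues(tokens):
    """Return (index, norm, lemma, suggested) for every token whose
    lemma is one of the variant forms listed in PREFERRED_LEMMA."""
    issues = []
    for index, (norm, lemma) in enumerate(tokens):
        suggested = PREFERRED_LEMMA.get(lemma)
        if suggested is not None:
            issues.append((index, norm, lemma, suggested))
    return issues


# Made-up (norm, lemma) pairs, not real corpus data.
sample = [
    ("ⲙⲟⲓϩⲉ", "ⲙⲟⲓϩⲉ"),    # variant lemmatized to itself: flagged
    ("ⲙⲟⲉⲓϩⲉ", "ⲙⲟⲉⲓϩⲉ"),  # already the preferred lemma: not flagged
    ("ϩⲟⲓⲧⲉ", "ϩⲟⲓⲧⲉ"),    # variant lemmatized to itself: flagged
    ("ⲩⲛⲟⲩ", "ⲟⲩⲛⲟⲩ"),    # variant norm, preferred lemma: not flagged
]

for issue in find_lemma_issues(sample):
    print(issue)
```

A lookup table like this only catches lemmas that are already known to be variants; finding new inconsistencies would still need searches in ANNIS.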

amir-zeldes commented 7 years ago

I think this is a normalization issue for moihe (ⲙⲟⲓϩⲉ). For unou (ⲩⲛⲟⲩ) it's different: after a vowel, that is actually the expected (normal, and hence norm) spelling.