CopticScriptorium / tagger-part-of-speech

Part of speech tagger for Sahidic Coptic
http://coptic.pacific.edu

lemmatization #3

Open ctschroeder opened 8 years ago

ctschroeder commented 8 years ago

While editing the corpus I See Your Eagerness, I noticed the following:

amir-zeldes commented 8 years ago

A lot of these are actually normalization issues - I've updated the normalizer and these should hopefully be working now. The ones that are lemma problems (missing VSTATs for example) are now in the tagger lexicon but won't update until the next tagger release, which is planned.

Other misc. answers:

ctschroeder commented 8 years ago

Thank you so much!

The ones you can't reproduce may be due to an earlier version of the tagger and lemmatizer. I think some of these pre-date the DDGLC list and other updates.

I'm flexible on ⲛⲁϩⲣⲛ and ⲛⲛⲁϩⲣⲛ. I also wondered if ⲛⲛⲁϩⲣⲛ should be two units, but I saw both as their own lemmas in our existing corpora and became confused. If we go to ⲛⲁϩⲣⲛ only, as both lemma and normalized unit (and as a preposition), that is fine with me; we just need to be sure to update the tokenizer, POS tagger, and lemmatizer, right? Should I make a separate issue?

ctschroeder commented 7 years ago

@amir-zeldes what did we decide on ⲛⲁϩⲣⲛ and ⲛⲛⲁϩⲣⲛ?

amir-zeldes commented 7 years ago

Sorry for dropping the ball on this - I tried and saw that two PREPs really stand out in the tagger preps, so maybe it's better to avoid it. On the other hand, I still feel strongly that nahrn is not the same as nnahrn. So maybe let's make them two distinct norms and lemmas, but in both cases just one norm. What do you think, should we morph it as two units, n+nahrn?

ctschroeder commented 7 years ago

I am not sure this is really a morph, though, which is why I'm hesitant. Crum lists ⲛⲛⲁϩⲣⲛ as ⲛⲁϩⲣⲛ in older manuscripts. But if you feel strongly that they are different, then we should have some way to distinguish. I'm not sure the morph layer is right. Can we check with Eitan, or maybe with both Eitan and Sebastian? It looks like only ⲛⲁϩⲣⲛ appears in the online lexicon. I think my preference is to treat nn... either as two prepositions OR as a regular spelling variant (i.e., nn... is not normalized to one n, but both forms are lemmatized to the same lemma). But if you feel strongly that it's not a variant, and the linguistic community thinks the extra n is not a preposition, then we should probably do as you suggest and make it a morph. Let's consult, though?
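To make the three options concrete, here is a minimal sketch of how each would annotate ⲛⲛⲁϩⲣⲛ. The `Unit` class, the option names, and the field layout are hypothetical and only for illustration; they are not the tagger's or lemmatizer's actual data model.

```python
from dataclasses import dataclass, field
from typing import List

BASE = "ⲛⲁϩⲣⲛ"      # the form that appears in the online lexicon
LONG = "ⲛ" + BASE    # ⲛⲛⲁϩⲣⲛ, the form under discussion


@dataclass
class Unit:
    """One normalized unit with its annotations (hypothetical layout)."""
    norm: str
    pos: str
    lemma: str
    morphs: List[str] = field(default_factory=list)


def analyze_long(option: str) -> List[Unit]:
    """Return the annotation of ⲛⲛⲁϩⲣⲛ under one of the three options."""
    if option == "two_preps":
        # Two normalized units, each tagged as a preposition.
        return [Unit("ⲛ", "PREP", "ⲛ"), Unit(BASE, "PREP", BASE)]
    if option == "variant":
        # One unit; the norm keeps both n's but shares the lemma ⲛⲁϩⲣⲛ.
        return [Unit(LONG, "PREP", BASE)]
    if option == "morph":
        # One unit with its own lemma; the split lives on the morph layer.
        return [Unit(LONG, "PREP", LONG, morphs=["ⲛ", BASE])]
    raise ValueError(f"unknown option: {option}")


for name in ("two_preps", "variant", "morph"):
    print(name, analyze_long(name))
```

Only the first option changes tokenization; the other two differ in which lemma is assigned and in whether the extra n is recorded on the morph layer.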

amir-zeldes commented 7 years ago

Sorry for the slow reply - yes, let's ask Eitan. I'll ping him.

ctschroeder commented 7 years ago

Did we ever get an assessment of ⲛⲛⲁϩⲣⲛ from Eitan?

amir-zeldes commented 7 years ago

Not yet, I can check back.