Handling of portmanteau lemmas

amir-zeldes commented 7 years ago

Searching for portmanteau lemmas currently returns no results. For example, according to the SC guidelines, the lemma of ⲙⲙⲟ 'mmo' (2nd person feminine) is ⲛ_ⲛⲧⲟ 'n_nto':

https://github.com/CopticScriptorium/tagger-part-of-speech/raw/master/Coptic%20SCRIPTORIUM%20lemmatization%20guidelines.pdf

Suggested behavior:

Lemma forms containing an underscore (which cannot otherwise occur in a lemma) should be split at the underscore before the dictionary lookup
Each part is then searched for separately
Any results found are displayed as suggestions for individual entries the user might want to look up
This should work much like looking for an inflected form which the interface realizes is a form of some other lemma, e.g. when searching for a stative: https://corpling.uis.georgetown.edu/coptic-dictionary/results.cgi?quick_search=%E2%B2%95%E2%B2%8F%E2%B2%A7
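
A minimal sketch of the suggested behavior, not the dictionary's actual code: the names split_portmanteau, lookup, toy_search and TOY_ENTRIES are hypothetical, and the search function is passed in as a plain callable because the thread does not show the real lookup interface.

```python
from typing import Callable, Dict, List


def split_portmanteau(lemma: str) -> List[str]:
    """Split a lemma at underscores, dropping any empty parts."""
    return [part for part in lemma.split("_") if part]


def lookup(query: str, search: Callable[[str], List[str]]) -> Dict[str, object]:
    """Look up a lemma; offer per-part suggestions for portmanteau lemmas.

    `search` is a hypothetical stand-in for the real dictionary search:
    it maps one lemma string to a list of matching entries.
    """
    if "_" not in query:
        # Ordinary lemma: search it directly.
        return {"results": search(query), "suggestions": {}}

    # Portmanteau lemma: split before searching, then search each part.
    suggestions: Dict[str, List[str]] = {}
    for part in split_portmanteau(query):
        hits = search(part)
        if hits:
            suggestions[part] = hits
    return {"results": [], "suggestions": suggestions}


# Toy data for illustration only; real entries live in the dictionary DB.
TOY_ENTRIES = {"ⲛ": ["entry for ⲛ"], "ⲛⲧⲟ": ["entry for ⲛⲧⲟ"]}


def toy_search(lemma: str) -> List[str]:
    return TOY_ENTRIES.get(lemma, [])


if __name__ == "__main__":
    print(lookup("ⲛ_ⲛⲧⲟ", toy_search))
```

Keeping suggestions separate from results lets the interface present the parts as "you might be looking for" links rather than as direct hits, as it already does for statives.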

phoenix-mossimo commented 6 years ago

Question: how does the lemmatizer know that ⲙⲙⲟ's lemma is ⲛ_ⲛⲧⲟ and ⲕⲏⲧ's lemma is ⲕⲱⲧ? We would definitely make use of such information, for example to extend the dictionary with the available stative forms and forms of pronouns (including portmanteau forms). Is there something like a spreadsheet?

amir-zeldes commented 6 years ago

There are two resources that generate this information:

Live info from all lemmas currently in ANNIS, harvested using this script (it updates the DB by default, but you can have it output a text file instead using the outmode argument): https://github.com/KELLIA/dictionary/blob/master/utils/make_lemma_table.py
The static file from the CMCL + DDGLC lemma list, which is also regularly updated with some more 'reliable' Scriptorium data, and is available here: https://github.com/CopticScriptorium/tokenizers/blob/master/copt_lex.tab
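
As a rough illustration of how a static lemma list could answer the question above, the sketch below builds a form-to-lemma table from tab-separated lines. The column layout (form, tag, lemma) is an assumption that this thread does not confirm, so check it against the actual file; build_form_to_lemmas is a hypothetical helper, and the tag value "TAG" in the sample rows is a placeholder, not a real part-of-speech tag.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_form_to_lemmas(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each word form to the set of lemmas listed for it.

    ASSUMPTION: each line has at least three tab-separated fields,
    in the order form, tag, lemma. Shorter lines are skipped.
    """
    table: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        form, lemma = fields[0], fields[2]
        table[form].add(lemma)
    return dict(table)


# Sample rows use the form/lemma pairs mentioned in this thread;
# "TAG" is a placeholder for whatever tag the real file carries.
SAMPLE_LINES = [
    "ⲙⲙⲟ\tTAG\tⲛ_ⲛⲧⲟ\n",
    "ⲕⲏⲧ\tTAG\tⲕⲱⲧ\n",
    "malformed line without tabs\n",
]

if __name__ == "__main__":
    print(build_form_to_lemmas(SAMPLE_LINES))
```

To run this against a downloaded copy of the file, pass an open file handle instead of the sample list, for example `build_form_to_lemmas(open(path, encoding="utf-8"))`, where `path` is wherever the file was saved locally.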

Hope this helps!

phoenix-mossimo commented 6 years ago

Yes, exactly. Thanks!

KELLIA / dictionary

Handling of portmanteau lemmas #23