KELLIA / dictionary

The dictionary comprises the Coptic lexicon created by the BBAW and an interface by Coptic SCRIPTORIUM. Currently deployed at https://coptic-dictionary.org

Handling of portmanteau lemmas #23

Open amir-zeldes opened 7 years ago

amir-zeldes commented 7 years ago

Searching for portmanteau lemmas currently retrieves nothing. For example, according to the SC guidelines, the lemma of ⲙⲙⲟ 'mmo' (2nd person feminine) is ⲛ_ⲛⲧⲟ 'n_nto':

https://github.com/CopticScriptorium/tagger-part-of-speech/raw/master/Coptic%20SCRIPTORIUM%20lemmatization%20guidelines.pdf

Suggested behavior:

phoenix-mossimo commented 6 years ago

Question: how does the lemmatizer know that ⲙⲙⲟ's lemma is ⲛ_ⲛⲧⲟ and ⲕⲏⲧ's lemma is ⲕⲱⲧ? We would definitely make use of such information, for example to extend the dictionary with the available stative forms and pronoun forms (including portmanteau forms). Is there something like a spreadsheet?

amir-zeldes commented 6 years ago

There are two resources that generate this information:

  1. Live info from all lemmas currently in ANNIS, harvested using this script (it updates the DB by default, but you can have it output a text file instead using the outmode argument): https://github.com/KELLIA/dictionary/blob/master/utils/make_lemma_table.py

  2. The static file from the CMCL + DDGLC lemma list, which is also regularly updated with some more 'reliable' Scriptorium data, and is available here: https://github.com/CopticScriptorium/tokenizers/blob/master/copt_lex.tab
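
As a rough illustration of how a static form-to-lemma list like this could be used, here is a minimal Python sketch. It assumes a tab-separated layout of word form, POS tag, and lemma per line; that column order, the sample rows, the placeholder tags, and the helper names (`load_lemma_table`, `forms_by_lemma`, `is_portmanteau`) are all assumptions made for the example and should be checked against the actual file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in an ASSUMED layout: form <TAB> POS <TAB> lemma.
# The tags are placeholders, not taken from the real file.
SAMPLE = (
    "ⲙⲙⲟ\tPREP\tⲛ_ⲛⲧⲟ\n"
    "ⲕⲏⲧ\tVSTAT\tⲕⲱⲧ\n"
    "ⲕⲱⲧ\tV\tⲕⲱⲧ\n"
)


def load_lemma_table(handle):
    """Map each word form to the set of (pos, lemma) pairs listed for it."""
    table = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        form, pos, lemma = row[0], row[1], row[2]
        table[form].add((pos, lemma))
    return table


def forms_by_lemma(table):
    """Invert the table: lemma -> all forms attested for it."""
    inverse = defaultdict(set)
    for form, analyses in table.items():
        for _pos, lemma in analyses:
            inverse[lemma].add(form)
    return inverse


def is_portmanteau(lemma):
    """Treat an underscore in the lemma as marking a portmanteau (as in ⲛ_ⲛⲧⲟ)."""
    return "_" in lemma


table = load_lemma_table(io.StringIO(SAMPLE))
inverse = forms_by_lemma(table)
```

To try this on a downloaded copy of the file, the `io.StringIO(SAMPLE)` argument could be swapped for `open("copt_lex.tab", encoding="utf-8")`. The inverted table is the part that would let a dictionary entry list its stative or pronominal forms.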

Hope this helps!

phoenix-mossimo commented 6 years ago

Yes, exactly. Thanks!