Open amir-zeldes opened 7 years ago
Question: how does the lemmatizer know that ⲙⲙⲟ's lemma is ⲛ_ⲛⲧⲟ and ⲕⲏⲧ's lemma is ⲕⲱⲧ? We would definitely make use of such information, for example to extend the dictionary with the available stative forms and with forms of pronouns (including portmanteau forms). Is there something like a spreadsheet?
There are two resources that generate this information:
Live info from all lemmas currently in ANNIS, harvested using this script (it updates the DB by default, but you can have it output a text file instead using the outmode argument): https://github.com/KELLIA/dictionary/blob/master/utils/make_lemma_table.py
The static file from the CMCL + DDGLC lemma list, which is also regularly updated with some more 'reliable' Scriptorium data, and is available here: https://github.com/CopticScriptorium/tokenizers/blob/master/copt_lex.tab
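For anyone who wants to use the static file for lookups, here is a minimal sketch of turning a tab-separated lexicon into a form-to-lemma table. It assumes each row has three columns (word form, POS tag, lemma); check that against the actual copt_lex.tab before relying on it. The sample rows are hypothetical: the forms and lemmas are the two from the question above, the POS tags are placeholders, and `load_lemma_table` is an illustrative name, not part of any Scriptorium tool.

```python
from collections import defaultdict

# Hypothetical sample rows in an ASSUMED layout: form <TAB> POS <TAB> lemma.
# The POS tags are placeholders; only the form/lemma pairs come from this thread.
SAMPLE = "ⲙⲙⲟ\tPREP\tⲛ_ⲛⲧⲟ\nⲕⲏⲧ\tVSTAT\tⲕⲱⲧ\n"


def load_lemma_table(lines):
    """Map each word form to the set of lemmas listed for it."""
    table = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        form, lemma = fields[0], fields[2]
        table[form].add(lemma)
    return dict(table)


table = load_lemma_table(SAMPLE.splitlines())
for form in ("ⲙⲙⲟ", "ⲕⲏⲧ"):
    print(form, sorted(table.get(form, ())))
```

To read the real file, pass an open file handle instead of the sample, e.g. `load_lemma_table(open("copt_lex.tab", encoding="utf8"))`. A set is used per form because one surface form may map to more than one lemma.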
Hope this helps!
Yes, exactly. Thanks!
Searching for portmanteau lemmas currently returns nothing. For example, according to the SC guidelines, the lemma of ⲙⲙⲟ 'mmo' (2nd person feminine) is ⲛ_ⲛⲧⲟ 'n_nto':
https://github.com/CopticScriptorium/tagger-part-of-speech/raw/master/Coptic%20SCRIPTORIUM%20lemmatization%20guidelines.pdf
Suggested behavior: