Lemmatizer performance - Githubissues

flammie / omorfi

Open morphology for Finnish


I installed Omorfi strictly following the instructions, then tested the Python API lemmatizer with the following simple code:

    from omorfi.omorfi import Omorfi
    omorfi = Omorfi()
    omorfi.load_from_dir()
    result = omorfi.lemmatise(some_test_word)

However, the lemmatization quality does not seem to be on par with the FINTWOL online tool. A few examples:

kissattomat:
    FINTWOL: "kissaton" DN-TON A POS NOM P
    Omorfi: ('kissattomat', inf)

autottomat:
    FINTWOL: "autoton" DN-TON A POS NOM PL
    Omorfi: (('auto', 0.0), ('autoton', 0.0))

tipattomille:
    FINTWOL: "tipaton" DN-TON A POS ALL PL
    Omorfi: ('tipattomille', inf)

Here 'inf' means that the word could not be lemmatised (unknown word). The dictionary of the Omorfi/HFST tool seems very limited. Is there anything I can do to improve Omorfi's results, e.g. install additional components or change Omorfi settings? I would like to get results similar to FINTWOL.
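For anyone who wants to list the unknown words in a larger test set, here is a minimal sketch. It assumes lemmatise() returns either a single (lemma, weight) pair or a tuple of such pairs, exactly as in the outputs quoted above; other Omorfi versions may return something different. The helper names are made up for illustration and the sample data is copied from the report, so the snippet runs without Omorfi installed.

```python
import math


def normalise(result):
    """Return a list of (lemma, weight) pairs for either result shape."""
    # Shape 1: a single pair such as ('kissattomat', inf)
    if len(result) == 2 and isinstance(result[0], str):
        return [tuple(result)]
    # Shape 2: a tuple of pairs such as (('auto', 0.0), ('autoton', 0.0))
    return [tuple(pair) for pair in result]


def known_lemmas(result):
    """Keep only lemmas whose weight is finite (inf marks an unknown word)."""
    return [lemma for lemma, weight in normalise(result)
            if not math.isinf(weight)]


# Results copied from the report above (assumed output shapes).
report = {
    "kissattomat": ("kissattomat", float("inf")),
    "autottomat": (("auto", 0.0), ("autoton", 0.0)),
    "tipattomille": ("tipattomille", float("inf")),
}

unknown = [word for word, result in report.items()
           if not known_lemmas(result)]
print(unknown)
```

With real data, the values of `report` would come from omorfi.lemmatise(word) instead of being typed in by hand.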

Good question! There are two ways to contribute lexical data like this. The usually preferred one is simply to add the words to the dictionary. These go in src/lexemes.tsv, and each new word requires four fields: lemma, homonym number, inflectional paradigm, and source of origin (for copyright etc.). I have added these two in commit https://github.com/flammie/omorfi/commit/d58cc503f5c3f515b3c3a20abf794ab7592f37de. The other way is to extend the derivational system, i.e. go through the paradigms and add the suffix, in this case the caritive, to the right stem; you can see these two paradigms in commit https://github.com/flammie/omorfi/commit/a7f904c08330d60f3ae722ca2ecb9d7c9e5ab317.
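To make the four-field format concrete, here is a rough sketch that builds tab-separated rows in the field order described above. The paradigm and origin values are placeholders, not real Omorfi names, and the homonym number 1 is an assumption; check existing entries in src/lexemes.tsv and the linked commit for the actual values. The helper functions are hypothetical and not part of Omorfi.

```python
import csv
import io

# Field order as described above: lemma, homonym number, paradigm, origin.
FIELDS = ("lemma", "homonym", "paradigm", "origin")


def lexeme_row(lemma, homonym, paradigm, origin):
    """Build one row, rejecting empty fields and embedded tabs."""
    row = (lemma, str(homonym), paradigm, origin)
    if any(not field or "\t" in field for field in row):
        raise ValueError("fields must be non-empty and tab-free")
    return row


def to_tsv(rows):
    """Serialise rows as tab-separated text, one lexeme per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# PARADIGM_PLACEHOLDER and origin-placeholder are NOT real values.
rows = [
    lexeme_row("kissaton", 1, "PARADIGM_PLACEHOLDER", "origin-placeholder"),
    lexeme_row("tipaton", 1, "PARADIGM_PLACEHOLDER", "origin-placeholder"),
]
print(to_tsv(rows), end="")
```

The printed lines could then be appended to src/lexemes.tsv once the placeholders are replaced; the dictionary presumably has to be rebuilt afterwards for the new words to be recognised.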

As for FINTWOL-style analyses, those are not planned, but if there is a simple mapping from omorfi.analyse() results to FINTWOL, it can always be included.
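Such a mapping might look roughly like the sketch below. Everything in it is an assumption: the bracketed [KEY=VALUE] input string is an assumed analysis format rather than confirmed omorfi.analyse() output, the tag table covers only the FINTWOL tags seen in this issue, and derivation tags such as DN-TON are left out entirely.

```python
import re

# Assumed analysis format: a sequence of [KEY=VALUE] tags.
TAG_RE = re.compile(r"\[([A-Z_]+)=([^\]]+)\]")

# Illustrative table only; limited to tags that appear in this issue.
FINTWOL_TAGS = {
    ("UPOS", "ADJ"): "A",
    ("CMP", "POS"): "POS",
    ("CASE", "NOM"): "NOM",
    ("CASE", "ALL"): "ALL",
    ("NUM", "SG"): "SG",
    ("NUM", "PL"): "PL",
}

# Tag order in the FINTWOL examples: word class, comparison, case, number.
ORDER = ("UPOS", "CMP", "CASE", "NUM")


def to_fintwol(analysis):
    """Convert one bracketed analysis string to a FINTWOL-like reading."""
    tags = dict(TAG_RE.findall(analysis))
    parts = ['"{}"'.format(tags.get("WORD_ID", ""))]
    for key in ORDER:
        mapped = FINTWOL_TAGS.get((key, tags.get(key)))
        if mapped is not None:
            parts.append(mapped)
    return " ".join(parts)


# Hypothetical input string, written by hand for this example.
sample = "[WORD_ID=autoton][UPOS=ADJ][CMP=POS][NUM=PL][CASE=NOM]"
print(to_fintwol(sample))
```

Tags that are missing from the table are silently skipped, so an incomplete table yields a shorter reading instead of an error.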

Lemmatizer performance #41