A few (compound) nouns with incorrect paradigms

anmoisio commented 1 day ago

There seem to be a few nouns with incorrect paradigms:

- 18 compound nouns ending in 'teline' have the paradigm NOUN_ASTE, which gives incorrect vowel harmony in cases such as ADE. The paradigm should probably be NOUN_PISTE (the same as for 'teline'). It seems that the vowel harmony is taken from the first part of the compound when it should be taken from the last part.
  - autoteline
  - kantoteline
  - keittoteline
  - ... see the rest by searching for 'teline noun noun_aste' in https://raw.githubusercontent.com/flammie/omorfi/refs/heads/main/src/lexemes.tsv
- Similarly, 25 compound nouns ending in 'vene' have the paradigm NOUN_ASTE.
- 8 compound nouns ending in 'piikki' have the paradigm NOUN_RUUVI, which gives wrong consonant gradation and vowel harmony, e.g. "hintapiikkilla" instead of "hintapiikillä"; the paradigm should be NOUN_HÄKKI.
- The paradigm of 'kuratointi' is NOUN_RUUVI; it should be NOUN_SOINTI for correct consonant gradation.
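
To illustrate the vowel harmony point, here is a minimal Python sketch. It is not omorfi code: the function names are made up for this example, the compound boundaries are supplied by hand, and the harmony rule is simplified (the last harmonic vowel decides, and words with only neutral vowels take front endings).

```python
BACK_VOWELS = set("aou")
FRONT_VOWELS = set("äöy")


def harmony(word):
    """Return 'back' or 'front' for a word treated as a single unit.

    Simplified rule: the last harmonic vowel decides; words containing
    only neutral vowels (e, i) take front endings.
    """
    for ch in reversed(word.lower()):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    return "front"


def compound_harmony(parts):
    """Harmony of a compound is decided by its last part only."""
    return harmony(parts[-1])


def adessive_ending(parts):
    """Pick the adessive (ADE) ending for a compound given as its parts."""
    return "lla" if compound_harmony(parts) == "back" else "llä"


# Scanning the whole compound picks up the 'o' of 'auto' and gets it wrong.
print(harmony("autoteline"))                  # back
# Looking only at the last part gives the expected front harmony.
print(compound_harmony(["auto", "teline"]))   # front
print(adessive_ending(["hinta", "piikki"]))   # llä
```

The sketch only shows why the choice of which part to inspect matters; in a real lexicon the compound boundary would have to come from the data or from a segmenter.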

It seems that the paradigm guessing script(s) (omorfi/src/python/omorfi/entryguessing/guess_new_class.py?) should be debugged and then rerun to catch all similar paradigm errors. I can also help with this if needed.
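
One way to catch similar errors without touching the guessing scripts might be a consistency check: flag every lemma that ends in a shorter known lemma but has a different paradigm. The sketch below assumes a tab-separated file with the lemma and the paradigm in known columns; the column indices, the function names, and the sample rows are all hypothetical and only illustrate the idea.

```python
import csv


def read_entries(lines, lemma_col=0, paradigm_col=2):
    """Read (lemma, paradigm) pairs from tab-separated lines.

    The column positions are assumptions; adjust them to the real file,
    and skip any header row before passing the lines in.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    needed = max(lemma_col, paradigm_col)
    return [(row[lemma_col], row[paradigm_col])
            for row in reader if len(row) > needed]


def find_suspects(entries, min_head_len=4):
    """Return lemmas whose paradigm differs from that of their head word.

    For each lemma, the longest proper suffix that is itself a known lemma
    is taken as the head of the compound.
    """
    by_lemma = {}
    for lemma, paradigm in entries:
        by_lemma.setdefault(lemma, set()).add(paradigm)

    suspects = []
    for lemma, paradigm in entries:
        # start=1 gives the longest proper suffix; later starts are shorter.
        for start in range(1, len(lemma) - min_head_len + 1):
            head = lemma[start:]
            head_paradigms = by_lemma.get(head)
            if head_paradigms is None:
                continue
            if paradigm not in head_paradigms:
                suspects.append((lemma, paradigm, head, sorted(head_paradigms)))
            break
    return suspects


# Invented sample rows in an assumed lemma/homonym/paradigm/origin layout.
sample = [
    "teline\t1\tNOUN_PISTE\tkotus",
    "autoteline\t1\tNOUN_ASTE\tenwikt",
    "piikki\t1\tNOUN_HÄKKI\tkotus",
    "hintapiikki\t1\tNOUN_RUUVI\tenwikt",
]
for suspect in find_suspects(read_entries(sample)):
    print(suspect)
```

A suffix match is only a heuristic: a word can end in the letters of another lemma without being a compound of it, so the output would still need manual review before changing lexemes.tsv.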

flammie commented 1 day ago

Yes, there are quite a few of these. The culprit is indeed the words imported from external sources, mostly Wiktionaries, in recent years. Each importer script has its own paradigm guessing, hacked together in a very piecemeal manner (src/external/fiwikt2omorfi.bash, src/externals/enwikt2omorfi.bash, etc.). While I have read through the guessed paradigms on import, there are surely large groups of uncaught failures like this. If you can help fix the scripts or the existing data one way or another, I'll happily accept pull requests :-)

anmoisio commented 19 hours ago

Since fixing those scripts might not be simple or quick, would it be possible to get some estimate of the quality of the words from the different external sources for the time being? The nouns with incorrect paradigms above seem to come from enwikt. Are words from e.g. kotus assigned to paradigms more accurately, perhaps even with 100% accuracy? This information would be useful for my application, which just takes a random sample of word forms with a specific inflection. If I knew that kotus words have 100% correct paradigms, I could use those words exclusively.

Thanks a lot for your help! I will take a look at the importer scripts at some point to see whether I can fix (some part of) them in a reasonable amount of time. If not, I will just fix the words mentioned above in lexemes.tsv and open a PR for that.
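
For reference, the kind of sampling I have in mind could look roughly like the sketch below. It assumes each entry records its source in an "origin" field; that field name, the function name, and the sample rows are hypothetical, and nothing here says that any particular source is actually error-free.

```python
import random


def sample_by_origin(rows, trusted_origins, k, seed=None):
    """Draw up to k entries whose origin is in the trusted set.

    rows: list of dicts with at least an 'origin' key (assumed layout).
    """
    pool = [row for row in rows if row["origin"] in trusted_origins]
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(pool, min(k, len(pool)))


# Invented rows for illustration only.
rows = [
    {"lemma": "teline", "paradigm": "NOUN_PISTE", "origin": "kotus"},
    {"lemma": "piikki", "paradigm": "NOUN_HÄKKI", "origin": "kotus"},
    {"lemma": "autoteline", "paradigm": "NOUN_ASTE", "origin": "enwikt"},
]
for row in sample_by_origin(rows, {"kotus"}, k=2, seed=0):
    print(row["lemma"], row["paradigm"])
```

Filtering before sampling keeps untrusted entries out of the pool entirely, at the cost of a smaller vocabulary.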

flammie / omorfi

A few (compound) nouns with incorrect paradigms #84