Open anmoisio opened 1 day ago
Yes, there are quite a few of these. The culprit is indeed the words imported from external sources, mostly wiktionaries, in recent years. Each importer script has its own paradigm guessing that has been hacked together in a very piecemeal manner, in src/external/fiwikt2omorfi.bash, src/externals/enwikt2omorfi.bash etc. While I have read through the guessed paradigms on import, there are surely large groups of uncaught failures like this. If you can help by fixing the scripts or the existing data in one way or another, I'll happily accept pull requests :-)
Since fixing those scripts might not be simple or quick, would it be possible to get some estimate of the quality of the words from the different external sources for the time being? The nouns with incorrect paradigms seem to be from enwikt. Are words from e.g. kotus assigned to paradigms more accurately, perhaps even with 100% accuracy? This information would be useful for my application, which just takes a random sample of word forms with a specific inflection. If I knew that kotus words have 100% correct paradigms, I could use those words exclusively. Thanks a lot for your help! I will take a look at the importer scripts at some point to see whether I can fix (some part of) them in a reasonable amount of time. If not, I will just fix the words mentioned above in lexemes.tsv and make a PR of that.
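For reference, here is a minimal sketch of the kind of filtering I have in mind. It assumes that lexemes.tsv is tab-separated with a header row containing `lemma` and `origin` columns; I have not verified the exact column names, so treat them as assumptions. The function name and the example rows (including the `PARA_*` paradigm labels) are made up for illustration.

```python
import csv
import io
import random


def sample_lemmas_by_origin(tsv_file, origin, k, seed=None):
    """Return up to k randomly chosen lemmas whose origin column equals `origin`.

    Assumes a header row with 'lemma' and 'origin' columns (an assumption,
    not a verified description of lexemes.tsv).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    lemmas = [row["lemma"] for row in reader if row.get("origin") == origin]
    rng = random.Random(seed)
    # Never ask for more items than are available.
    return rng.sample(lemmas, min(k, len(lemmas)))


# Made-up stand-in for lexemes.tsv; the paradigm names are placeholders.
EXAMPLE_TSV = (
    "lemma\thomonym\tnew_para\torigin\n"
    "talo\t1\tPARA_A\tkotus\n"
    "kala\t1\tPARA_B\tkotus\n"
    "blogi\t1\tPARA_C\tenwikt\n"
)

if __name__ == "__main__":
    print(sample_lemmas_by_origin(io.StringIO(EXAMPLE_TSV), "kotus", 2, seed=0))
```

With the real file, the call would be something like `sample_lemmas_by_origin(open("lexemes.tsv", encoding="utf-8", newline=""), "kotus", 100)`, provided the column names match.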
There seem to be a few nouns with incorrect paradigms:
It seems that the paradigm guessing script(s) (omorfi/src/python/omorfi/entryguessing/guess_new_class.py ?) should be debugged and then run again to catch all similar errors in the paradigms? I can also help with this if needed.