apertium / apertium-kaz

Apertium linguistic data for Kazakh
https://apertium.github.io/apertium-kaz/
GNU General Public License v3.0

Redundant and miscategorized stems in apertium-kaz.kaz.lexc #11

Open IlnarSelimcan opened 5 years ago

IlnarSelimcan commented 5 years ago

The vocabulary of apertium-kaz.kaz.lexc requires checking for redundancy, consistency and miscategorizations. Here are some examples:

кептірген:кептірген A1 ; ! ""
аршылған:аршылған A1 ; ! ""
жонылған:жонылған A1 ; ! ""
сүрілген:сүрілген A1 ; ! ""

Along with that, the reasons why these are considered mistakes and, more generally, the choices made should be documented in apertium-kaz/docs so that these kinds of issues don't happen in the future.

At this point, since the coverage of apertium-kaz is relatively high, that documentation will probably be more useful for other (Turkic) languages than for Kazakh.

IlnarSelimcan commented 5 years ago

Instead of going over the list of stems found in kaz.lexc and checking them, I decided to start with surface forms from a frequency list made from the Kazakh translation of the Little Prince, rather than from the subset of kitap.kz books (the ones which are presumably all in the public domain), just to try the idea on something smaller. You can find the words from that frequency list (the ones I already tested) here: https://github.com/taruen/apertiumpp/blob/master/data4apertium/vocabulary/kaz.rkt

My logic here was that:

Still, all stems currently in apertium-kaz.kaz.lexc will have to be checked. Once I'm done with surface forms from the Little Prince (and maybe the public domain subset of kitap.kz), I'll just take the difference between the stem list in apertium-kaz.kaz.lexc and the wordlist in https://github.com/taruen/apertiumpp/blob/master/data4apertium/vocabulary/kaz.rkt as what remains to be checked.

This is a reminder for myself to do that.
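
The "difference" step described above could be sketched roughly as follows. This is only an illustration in Python, not the actual tooling behind kaz.rkt: the regular expression assumes simple one-line entries of the form `lemma:stem CONTINUATION ;`, the function names are hypothetical, and it assumes the already-checked words have been reduced to lemmas.

```python
import re

# Assumption: one simple entry per line, e.g.  кептірген:кептірген A1 ; ! ""
# Real lexc files also contain escapes, multi-line entries, etc., which this ignores.
ENTRY_RE = re.compile(r"^\s*([^\s:!]+):(\S+)\s+(\S+)\s*;")

def lexc_lemmas(lexc_text):
    """Return the set of lemmas (left-hand sides) found in the lexc text."""
    lemmas = set()
    for line in lexc_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            lemmas.add(match.group(1))
    return lemmas

def remaining_to_check(lexc_text, checked_lemmas):
    """Lemmas that are in the lexicon but not yet in the checked wordlist."""
    return lexc_lemmas(lexc_text) - set(checked_lemmas)

sample = """\
кептірген:кептірген A1 ; ! ""
аршылған:аршылған A1 ; ! ""
жонылған:жонылған A1 ; ! ""
"""
# Pretend that only one lemma has been checked so far.
print(sorted(remaining_to_check(sample, {"аршылған"})))
```

Comment lines starting with `!` and `LEXICON` headers do not match the pattern, so they are skipped.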

jonorthwash commented 5 years ago

Note, a GCI student wrote a lexc parser and lexicon deduplicator a couple years ago. Let me know if you want help digging it up.
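
For reference, the core of a lexicon deduplication check might look something like the sketch below. This is not the GCI student's tool; it is a minimal Python illustration that assumes simple one-line `lemma:stem CONTINUATION ;` entries, and `find_repeated_lemmas` is a hypothetical name.

```python
import re
from collections import defaultdict

# Assumption: simple one-line entries of the form "lemma:stem CONT ;"
ENTRY_RE = re.compile(r"^\s*([^\s:!]+):(\S+)\s+(\S+)\s*;")

def find_repeated_lemmas(lexc_text):
    """Map each lemma occurring more than once to its (line, stem, continuation) entries."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lexc_text.splitlines(), start=1):
        match = ENTRY_RE.match(line)
        if match:
            lemma, stem, cont = match.groups()
            seen[lemma].append((lineno, stem, cont))
    return {lemma: entries for lemma, entries in seen.items() if len(entries) > 1}

# The first two lines come from the issue; the third is a made-up duplicate
# added purely for illustration.
sample = """\
кептірген:кептірген A1 ; ! ""
аршылған:аршылған A1 ; ! ""
кептірген:кептірген A1 ; ! ""
"""
for lemma, entries in find_repeated_lemmas(sample).items():
    print(lemma, entries)
```

Entries that repeat with the same continuation class are plain redundancy; entries that repeat with different continuation classes would need a manual look.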

jonorthwash commented 5 years ago

Relevant tools: https://github.com/apertium/apertium-on-github/issues/51

IlnarSelimcan commented 5 years ago

It turns out that the explanatory dictionary of Kazakh has been put online at kitap.kz. So the task is, at a minimum, to check the POS of apertium-kaz.kaz.lexc stems against that dictionary.

However, that dictionary might be under some CC license, as other things on kitap.kz seem to be. If it is, then example sentences and explanations could be used in the apertium project too. I'll need to figure out which particular license that dictionary is published under. Also see: https://yvision.kz/post/416129
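
The POS check mentioned above could be sketched along these lines. Everything here is an assumption for illustration: the mapping from continuation classes to POS labels, the `dictionary_pos` lookup (the dictionary's actual format and contents are not known here), and the function name. The dictionary values in the sample are placeholders, not data taken from kitap.kz.

```python
import re

# Assumption: simple one-line entries of the form "lemma:stem CONT ;"
ENTRY_RE = re.compile(r"^\s*([^\s:!]+):(\S+)\s+(\S+)\s*;")

# Assumption: a hand-written mapping from continuation class to a coarse POS label.
CONT_TO_POS = {"A1": "adjective", "N1": "noun"}

def pos_mismatches(lexc_text, dictionary_pos):
    """Return (lemma, lexc_pos, dictionary_pos) for entries where the two disagree."""
    mismatches = []
    for line in lexc_text.splitlines():
        match = ENTRY_RE.match(line)
        if not match:
            continue
        lemma, _stem, cont = match.groups()
        lexc_pos = CONT_TO_POS.get(cont)
        dict_pos = dictionary_pos.get(lemma)
        if lexc_pos is None or dict_pos is None:
            continue  # unknown class or lemma missing from the dictionary: nothing to compare
        if lexc_pos != dict_pos:
            mismatches.append((lemma, lexc_pos, dict_pos))
    return mismatches

sample = """\
кептірген:кептірген A1 ; ! ""
аршылған:аршылған A1 ; ! ""
"""
# Placeholder POS labels, made up for this example only.
placeholder_dictionary = {"кептірген": "verb", "аршылған": "adjective"}
print(pos_mismatches(sample, placeholder_dictionary))
```

Lemmas missing from the dictionary are silently skipped in this sketch; in practice they would probably deserve their own list for manual review.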