divvun / CorpusTools

Tools to manage and convert GiellaLT corpus files
https://giellalt.github.io/CorpusTools/
GNU General Public License v3.0

Fix word model file for rmy-rlo, and add language code mapping #5

Closed pierrebeauguitte closed 1 year ago

pierrebeauguitte commented 1 year ago

The language model file rm-rlo.wm should be renamed to rmy-rlo.wm to follow the file naming convention. Currently the file is ignored when creating a Classifier in text_cat.py.
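A check along these lines could catch such a misnamed file before it is silently skipped. This is only a minimal sketch, not the logic in text_cat.py: it assumes model files are named after a three-letter language code with an optional three-letter suffix, and that the extensions are `.lm` and `.wm`. The function name `misnamed_models` and the pattern are hypothetical.

```python
import re
from pathlib import Path

# Assumed convention (not taken from text_cat.py): three lowercase letters,
# an optional "-xxx" suffix, then a ".lm" or ".wm" extension.
MODEL_NAME = re.compile(r"^[a-z]{3}(-[a-z]{3})?\.(lm|wm)$")


def misnamed_models(filenames):
    """Return the file names that do not match the assumed convention."""
    return sorted(
        name for name in filenames if not MODEL_NAME.match(Path(name).name)
    )


# "rm-rlo.wm" starts with a two-letter code, so it is the only one reported.
print(misnamed_models(["lm/rmy-rlo.wm", "lm/rm-rlo.wm", "lm/sme.wm"]))
```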

A good number (19) of the language models in the lm/ directory use language codes that are not defined in ISO 639-2. I looked for the codes in the repository and on https://giellalt.github.io/, but haven't been able to find them. Could you document what they mean?

albbas commented 1 year ago

We use ISO 639-3 language codes internally, and convert them to ISO 639-2 when that is needed and applicable.

They are fetched from https://iso639-3.sil.org/code_tables/download_tables. We have a copy of that in our repo giella-core.
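Such a conversion could look roughly like the following. This is a sketch, not the code used in CorpusTools or giella-core: it assumes a tab-separated table with an `Id` column (ISO 639-3) and a `Part2B` column (ISO 639-2), so the column names should be checked against the actual downloaded file. The inline sample rows and the helper names `load_part2_mapping` and `to_iso639_2` are illustrative only.

```python
import csv
import io

# Tiny inline sample in the assumed layout of the code table
# (tab-separated; only the columns used here are included).
SAMPLE_TABLE = (
    "Id\tPart2B\tRef_Name\n"
    "sme\tsme\tNorthern Sami\n"
    "fin\tfin\tFinnish\n"
    "rmy\t\tVlax Romani\n"
)


def load_part2_mapping(table_text):
    """Map each ISO 639-3 code to its ISO 639-2 code, or None if it has none."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row["Id"]: row["Part2B"] or None for row in reader}


def to_iso639_2(code, mapping):
    """Convert a code such as "sme" or "rmy-rlo"; any suffix is ignored."""
    base = code.split("-", 1)[0]
    return mapping.get(base)


mapping = load_part2_mapping(SAMPLE_TABLE)
print(to_iso639_2("sme", mapping), to_iso639_2("rmy-rlo", mapping))
```

Returning `None` covers the "when applicable" case: a code like rmy has no ISO 639-2 counterpart of its own.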

The -rlo suffix was added by @Trondtr because the rmy code was too imprecise. I'll ask him to add a separate comment on that.

pierrebeauguitte commented 1 year ago

@albbas Thank you for the fix and for your answer!

Trondtr commented 1 year ago

The name rm-rlo.wm is a typo; I intended it to be rmy-rlo.wm, as you correctly point out. So just rename it.

As for the more principled issue, the code rlo (and its twin rka) for Lovari and Kalderash are totally ad hoc. What is intended is: rmy = Vlax Romani (the correct ISO code), with the addition -rXX meaning -r for Romani and XX for some subgroup, here Lovari and Kalderash. There are several more varieties: according to Glottolog there are 12 subnodes to Vlax, to be precise (https://glottolog.org/resource/languoid/id/vlax1238). How many of them possess a written standard, as these two do in Sweden, I do not know.

I set the Lovari and Kalderash versions of Vlax Romani as rmy-rlo and rmy-rka; better would perhaps be rmy-x-Lovari and rmy-x-Kalderash (the Glottolog codes are lova1240 and kald1238, respectively). My understanding is that -x-XXX (where XXX is some private subgrouping) is a common way of extending codes. At least it is transparent. Be that as it may, if the -rlo and -rka affixes work for the moment, we should perhaps wait for a more general makeover before changing things. The Right Thing To Do is of course to file an addition to ISO 639-3; there are procedures for that (the reference to 639-2 is irrelevant).
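If the codes were ever migrated, the mapping could be sketched as below. This assumes the -x- convention meant here is the private-use subtag of BCP 47 language tags, and it uses the Glottolog codes quoted above as the private subtags. The names `GLOTTOCODES` and `to_private_use` are hypothetical; nothing like this exists in CorpusTools as far as this thread shows.

```python
import re

# Ad hoc suffixes from this thread, mapped to the Glottolog codes quoted above.
GLOTTOCODES = {"rlo": "lova1240", "rka": "kald1238"}

# BCP 47 private-use subtags are 1 to 8 alphanumeric characters.
PRIVATE_SUBTAG = re.compile(r"^[A-Za-z0-9]{1,8}$")


def to_private_use(tag):
    """Rewrite e.g. "rmy-rlo" as "rmy-x-lova1240"; plain codes pass through."""
    base, sep, suffix = tag.partition("-")
    if not sep:
        return base
    subtag = GLOTTOCODES.get(suffix)
    if subtag is None or not PRIVATE_SUBTAG.match(subtag):
        raise ValueError(f"no private-use mapping for suffix {suffix!r}")
    return f"{base}-x-{subtag}"


print(to_private_use("rmy-rlo"), to_private_use("rmy-rka"))
```

One reason the sketch uses the Glottolog codes rather than the full names: "Kalderash" has nine letters, one more than BCP 47 allows in a single subtag, whereas lova1240 and kald1238 both fit.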