Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Support macrolanguages #1673

Open trang opened 6 years ago

trang commented 6 years ago

Problem

We currently have several languages which are defined in the ISO 639-3 specification as macrolanguages.

For instance Arabic (ara) encapsulates various individual languages: https://iso639-3.sil.org/code/ara.

In Tatoeba, however, we handle ara the same way we handle individual languages.

If we want to follow the ISO 639-3 specification correctly, then searching/browsing sentences in ara should show sentences from all the individual languages that are part of the ara macrolanguage.

This is an issue we noticed a couple of years ago (cf. #1079). It was recently brought up again by cueyayotl on the Wall.

Solutions

I suggested a solution for Arabic, which I wasn't fond of back then and am even less fond of today: when we have to deal with languages that belong to a macrolanguage, we only add the macrolanguage and the individual languages are handled with tags.

While this solution is technically possible, I think it's pretty clear that no one will like it, not only because adding tags is extremely impractical, but also because merging the individual languages into a macrolanguage gives less recognition to the individual languages. Every individual language should be treated the same.

Our best solution is probably to add a column macrolang in the sentences and/or languages table(s).
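
To illustrate the column idea, here is a minimal sketch using Python's built-in sqlite3 module. The schema is hypothetical and much simpler than Tatoeba's real tables; the `macrolang` column name comes from the proposal above, everything else (table layout, function name, placeholder rows) is an assumption made for the example.

```python
import sqlite3

# Hypothetical minimal schema; Tatoeba's real tables differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE languages (
        code      TEXT PRIMARY KEY,
        macrolang TEXT  -- NULL when the language has no macrolanguage
    );
    CREATE TABLE sentences (
        id   INTEGER PRIMARY KEY,
        lang TEXT REFERENCES languages(code),
        text TEXT
    );
""")
# "ara" points to itself because sentences are currently stored under it.
conn.executemany(
    "INSERT INTO languages (code, macrolang) VALUES (?, ?)",
    [("ara", "ara"), ("arz", "ara"), ("ary", "ara"), ("fra", None)],
)
conn.executemany(
    "INSERT INTO sentences (id, lang, text) VALUES (?, ?, ?)",
    [(1, "ara", "placeholder 1"), (2, "arz", "placeholder 2"),
     (3, "fra", "placeholder 3")],
)


def sentence_ids_in_macrolanguage(macro_code):
    """Return ids of sentences whose language belongs to macro_code."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM sentences AS s
        JOIN languages AS l ON l.code = s.lang
        WHERE l.macrolang = ?
        ORDER BY s.id
        """,
        (macro_code,),
    )
    return [row[0] for row in rows]
```

With this layout, browsing "ara" becomes a single join on the languages table, and sentences in the individual languages keep their own codes.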

trang commented 6 years ago

Note: we will have to figure out how to deal with Berber. From what I've checked, it is not considered as a macrolanguage but as a collection of languages. It turns out as well that "ber" is not an ISO 639-3 language code but an ISO 639-2/639-5 code.

jiru commented 6 years ago

Our best solution is probably to add a column macrolang in the sentences and/or languages table(s).

I don’t think we need to store this in the database, since macrolanguages consist of a well-known finite list. Hardcoding the list of languages in each macrolanguage seems good enough to me.

So macrolanguages would "just" be a way to search/browse through multiple languages at the same time, while it wouldn’t be possible to add sentences belonging to a macrolanguage. As long as we don’t start listing macrolanguages in stats pages, it doesn’t seem very difficult to implement to me.
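
A rough sketch of what the hardcoded approach could look like, in Python for brevity. The dictionary is deliberately partial and the function names are made up for this example; nothing here reflects existing Tatoeba code.

```python
# Partial, hand-written list for illustration; a real list would come
# from the ISO 639-3 macrolanguage mappings.
MACROLANGUAGES = {
    "ara": ["acm", "afb", "apc", "arq", "ary", "arz", "ayl"],
    "zza": ["diq", "kiu"],
}


def search_languages(code):
    """Return the language codes that a search on `code` should cover."""
    # A macrolanguage expands to its members; any other code stands alone.
    return MACROLANGUAGES.get(code, [code])


def can_add_sentence(code):
    """In this sketch, macrolanguages are search-only."""
    return code not in MACROLANGUAGES
```

Here `search_languages("ara")` returns the seven Arabic codes, while `can_add_sentence("ara")` is false, which matches the idea that a macrolanguage is only a way to browse several languages at once.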

A bigger problem I can see is to re-categorize sentences currently belonging to a macrolanguage into their respective "real" languages, which may not even have been added on Tatoeba yet.

Note that I recently raised a similar question about Albanian which is also a macrolanguage.

jiru commented 6 years ago

Note: we will have to figure out how to deal with Berber. From what I've checked, it is not considered as a macrolanguage but as a collection of languages. It turns out as well that "ber" is not an ISO 639-3 language code but an ISO 639-2/639-5 code.

According to the Library of Congress website:

Macrolanguages are distinguished from language collections in that the individual languages that correspond to a macrolanguage must be very closely related, and there must be some domain in which only a single language identity is recognized.

I cannot find a proper list of all the Berber languages, but it looks like there are many. Maybe it’s not clear which languages are part of the Berber family. Besides, one of them, Tuareg, is itself a macrolanguage.

trang commented 6 years ago

A bigger problem I can see is to re-categorize sentences currently belonging to a macrolanguage into their respective "real" languages, which may not even have been added on Tatoeba yet.

I actually had in mind that one could still add sentences to a macrolanguage.

We currently have the policy that any new language needs to have at least a few sentences. We don't have languages with zero sentences. Unless we change this policy, we will end up with people who want to be able to add their sentences to Arabic (ara) because we haven't yet added Algerian Saharan Arabic, for instance. Of course they could add their sentences as "unknown language", but why not allow them to add the sentences to Arabic in the meantime?

It's also possible that a contributor is confident that their sentence is part of a macrolanguage but not confident about choosing an individual language. Perhaps because they don't know which one to choose, or because they think the sentence could be used in all the individual languages.

I cannot find a proper list of all the Berber languages

This is my main concern. As opposed to Arabic, it seems we cannot rely on an official list for Berber. I cannot find Tuareg in the ISO 639-3 list by the way.

jiru commented 6 years ago

We currently have the policy that any new language needs to have at least a few sentences. We don't have languages with zero sentences. Unless we change this policy, we will end up with people who want to be able to add their sentences to Arabic (ara) because we haven't yet added Algerian Saharan Arabic, for instance. Of course they could add their sentences as "unknown language", but why not allow them to add the sentences to Arabic in the meantime?

I get your point. Such a thing could indeed be useful, but it looks like you’re trying to use "ara" in a way it wasn’t designed for. What you describe is not a macrolanguage. It’s a kind of catch-all language for Arabic, another "language" that should belong to the macrolanguage "ara", named something like "Arabic (other variants)". If we use Arabic (ara) for that purpose, I’m afraid it will lead to a lot of confusion among contributors and external users of the corpus (because it doesn’t really follow the ISO standard). Not to mention the UX problem of designing that in an understandable way: adding sentences to "ara" wouldn’t be the same thing as searching through "ara". This could also confuse the language autodetection mechanism for Arabic languages.

I cannot find a proper list of all the Berber languages

This is my main concern. As opposed to Arabic, it seems we cannot rely on an official list for Berber.

Maybe it’s not easy to say which languages belong to the Berber family. The Wikipedia article about Berber languages says that the scope of Berber is still debated.

I cannot find Tuareg in the ISO 639-3 list by the way.

I was referring to tmh. It looks like SIL names it Tamashek, while it’s linked to "Tuareg languages" on Wikipedia.

jiru commented 3 months ago

This feature was requested by email in August 2024 for Serbo-Croatian.

LBeaudoux commented 3 months ago

To better understand the extent of this issue, I analysed the languages currently supported by Tatoeba. The code used to generate the tables below is available here.

Classification of Tatoeba language codes

Tatoeba only accepts languages with a valid ISO 639-3 code. However, 6 languages do not meet this requirement for various reasons.

| Type | Tatoeba language codes |
| --- | --- |
| ISO 639-3 Deprecated | ajp, kzj, tpw |
| ISO 639-3 Individual | abk, abq, acm, ady, afb, afh, afr, aii, ain, akl, aln, alt, amh, ang, aoz, apc, arg, arn, arq, ary, arz, asm, ast, ava, avk, awa, ayl, bak, bam, ban, bar, bcl, bel, ben, bfz, bho, bis, bjn, bod, bom, bos, bre, brx, bul, bvy, bzt, cat, cay, cbk, ceb, ces, cha, che, chg, chn, cho, chr, chv, cjy, ckb, ckt, cmn, cmo, cor, cos, cpi, crh, crk, crs, csb, cym, cyo, dan, dar, deu, diq, div, dng, drt, dsb, dtp, dws, egl, ell, emx, eng, enm, epo, eus, evn, ewe, ext, fao, fij, fin, fkv, fra, frm, fro, frr, fry, fuc, fur, fuv, gaa, gag, gan, gbm, gcf, gil, gla, gle, glg, glv, gom, gos, got, grc, gsw, guc, guj, hak, hat, hau, haw, hax, hbo, hdn, heb, hif, hil, hin, hnj, hoc, hrv, hrx, hsb, hsn, hun, hye, hyw, iba, ibo, ido, igs, iii, ike, ile, ilo, ina, ind, inh, isl, ita, izh, jam, jav, jbo, jdt, jpa, jpn, kaa, kab, kal, kam, kan, kas, kat, kaz, kbd, kek, kha, khm, kin, kir, kiu, kjh, klj, kmr, knc, koi, kor, kpv, krc, krl, ksh, kum, kxi, laa, lad, lao, lat, lbe, ldn, lez, lfn, lij, lim, lin, lit, liv, lkt, lld, lmo, lou, ltg, ltz, lug, lut, lvs, lzh, lzz, mad, mah, mai, mal, mar, max, mdf, mfa, mfe, mgm, mhr, mic, mik, min, mkd, mlt, mnc, mni, mnr, mnw, moh, mri, mrj, mus, mvv, mwl, mww, mya, myv, nan, nap, nau, nav, nch, nds, new, ngt, ngu, niu, nld, nlv, nnb, nno, nob, nog, non, nov, npi, nst, nus, nya, nys, oar, oci, ofs, ood, orv, osp, oss, osx, ota, otk, pag, pal, pam, pan, pap, pau, pcd, pdc, pes, pfl, phn, pli, pms, pnb, pol, por, ppl, prg, quc, qxq, qya, rap, rel, rhg, rif, roh, ron, rue, run, rus, ryu, sag, sah, sat, scn, sco, sdh, sgs, shi, shs, shy, sin, sjn, skr, slk, slv, sma, sme, smo, sna, snd, som, sot, spa, srn, srp, ssw, stq, sun, sux, swc, swe, swg, swh, syc, syl, szl, tah, tam, tat, tel, tet, tgk, tgl, tha, thv, tig, tir, tkl, tlh, tly, tmr, tmw, toi, tok, ton, tpi, tsn, tso, tts, tuk, tum, tur, tvl, tyv, tzl, udm, uig, ukr, umb, urd, urh, vec, vep, vie, vol, vro, war, wln, wol, wuu, xal, xho, xmf, xqa, yor, yua, yue, zea, zgh, zlm, zsm, zul |
| ISO 639-3 Macrolanguage | ara, aym, aze, bal, bua, est, grn, mlg, mon, oji, ori, pus, que, rom, san, sqi, srd, uzb, yid, zza |
| ISO 639-5 | ber, nah |
| Not ISO 639 | cycl |
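
For readers who want to reproduce this kind of classification, here is a minimal Python sketch. It is not the code linked above. It assumes a tab-separated code table with `Id` and `Scope` columns and `I`/`M` scope values, modelled on SIL's downloadable ISO 639-3 table; the sample rows are a tiny hand-made excerpt and the function names are invented for the example.

```python
import csv
import io

# Tiny hand-made excerpt. The column names "Id" and "Scope" and the
# scope values I (individual) and M (macrolanguage) are assumptions
# modelled on SIL's code table.
SAMPLE_TABLE = "Id\tScope\nara\tM\narz\tI\nfra\tI\nzza\tM\n"

LABELS = {"I": "ISO 639-3 Individual", "M": "ISO 639-3 Macrolanguage"}


def load_scopes(tsv_text):
    """Map each ISO 639-3 code in the table to its scope letter."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Id"]: row["Scope"] for row in reader}


def classify(codes, scopes):
    """Group codes by type; unknown codes fall into an 'Other' bucket."""
    result = {}
    for code in codes:
        label = LABELS.get(scopes.get(code), "Other")
        result.setdefault(label, []).append(code)
    return result
```

Codes that end up in the "Other" bucket (deprecated, ISO 639-5, or non-ISO codes) would still need to be told apart by other means, for example with the list of retired codes.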

Supported macrolanguages

Tatoeba already supports 20 macrolanguages. But only Arabic and Zaza combine several individual languages.

| Macrolanguage | Individual languages |
| --- | --- |
| Albanian [sqi] | Gheg Albanian [aln] |
| Arabic [ara] | Mesopotamian Arabic [acm], Gulf Arabic [afb], Levantine Arabic [apc], Algerian Arabic [arq], Moroccan Arabic [ary], Egyptian Arabic [arz], Libyan Arabic [ayl] |
| Aymara [aym] | |
| Azerbaijani [aze] | |
| Baluchi [bal] | |
| Buriat [bua] | |
| Estonian [est] | Võro [vro] |
| Guarani [grn] | |
| Malagasy [mlg] | |
| Mongolian [mon] | |
| Ojibwa [oji] | |
| Oriya (macrolanguage) [ori] | |
| Pushto [pus] | |
| Quechua [que] | |
| Romany [rom] | |
| Sanskrit [san] | |
| Sardinian [srd] | |
| Uzbek [uzb] | |
| Yiddish [yid] | |
| Zaza [zza] | Dimli (individual language) [diq], Kirmanjki (individual language) [kiu] |

Missing macrolanguages

22 macrolanguages are missing, including 13 which would combine several individual languages.

| Macrolanguage | Individual languages |
| --- | --- |
| Bikol [bik] | Central Bikol [bcl] |
| Chinese [zho] | Jinyu Chinese [cjy], Mandarin Chinese [cmn], Gan Chinese [gan], Hakka Chinese [hak], Xiang Chinese [hsn], Literary Chinese [lzh], Min Nan Chinese [nan], Wu Chinese [wuu], Yue Chinese [yue] |
| Cree [cre] | Plains Cree [crk] |
| Fulah [ful] | Pulaar [fuc], Nigerian Fulfulde [fuv] |
| Haida [hai] | Southern Haida [hax], Northern Haida [hdn] |
| Hmong [hmn] | Hmong Njua [hnj], Hmong Daw [mww] |
| Inuktitut [iku] | Eastern Canadian Inuktitut [ike] |
| Kanuri [kau] | Central Kanuri [knc] |
| Komi [kom] | Komi-Permyak [koi], Komi-Zyrian [kpv] |
| Konkani (macrolanguage) [kok] | Goan Konkani [gom] |
| Kurdish [kur] | Central Kurdish [ckb], Northern Kurdish [kmr], Southern Kurdish [sdh] |
| Lahnda [lah] | Western Panjabi [pnb], Saraiki [skr] |
| Latvian [lav] | Latgalian [ltg], Standard Latvian [lvs] |
| Malay (macrolanguage) [msa] | Banjar [bjn], Indonesian [ind], North Moluccan Malay [max], Pattani Malay [mfa], Minangkabau [min], Temuan [tmw], Malay (individual language) [zlm], Standard Malay [zsm] |
| Mari (Russia) [chm] | Eastern Mari [mhr], Western Mari [mrj] |
| Nepali (macrolanguage) [nep] | Nepali (individual language) [npi] |
| Norwegian [nor] | Norwegian Nynorsk [nno], Norwegian Bokmål [nob] |
| Persian [fas] | Iranian Persian [pes] |
| Serbo-Croatian [hbs] | Bosnian [bos], Croatian [hrv], Serbian [srp] |
| Swahili (macrolanguage) [swa] | Congo Swahili [swc], Swahili (individual language) [swh] |
| Syriac [syr] | Assyrian Neo-Aramaic [aii] |
| Tamashek [tmh] | Tahaggart Tamahaq [thv] |

jiru commented 3 months ago

Thank you @LBeaudoux.

Tatoeba already supports 20 macrolanguages.

I think there is a confusion here.

Tatoeba does not support any single macrolanguage in the sense that there is no concept of grouping languages under a single macrolanguage ISO code. Instead, the macrolanguage codes currently in use on Tatoeba are all improperly used as if they were individual languages. They are the result of our lack of knowledge when they were added years ago. I guess we didn’t know exactly which code to pick and ended up misusing macrolanguage codes. (Either that, or SIL changed some individual language codes into macrolanguage codes after they were added to Tatoeba.)

LBeaudoux commented 3 months ago

I think there is a confusion here. Tatoeba does not support any single macrolanguage in the sense that there is no concept of grouping languages under a single macrolanguage ISO code.

Sorry for the ambiguous wording. My only aim was to visualise which of Tatoeba's current languages would be affected by the introduction of macrolanguages.

LBeaudoux commented 3 months ago

I've been thinking about this issue and would like to add the following comments.

I actually had in mind that one could still add sentences to a macrolanguage.

I agree that this is a better option. Being able to add sentences in a macrolanguage has more advantages than disadvantages. I also think we should continue to add new macrolanguages when users request them.

UX problems about designing that in an understandable way: adding sentences to "ara" wouldn’t be the same thing as searching through "ara".

Initially, this new search feature should only be available to power users. This could take the form of an opt-in checkbox in the user settings that says Show search results from mutually intelligible languages.

The languages displayed by the search language selectors would not change when the option is enabled, but the selected language would point to a list of languages that includes the corresponding macrolanguage along with all its individual languages. This list of languages would then be included in the request sent to Manticore.
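
As a rough Python sketch of that expansion step: the membership data is partial, and the function and parameter names are hypothetical. How the resulting list is passed to Manticore is left out on purpose.

```python
# Partial membership data for illustration only.
MEMBERS_BY_MACRO = {
    "ara": ["acm", "afb", "apc", "arq", "ary", "arz", "ayl"],
    "zza": ["diq", "kiu"],
}


def languages_for_search(selected, expand_enabled,
                         members_by_macro=MEMBERS_BY_MACRO):
    """Return the list of language codes to search for a selected language.

    `expand_enabled` stands for the hypothetical opt-in user setting.
    """
    if expand_enabled:
        for macro, members in members_by_macro.items():
            # Both the macrolanguage and any of its members expand
            # to the same list: the macrolanguage plus all members.
            if selected == macro or selected in members:
                return [macro, *members]
    return [selected]
```

With the option enabled, selecting either "arz" or "ara" yields the same list starting with "ara"; with the option disabled, or for a language outside any macrolanguage, the list contains only the selected code.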

As opposed to Arabic, it seems we cannot rely on an official list for Berber.

As the Berber [ber] and Nahuatl [nah] ISO 639-5 language groups have no corresponding ISO 639-3 macrolanguage code, they may not consist only of mutually intelligible languages. Therefore, I don't think they should be included in the scope of this issue.

I don’t think we need to store this in the database, since macrolanguages consist of a well-known finite list. Hardcoding the list of languages in each macrolanguage seems good enough to me.

I have created a JSON file that maps each ISO 639-3 language code to the language codes that share the same macrolanguage. The Python code used to generate it is available here.
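
To give an idea of the shape such a mapping could take, here is a minimal Python sketch. It is not the linked generator: the input dictionary is a hand-written sample, and including the macrolanguage code and the language itself in each group is an assumption.

```python
import json


def build_sibling_map(members_by_macro):
    """Map every code to the sorted list of codes in its macrolanguage group.

    Assumption: each group contains the macrolanguage code itself
    as well as all of its individual languages.
    """
    sibling_map = {}
    for macro, members in members_by_macro.items():
        group = sorted([macro, *members])
        for code in group:
            sibling_map[code] = group
    return sibling_map


# Serialise a sample mapping the way it could be stored in a JSON file.
sample_map = build_sibling_map({"zza": ["diq", "kiu"]})
sample_json = json.dumps(sample_map, indent=2)
```

A lookup such as `sample_map["diq"]` then gives every code to include when searching from Dimli, without any database access.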