Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Rename languages to match ISO 639-3 names #1670

Open sabretou opened 6 years ago

sabretou commented 6 years ago

As we are introducing the new language selector, perhaps we should rename languages to match their ISO 639-3 names. Some of the current names use parentheses or alternate names that were added earlier for easier discovery.

Here are my suggestions:

Language Code -> Current Name -> Proposed Name

cmn -> Chinese (Mandarin) -> Mandarin Chinese
nob -> Norwegian (Bokmål) -> Norwegian Bokmål
nno -> Norwegian (Nynorsk) -> Norwegian Nynorsk
nst -> Naga (Tangshang) -> Tase Naga
pan -> Punjabi (Eastern) -> Punjabi (Punjabi is by far the more popular spelling variant, so I recommend going with that. Alternately, we could add 'Panjabi' in parentheses).
zsm -> Malay -> Standard Malay
mww -> Hmong Daw (White) -> Hmong Daw
afb -> Arabic (Gulf) -> Gulf Arabic
pnb -> Punjabi (Western) -> Western Punjabi (I propose 'Punjabi' over 'Panjabi' for the same reason as above)
aln -> Albanian (Gheg) -> Gheg Albanian
jdt -> Juhuri (Judeo-Tat) -> Judeo-Tat
cjy -> Chinese (Jin) -> Jinyu Chinese
hnj -> Hmong Njua (Green) -> Hmong Njua
bcl -> Bikol (Central) -> Central Bikol
pfl -> Palatine German -> Pfaelzisch
orv -> Old East Slavic -> Old Russian
prg -> Old Prussian -> Prussian
cmo -> Mnong, Central -> Central Mnong
acm -> Iraqi Arabic -> Mesopotamian Arabic
jam -> Jamaican Patois -> Jamaican Creole English
mhr -> Meadow Mari -> Eastern Mari
mrj -> Hill Mari -> Western Mari
dtp -> Central Dusun -> Kadazan Dusun
wuu -> Shanghainese -> Wu Chinese
yue -> Cantonese -> Yue Chinese
pes -> Persian -> Iranian Persian
ell -> Greek -> Modern Greek
pms -> Piedmontese -> Piemontese
tpw -> Old Tupi -> Tupí

I propose zlm -> Malay (Vernacular) stay as it is. In ISO 639-3, it is listed as "Malay (individual language)", which could be confusing.

Similarly, I think kek -> Kekchi (Q'eqchi') should remain as-is for visibility.

ori -> Odia (Oriya) is another special case that I think should stay.

jiru commented 6 years ago

I wrote a script that compares CLDR’s language names against Tatoeba’s and prints the differences. Note that CLDR has alternate names in addition to the "normal" name.

ISO3 ISO1       Tatoeba's name               CLDR's name (alternative naming)
-----------------------------------------------------------------------------
 abk   ab               Abkhaz                 Abkhazian
 aln  aln      Albanian (Gheg)             Gheg Albanian
 aze   az          Azerbaijani                     Azeri (short)
 ben   bn              Bengali                    Bangla
 bua  bua               Buryat                    Buriat
 crh  crh        Crimean Tatar           Crimean Turkish
 crs  crs   Seychellois Creole     Seselwa Creole French
 fry   fy              Frisian           Western Frisian
 ilo  ilo              Ilocano                     Iloko
 jam  jam      Jamaican Patois   Jamaican Creole English
 kaa  kaa           Karakalpak               Kara-Kalpak
 kal   kl          Greenlandic               Kalaallisut
 ksh  ksh               Kölsch                 Colognian
 kir   ky               Kyrgyz                   Kirghiz (variant)
 lug   lg              Luganda                     Ganda
 mrj  mrj            Hill Mari              Western Mari
 mya   my              Burmese          Myanmar Language (variant)
 nau   na              Nauruan                     Nauru
 nob   nb   Norwegian (Bokmål)          Norwegian Bokmål
 nds  nds            Low Saxon                Low German
 nya   ny            Chinyanja                    Nyanja
 oji   oj               Ojibwe                    Ojibwa
 ori   or         Odia (Oriya)                      Odia
 oss   os             Ossetian                   Ossetic
 pan   pa    Punjabi (Eastern)                   Punjabi
 pam  pam          Kapampangan                  Pampanga
 prg  prg         Old Prussian                  Prussian
 pus   ps               Pashto                    Pushto (variant)
 quc  quc              K'iche'                   Kʼicheʼ
 rif  rif              Tarifit                   Riffian
 rom  rom               Romani                    Romany
 sah  sah                Yakut                     Sakha
 ssw   ss                Swazi                     Swati
 tet  tet                Tetun                     Tetum
 tkl  tkl            Tokelauan                   Tokelau
 tsn   tn             Setswana                    Tswana
 tvl  tvl             Tuvaluan                    Tuvalu
 uig   ug               Uyghur                    Uighur (variant)
 wuu  wuu         Shanghainese                Wu Chinese
 cmn   zh   Chinese (Mandarin)                   Chinese
 cmn   zh   Chinese (Mandarin)          Mandarin Chinese (long)

Hope this helps.
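
A comparison like the one described above can be sketched roughly as follows. This is not the actual script: the function name `diff_language_names` and the two small dictionaries are made up for illustration, with sample names copied from the table (`fra` is added as a hypothetical matching entry to show that identical names are not reported). A real version would load the names from CLDR data and from Tatoeba's language list.

```python
def diff_language_names(tatoeba_names, reference_names):
    """Return (code, tatoeba_name, reference_name) for every code whose names differ."""
    differences = []
    for code in sorted(tatoeba_names):
        reference = reference_names.get(code)
        # Skip codes the reference does not cover; only report real mismatches.
        if reference is not None and reference != tatoeba_names[code]:
            differences.append((code, tatoeba_names[code], reference))
    return differences


# Tiny hand-copied samples, not the full data sets (assumption for illustration).
tatoeba_names = {
    "abk": "Abkhaz",
    "fra": "French",
    "fry": "Frisian",
    "wuu": "Shanghainese",
}
cldr_names = {
    "abk": "Abkhazian",
    "fra": "French",
    "fry": "Western Frisian",
    "wuu": "Wu Chinese",
}

# Print one right-aligned row per differing language.
for code, ours, theirs in diff_language_names(tatoeba_names, cldr_names):
    print(f"{code:>4} {ours:>20} {theirs:>25}")
```

Codes missing from the reference are skipped rather than reported, on the assumption that a missing entry is a coverage gap and not a naming disagreement.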

trang commented 5 years ago

@cueyayotl I'll let you check and confirm the renaming suggested. I can imagine we won't have a clear answer for all the languages, so it would be nice to at least start with a list of languages we're confident to rename. We don't have to rename everything at once. For the more problematic ones, we can go step by step.

RyckRichards commented 5 years ago

@cueyayotl any workflow you'd suggest?

RyckRichards commented 5 years ago

@sabretou You're now the one in charge of validating language requests on Tatoeba. Perhaps you might want to have a look at this :)

sabretou commented 5 years ago

Let's go ahead with the first batch of renamings. I have cleared the following for renaming.

cmn -> Chinese (Mandarin) -> Mandarin Chinese
nob -> Norwegian (Bokmål) -> Norwegian Bokmål
nno -> Norwegian (Nynorsk) -> Norwegian Nynorsk
afb -> Arabic (Gulf) -> Gulf Arabic
aln -> Albanian (Gheg) -> Gheg Albanian
bcl -> Bikol (Central) -> Central Bikol
cmo -> Mnong, Central -> Central Mnong
dtp -> Central Dusun -> Kadazan Dusun
wuu -> Shanghainese -> Wu Chinese
yue -> Cantonese -> Yue Chinese
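
A batch like this could be applied with a guard on the current name, so that an outdated list never overwrites a name that has since changed. The sketch below is only an illustration and not Tatoeba's code: `apply_renames` and the sample dictionaries are hypothetical.

```python
def apply_renames(names, renames):
    """Return (updated_names, skipped_codes) after applying (code, old, new) renames.

    A rename is skipped when the code is unknown or its current name is not
    the expected old name, so the input mapping is never silently clobbered.
    """
    updated = dict(names)  # work on a copy; the caller's mapping stays untouched
    skipped = []
    for code, old, new in renames:
        if updated.get(code) == old:
            updated[code] = new
        else:
            skipped.append(code)
    return updated, skipped


# Hypothetical subset of the current names.
current = {
    "cmn": "Chinese (Mandarin)",
    "nob": "Norwegian (Bokmål)",
    "yue": "Cantonese",
}
batch = [
    ("cmn", "Chinese (Mandarin)", "Mandarin Chinese"),
    ("nob", "Norwegian (Bokmål)", "Norwegian Bokmål"),
    ("afb", "Arabic (Gulf)", "Gulf Arabic"),  # not in `current`, so it is skipped
]

updated, skipped = apply_renames(current, batch)
print(updated)
print("skipped:", skipped)
```

Keeping each rename as an explicit (code, old, new) triple also makes it easy to hold back contested entries and apply them in a later batch.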

RyckRichards commented 5 years ago

Sure

trang commented 5 years ago

@sabretou I'm wondering if renaming Cantonese to Yue Chinese will not confuse our users. Looking at some of the comments of nickyeow, our main contributor in yue, he refers to the language as Cantonese. I, myself, am not very familiar with the name "Yue Chinese", while I'm more familiar with Cantonese. If I was not involved in Tatoeba, with this name change I would actually think for a moment that Cantonese has been removed from the supported languages.

I have similar concerns for Shanghainese and, to a certain extent, Central Dusun, as we would be replacing the original names with entirely new words.

Actually for Shanghainese, I know there is a comment in our code saying:

// TODO to change when shanghainese will not be the only wu dialect

Meaning that we used wuu as a code for Shanghainese knowing that wuu encapsulates more than just Shanghainese. But I think that changing the name today may not be that easy, because it's been there since 2009.

I suggest we try to contact members of Tatoeba who contribute in dtp, wuu and yue, or just active members who list those languages in their profile, to get their opinion about the name change.

jiru commented 5 years ago

According to Wikipedia, the "yue" iso code stands for Yue Chinese, which encompasses Cantonese as well as other varieties.

While the term Cantonese specifically refers to the prestige variety, it is often used in a broader sense for the entire Yue subgroup of Chinese, including related but largely mutually unintelligible languages and dialects such as Taishanese.

It’s a complex matter. If we want to make it easy to understand, we should use the word "Cantonese", but then we won’t ever have contributors of other Yue dialects such as Taishanese. If we want to follow the ISO standard, we should use "Yue Chinese" and include other dialects, like Taishanese, under that code. However, these dialects are mutually unintelligible, so it makes little sense for contributors to group them under the same language on Tatoeba.

Note that since we’ve been using the name Cantonese on Tatoeba, it’s likely that we only have contributors of Cantonese, and not other Yue dialects.

jiru commented 4 years ago

Quoting Wikipedia about Wu Chinese:

Shanghainese (simplified Chinese: 上海话/上海闲话; traditional Chinese: 上海話/上海閒話; pinyin: Shànghǎihuà/Shànghǎi xiánhuà): is also a very common name, used because Shanghai is the most well-known city in the Wu-speaking region, and most people are unfamiliar with the term Wu Chinese. The use of the term Shanghainese for referring to the family is more typically used outside of China and in simplified introductions to the areas where it is spoken or to other similar topics, for example one might encounter sentences like "They speak a kind of Shanghainese in Ningbo." The term Shanghainese is never used by serious linguists to refer to anything but the variety used in Shanghai.

However, looking at the Shanghainese article:

Shanghainese belongs to the Taihu Wu subgroup, and contains vocabulary and expressions from the entire Taihu Wu area of southern Jiangsu and northern Zhejiang. With nearly 14 million speakers, Shanghainese is also the largest single form of Wu Chinese. It serves as the lingua franca of the entire Yangtze River Delta region.

So we should figure out whether sentences currently belonging to our Shanghainese corpus are all Shanghainese dialect of Taihu Wu, or also include other Wu languages.

It is worth noting that this year, there has been a proposal about splitting Wu Chinese, which is still under review by the SIL. If that proposal is accepted, it would result in the creation of Taihu Wu Chinese (among others). That would certainly help sort out our wuu corpus and solve the naming issue.

jiru commented 4 years ago

As for Central Dusun, the SIL changed that name to Kadazan Dusun in 2016 as part of a merge. According to the proposal, the new name better matches what the speakers call their own language and it encompasses more dialects, so it’s probably safe to rename.

jiru commented 4 years ago

It is worth noting that this year, there has been a proposal about splitting Wu Chinese

The proposal has been rejected.

RyckRichards commented 3 years ago

Should I work on Phase 2? Related: #936

LBeaudoux commented 4 weeks ago

Here is an updated list of Tatoeba language names that differ from their standard ISO 639-3 names.

Code  Tatoeba language name        ISO 639-3 language name
abk   Abkhaz                       Abkhazian
acm   Iraqi Arabic                 Mesopotamian Arabic
ain   Ainu                         Ainu (Japan)
ang   Old English                  Old English (ca. 450-1100)
apc   North Levantine Arabic       Levantine Arabic
arn   Mapuche                      Mapudungun
ava   Avar                         Avaric
brx   Bodo                         Bodo (India)
bua   Buryat                       Buriat
chn   Chinook Jargon               Chinook jargon
cjy   Jin Chinese                  Jinyu Chinese
ckb   Central Kurdish (Soranî)     Central Kurdish
ckt   Chukchi                      Chukot
crs   Seychellois Creole           Seselwa Creole French
diq   Southern Zaza (Dimli)        Dimli (individual language)
dtp   Central Dusun                Kadazan Dusun
ell   Greek                        Modern Greek (1453-)
enm   Middle English               Middle English (1100-1500)
frm   Middle French                Middle French (ca. 1400-1600)
fro   Old French                   Old French (842-ca. 1400)
frr   North Frisian                Northern Frisian
fry   Frisian                      Western Frisian
gom   Konkani (Goan)               Goan Konkani
grc   Ancient Greek                Ancient Greek (to 1453)
hat   Haitian Creole               Haitian
hnj   Hmong Njua (Green)           Hmong Njua
hye   Eastern Armenian             Armenian
iii   Nuosu                        Sichuan Yi
ike   Inuktitut                    Eastern Canadian Inuktitut
ilo   Ilocano                      Iloko
ina   Interlingua                  Interlingua (International Auxiliary Language Association)
jam   Jamaican Patois              Jamaican Creole English
jdt   Juhuri (Judeo-Tat)           Judeo-Tat
kaa   Karakalpak                   Kara-Kalpak
kal   Greenlandic                  Kalaallisut
kam   Kamba                        Kamba (Kenya)
kek   Kekchi (Q'eqchi')            Kekchí
kir   Kyrgyz                       Kirghiz
kiu   Northern Zaza (Kirmanjki)    Kirmanjki (individual language)
kmr   Northern Kurdish (Kurmancî)  Northern Kurdish
lez   Lezgi                        Lezghian
lim   Limburgish                   Limburgan
liv   Livonian                     Liv
lug   Luganda                      Ganda
lvs   Latvian                      Standard Latvian
mfa   Kelantan-Pattani Malay       Pattani Malay
mhr   Meadow Mari                  Eastern Mari
mik   Hitchiti                     Mikasuki
mni   Meitei                       Manipuri
mrj   Hill Mari                    Western Mari
mus   Muskogee (Creek)             Creek
mww   Hmong Daw (White)            Hmong Daw
nau   Nauruan                      Nauru
nds   Low German (Low Saxon)       Low German
ngt   Ngeq                         Kriang
npi   Nepali                       Nepali (individual language)
nst   Naga (Tangshang)             Tase Naga
nya   Chinyanja                    Nyanja
oar   Old Aramaic                  Old Aramaic (up to 700 BCE)
oci   Occitan                      Occitan (post 1500)
oji   Ojibwe                       Ojibwa
ood   O'odham                      Tohono O'odham
ori   Odia (Oriya)                 Oriya (macrolanguage)
orv   Old East Slavic              Old Russian
ota   Ottoman Turkish              Ottoman Turkish (1500-1928)
pal   Middle Persian (Pahlavi)     Pahlavi
pam   Kapampangan                  Pampanga
pan   Punjabi (Eastern)            Panjabi
pes   Persian                      Iranian Persian
pfl   Palatine German              Pfaelzisch
pms   Piedmontese                  Piemontese
pnb   Punjabi (Western)            Western Panjabi
prg   Old Prussian                 Prussian
pus   Pashto                       Pushto
qxq   Qashqai                      Qashqa'i
rap   Rapa Nui                     Rapanui
rom   Romani                       Romany
run   Kirundi                      Rundi
ryu   Okinawan                     Central Okinawan
shi   Tashelhit                    Tachelhit
ssw   Swazi                        Swati
stq   Saterland Frisian            Saterfriesisch
swh   Swahili                      Swahili (individual language)
syc   Syriac                       Classical Syriac
tet   Tetun                        Tetum
tkl   Tokelauan                    Tokelau
tmr   Jewish Babylonian Aramaic    Jewish Babylonian Aramaic (ca. 200-1200 CE)
toi   Tonga (Zambezi)              Tonga (Zambia)
ton   Tongan                       Tonga (Tonga Islands)
tsn   Setswana                     Tswana
tts   Isan                         Northeastern Thai
tvl   Tuvaluan                     Tuvalu
uig   Uyghur                       Uighur
war   Waray                        Waray (Philippines)
wuu   Shanghainese                 Wu Chinese
yua   Yucatec Maya                 Yucateco
yue   Cantonese                    Yue Chinese
zea   Zeelandic                    Zeeuws
zlm   Malay (Vernacular)           Malay (individual language)
zsm   Malay                        Standard Malay