Open sabretou opened 6 years ago
I wrote a script that compares CLDR’s language names against Tatoeba’s and prints the differences. Note that CLDR has alternate namings in addition to the "normal" name.
ISO3 | ISO1 | Tatoeba's name | CLDR's name (alternative naming) |
---|---|---|---|
abk | ab | Abkhaz | Abkhazian |
aln | aln | Albanian (Gheg) | Gheg Albanian |
aze | az | Azerbaijani | Azeri (short) |
ben | bn | Bengali | Bangla |
bua | bua | Buryat | Buriat |
crh | crh | Crimean Tatar | Crimean Turkish |
crs | crs | Seychellois Creole | Seselwa Creole French |
fry | fy | Frisian | Western Frisian |
ilo | ilo | Ilocano | Iloko |
jam | jam | Jamaican Patois | Jamaican Creole English |
kaa | kaa | Karakalpak | Kara-Kalpak |
kal | kl | Greenlandic | Kalaallisut |
ksh | ksh | Kölsch | Colognian |
kir | ky | Kyrgyz | Kirghiz (variant) |
lug | lg | Luganda | Ganda |
mrj | mrj | Hill Mari | Western Mari |
mya | my | Burmese | Myanmar Language (variant) |
nau | na | Nauruan | Nauru |
nob | nb | Norwegian (Bokmål) | Norwegian Bokmål |
nds | nds | Low Saxon | Low German |
nya | ny | Chinyanja | Nyanja |
oji | oj | Ojibwe | Ojibwa |
ori | or | Odia (Oriya) | Odia |
oss | os | Ossetian | Ossetic |
pan | pa | Punjabi (Eastern) | Punjabi |
pam | pam | Kapampangan | Pampanga |
prg | prg | Old Prussian | Prussian |
pus | ps | Pashto | Pushto (variant) |
quc | quc | K'iche' | Kʼicheʼ |
rif | rif | Tarifit | Riffian |
rom | rom | Romani | Romany |
sah | sah | Yakut | Sakha |
ssw | ss | Swazi | Swati |
tet | tet | Tetun | Tetum |
tkl | tkl | Tokelauan | Tokelau |
tsn | tn | Setswana | Tswana |
tvl | tvl | Tuvaluan | Tuvalu |
uig | ug | Uyghur | Uighur (variant) |
wuu | wuu | Shanghainese | Wu Chinese |
cmn | zh | Chinese (Mandarin) | Chinese |
cmn | zh | Chinese (Mandarin) | Mandarin Chinese (long) |
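The script itself is not shown here, so the following is only a minimal Python sketch of how such a comparison could work. It assumes both name sets are already loaded into dictionaries keyed by ISO 639-3 code; the function name `diff_language_names`, the data layout, and the sample entries are hypothetical. It follows one plausible reading of the output above: every CLDR name (normal or alternate) that differs from Tatoeba's name produces a row.

```python
def diff_language_names(tatoeba, cldr):
    """Return (code, tatoeba_name, cldr_name, kind) rows where names differ.

    tatoeba: dict mapping ISO 639-3 code -> Tatoeba name
    cldr:    dict mapping ISO 639-3 code -> {kind: name}, where kind is ""
             for the normal name or a label such as "short" or "variant"
    (Both layouts are assumptions made for this sketch.)
    """
    rows = []
    for code in sorted(tatoeba):
        # Codes missing from CLDR are skipped rather than reported.
        for kind, cldr_name in cldr.get(code, {}).items():
            if cldr_name != tatoeba[code]:
                rows.append((code, tatoeba[code], cldr_name, kind))
    return rows


# Hypothetical sample data; "abk" and "kir" mirror rows of the list above,
# "fra" is an illustrative entry whose names match.
tatoeba_names = {"abk": "Abkhaz", "kir": "Kyrgyz", "fra": "French"}
cldr_names = {
    "abk": {"": "Abkhazian"},
    "kir": {"": "Kyrgyz", "variant": "Kirghiz"},
    "fra": {"": "French"},
}

differences = diff_language_names(tatoeba_names, cldr_names)
for code, ours, theirs, kind in differences:
    suffix = f" ({kind})" if kind else ""
    print(f"{code}\t{ours}\t{theirs}{suffix}")
```

Keeping the alternate kind in each row makes it easy to see whether a mismatch comes from the normal CLDR name or only from a short, long, or variant form.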
Hope this helps.
@cueyayotl I'll let you check and confirm the suggested renamings. I imagine we won't have a clear answer for every language, so it would be nice to at least start with a list of languages we're confident about renaming. We don't have to rename everything at once. For the more problematic ones, we can go step by step.
@cueyayotl any workflow you'd suggest?
@sabretou You're now the one in charge of validating language requests on Tatoeba. Perhaps you might want to have a look at this :)
Let's go ahead with the first batch of renamings. I have cleared the following for renaming.
cmn -> Chinese (Mandarin) -> Mandarin Chinese
nob -> Norwegian (Bokmål) -> Norwegian Bokmål
nno -> Norwegian (Nynorsk) -> Norwegian Nynorsk
afb -> Arabic (Gulf) -> Gulf Arabic
aln -> Albanian (Gheg) -> Gheg Albanian
bcl -> Bikol (Central) -> Central Bikol
cmo -> Mnong, Central -> Central Mnong
dtp -> Central Dusun -> Kadazan Dusun
wuu -> Shanghainese -> Wu Chinese
yue -> Cantonese -> Yue Chinese
Sure
@sabretou I'm wondering whether renaming Cantonese to Yue Chinese might confuse our users. Looking at some of the comments by nickyeow, our main contributor in yue, he refers to the language as Cantonese. I myself am not very familiar with the name "Yue Chinese", while I am familiar with "Cantonese". If I were not involved in Tatoeba, this name change would actually make me think for a moment that Cantonese had been removed from the supported languages.
I have similar concerns for Shanghainese and, to a certain extent, Central Dusun, as we would be replacing the original names with unfamiliar ones.
Actually for Shanghainese, I know there is a comment in our code saying:
// TODO to change when shanghainese will not be the only wu dialect
This means that we used wuu as the code for Shanghainese, knowing that wuu covers more than just Shanghainese. But I think that changing the name today may not be that easy, because it's been there since 2009.
I suggest we try to contact members of Tatoeba who contribute in dtp, wuu and yue, or just active members who have those languages listed in their profile, to get their opinion about the name change.
According to Wikipedia, the "yue" ISO code stands for Yue Chinese, which encompasses Cantonese as well as other varieties.
While the term Cantonese specifically refers to the prestige variety, it is often used in a broader sense for the entire Yue subgroup of Chinese, including related but largely mutually unintelligible languages and dialects such as Taishanese.
It’s a complex matter. If we want to make it easy to understand, we should use the word "Cantonese", but then we won’t ever have contributors of other Yue dialects such as Taishanese. If we want to follow the ISO standard, we should use "Yue Chinese" and include other dialects under that code, like Taishanese. However, these dialects are mutually unintelligible, so it makes little sense for contributors to group them under the same language on Tatoeba.
Note that since we’ve been using the name Cantonese on Tatoeba, it’s likely that we only have contributors of Cantonese, and not other Yue dialects.
Quoting Wikipedia about Wu Chinese:
Shanghainese (simplified Chinese: 上海话/上海闲话; traditional Chinese: 上海話/上海閒話; pinyin: Shànghǎihuà/Shànghǎi xiánhuà): is also a very common name, used because Shanghai is the most well-known city in the Wu-speaking region, and most people are unfamiliar with the term Wu Chinese. The use of the term Shanghainese for referring to the family is more typically used outside of China and in simplified introductions to the areas where it is spoken or to other similar topics, for example one might encounter sentences like "They speak a kind of Shanghainese in Ningbo." The term Shanghainese is never used by serious linguists to refer to anything but the variety used in Shanghai.
However, looking at the Shanghainese article:
Shanghainese belongs to the Taihu Wu subgroup, and contains vocabulary and expressions from the entire Taihu Wu area of southern Jiangsu and northern Zhejiang. With nearly 14 million speakers, Shanghainese is also the largest single form of Wu Chinese. It serves as the lingua franca of the entire Yangtze River Delta region.
So we should figure out whether the sentences currently in our Shanghainese corpus are all in the Shanghainese dialect of Taihu Wu, or whether some are in other Wu languages.
It is worth noting that this year, there has been a proposal about splitting Wu Chinese, which is still under review by the SIL. If that proposal is accepted, it would result in the creation of Taihu Wu Chinese (among others). That would certainly help sort out our wuu corpus and solve the naming issue.
As for Central Dusun, the SIL changed that name to Kadazan Dusun in 2016 as part of a merge. According to the proposal, the new name better matches what the speakers call their own language and encompasses more dialects, so it’s probably safe to rename.
> It is worth noting that this year, there has been a proposal about splitting Wu Chinese
The proposal has been rejected.
Should I work on Phase 2? Related: #936
Here is an updated list of Tatoeba language names that differ from their standard ISO 639-3 names.
ISO 639-3 | Tatoeba language name | ISO 639-3 language name |
---|---|---|
abk | Abkhaz | Abkhazian |
acm | Iraqi Arabic | Mesopotamian Arabic |
ain | Ainu | Ainu (Japan) |
ang | Old English | Old English (ca. 450-1100) |
apc | North Levantine Arabic | Levantine Arabic |
arn | Mapuche | Mapudungun |
ava | Avar | Avaric |
brx | Bodo | Bodo (India) |
bua | Buryat | Buriat |
chn | Chinook Jargon | Chinook jargon |
cjy | Jin Chinese | Jinyu Chinese |
ckb | Central Kurdish (Soranî) | Central Kurdish |
ckt | Chukchi | Chukot |
crs | Seychellois Creole | Seselwa Creole French |
diq | Southern Zaza (Dimli) | Dimli (individual language) |
dtp | Central Dusun | Kadazan Dusun |
ell | Greek | Modern Greek (1453-) |
enm | Middle English | Middle English (1100-1500) |
frm | Middle French | Middle French (ca. 1400-1600) |
fro | Old French | Old French (842-ca. 1400) |
frr | North Frisian | Northern Frisian |
fry | Frisian | Western Frisian |
gom | Konkani (Goan) | Goan Konkani |
grc | Ancient Greek | Ancient Greek (to 1453) |
hat | Haitian Creole | Haitian |
hnj | Hmong Njua (Green) | Hmong Njua |
hye | Eastern Armenian | Armenian |
iii | Nuosu | Sichuan Yi |
ike | Inuktitut | Eastern Canadian Inuktitut |
ilo | Ilocano | Iloko |
ina | Interlingua | Interlingua (International Auxiliary Language Association) |
jam | Jamaican Patois | Jamaican Creole English |
jdt | Juhuri (Judeo-Tat) | Judeo-Tat |
kaa | Karakalpak | Kara-Kalpak |
kal | Greenlandic | Kalaallisut |
kam | Kamba | Kamba (Kenya) |
kek | Kekchi (Q'eqchi') | Kekchí |
kir | Kyrgyz | Kirghiz |
kiu | Northern Zaza (Kirmanjki) | Kirmanjki (individual language) |
kmr | Northern Kurdish (Kurmancî) | Northern Kurdish |
lez | Lezgi | Lezghian |
lim | Limburgish | Limburgan |
liv | Livonian | Liv |
lug | Luganda | Ganda |
lvs | Latvian | Standard Latvian |
mfa | Kelantan-Pattani Malay | Pattani Malay |
mhr | Meadow Mari | Eastern Mari |
mik | Hitchiti | Mikasuki |
mni | Meitei | Manipuri |
mrj | Hill Mari | Western Mari |
mus | Muskogee (Creek) | Creek |
mww | Hmong Daw (White) | Hmong Daw |
nau | Nauruan | Nauru |
nds | Low German (Low Saxon) | Low German |
ngt | Ngeq | Kriang |
npi | Nepali | Nepali (individual language) |
nst | Naga (Tangshang) | Tase Naga |
nya | Chinyanja | Nyanja |
oar | Old Aramaic | Old Aramaic (up to 700 BCE) |
oci | Occitan | Occitan (post 1500) |
oji | Ojibwe | Ojibwa |
ood | O'odham | Tohono O'odham |
ori | Odia (Oriya) | Oriya (macrolanguage) |
orv | Old East Slavic | Old Russian |
ota | Ottoman Turkish | Ottoman Turkish (1500-1928) |
pal | Middle Persian (Pahlavi) | Pahlavi |
pam | Kapampangan | Pampanga |
pan | Punjabi (Eastern) | Panjabi |
pes | Persian | Iranian Persian |
pfl | Palatine German | Pfaelzisch |
pms | Piedmontese | Piemontese |
pnb | Punjabi (Western) | Western Panjabi |
prg | Old Prussian | Prussian |
pus | Pashto | Pushto |
qxq | Qashqai | Qashqa'i |
rap | Rapa Nui | Rapanui |
rom | Romani | Romany |
run | Kirundi | Rundi |
ryu | Okinawan | Central Okinawan |
shi | Tashelhit | Tachelhit |
ssw | Swazi | Swati |
stq | Saterland Frisian | Saterfriesisch |
swh | Swahili | Swahili (individual language) |
syc | Syriac | Classical Syriac |
tet | Tetun | Tetum |
tkl | Tokelauan | Tokelau |
tmr | Jewish Babylonian Aramaic | Jewish Babylonian Aramaic (ca. 200-1200 CE) |
toi | Tonga (Zambezi) | Tonga (Zambia) |
ton | Tongan | Tonga (Tonga Islands) |
tsn | Setswana | Tswana |
tts | Isan | Northeastern Thai |
tvl | Tuvaluan | Tuvalu |
uig | Uyghur | Uighur |
war | Waray | Waray (Philippines) |
wuu | Shanghainese | Wu Chinese |
yua | Yucatec Maya | Yucateco |
yue | Cantonese | Yue Chinese |
zea | Zeelandic | Zeeuws |
zlm | Malay (Vernacular) | Malay (individual language) |
zsm | Malay | Standard Malay |
As we are introducing the new language selector, perhaps we should make language names match their ISO 639-3 names. Some of the current names use parentheses or alternate names only because that made the languages easier to find earlier.
Here are my suggestions:
Language Code -> Current Name -> Proposed Name
cmn -> Chinese (Mandarin) -> Mandarin Chinese
nob -> Norwegian (Bokmål) -> Norwegian Bokmål
nno -> Norwegian (Nynorsk) -> Norwegian Nynorsk
nst -> Naga (Tangshang) -> Tase Naga
pan -> Punjabi (Eastern) -> Punjabi (Punjabi is by far the more popular spelling variant, so I recommend going with that. Alternately, we could add 'Panjabi' in parentheses).
zsm -> Malay -> Standard Malay
mww -> Hmong Daw (White) -> Hmong Daw
afb -> Arabic (Gulf) -> Gulf Arabic
pnb -> Punjabi (Western) -> Western Punjabi (I propose 'Punjabi' over 'Panjabi' for the same reason as above)
aln -> Albanian (Gheg) -> Gheg Albanian
jdt -> Juhuri (Judeo-Tat) -> Judeo-Tat
cjy -> Chinese (Jin) -> Jinyu Chinese
hnj -> Hmong Njua (Green) -> Hmong Njua
bcl -> Bikol (Central) -> Central Bikol
pfl -> Palatine German -> Pfaelzisch
orv -> Old East Slavic -> Old Russian
prg -> Old Prussian -> Prussian
cmo -> Mnong, Central -> Central Mnong
acm -> Iraqi Arabic -> Mesopotamian Arabic
jam -> Jamaican Patois -> Jamaican Creole English
mhr -> Meadow Mari -> Eastern Mari
mrj -> Hill Mari -> Western Mari
dtp -> Central Dusun -> Kadazan Dusun
wuu -> Shanghainese -> Wu Chinese
yue -> Cantonese -> Yue Chinese
pes -> Persian -> Iranian Persian
ell -> Greek -> Modern Greek
pms -> Piedmontese -> Piemontese
tpw -> Old Tupi -> Tupí
I propose zlm -> Malay (Vernacular) stay as it is. In ISO 639-3, it is listed as "Malay (individual language)", which could be confusing.
Similarly, I think kek -> Kekchi (Q'eqchi') should remain as-is for visibility.
ori -> Odia (Oriya) is another special case that I think should stay.
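To make the "rename most, keep a few" idea concrete, here is a small Python sketch of applying a batch of proposed names while protecting an explicit keep-list. The function `apply_renames` and the dictionary layout are hypothetical and are not part of Tatoeba's code; the sample names are copied from the lists in this thread.

```python
def apply_renames(current, proposed, keep):
    """Return a new code -> name mapping with the proposals applied.

    Codes listed in `keep` retain their current name even when a
    proposal exists for them. Unknown codes in `proposed` are ignored.
    """
    result = dict(current)
    for code, new_name in proposed.items():
        if code in result and code not in keep:
            result[code] = new_name
    return result


# A few sample entries only; a real run would cover every language code.
current_names = {
    "cmn": "Chinese (Mandarin)",
    "nob": "Norwegian (Bokmål)",
    "zlm": "Malay (Vernacular)",
    "kek": "Kekchi (Q'eqchi')",
    "ori": "Odia (Oriya)",
}
# The ISO 639-3 names of the three special cases are included on purpose,
# to show that the keep-list overrides them.
proposed_names = {
    "cmn": "Mandarin Chinese",
    "nob": "Norwegian Bokmål",
    "zlm": "Malay (individual language)",
    "kek": "Kekchí",
    "ori": "Oriya (macrolanguage)",
}
keep_as_is = {"zlm", "kek", "ori"}

renamed = apply_renames(current_names, proposed_names, keep_as_is)
for code in sorted(renamed):
    print(f"{code} -> {renamed[code]}")
```

Keeping the exceptions in a separate set means the proposal list can be regenerated from the ISO 639-3 data at any time without losing the deliberate deviations.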