WICG / translation-api

A proposal for a web translation API

language tag handling needs more attention #11

Open aphillips opened 1 month ago

aphillips commented 1 month ago

Language tag handling

Tentatively, pending consultation with internationalization and translation API experts, we propose the following model. Each user agent has a list of (language tag, availability) pairs, which is the same one returned by translation.supportedLanguages(). Only exact matches for entries in that list will be used for the API.

The proposed mechanisms don't make sense. They require exact tag matches in order to work, whereas the normal way for translation and locale-based mechanisms to work is either BCP47 Lookup or BCP47 Filtering (RFC 4647).

Generally, for this type of API, Lookup is the preferred mechanism, usually with some additional tailoring (the insertion of missing subtags: Intl already provides this).

For example, if a system supports ja and en, then canTranslate() should match requests for en-US, en-GB, ja-JP or ja-u-ca-japanese, but not requests for ena, fr, or zh-Hans.
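
To make that concrete, here is a minimal sketch of BCP47 Lookup (RFC 4647) over that hypothetical ja/en list, using Intl.Locale's maximize() to insert the missing subtags; the helper itself and its truncation details are illustrative assumptions on my part, not part of the proposal:

// Hypothetical supported list; a real implementation would derive this
// from translation.supportedLanguages().
const supported = new Set(["ja", "en"]);

function lookupSupported(requestedTag) {
  // Canonicalize and insert likely subtags, e.g. "en-GB" => "en-Latn-GB".
  let tag = new Intl.Locale(requestedTag).maximize().toString();
  // BCP47 Lookup: progressively truncate subtags from the right.
  while (true) {
    if (supported.has(tag)) return tag;
    const cut = tag.lastIndexOf("-");
    if (cut === -1) return null; // nothing left to try
    tag = tag.slice(0, cut);
    // Per RFC 4647, also drop a stranded single-character subtag
    // (it would have introduced an extension sequence).
    if (/-[0-9a-z]$/i.test(tag)) tag = tag.slice(0, tag.lastIndexOf("-"));
  }
}

lookupSupported("en-US");            // "en"
lookupSupported("ja-u-ca-japanese"); // "ja"
lookupSupported("fr");               // null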

Failing to provide this sort of support would mean that implementations would have to list dozens or hundreds of tags that they "support" and/or would require the caller to massage the tag (instead of passing it along blindly). This is especially problematic in the download scenario, in which a site might trigger dozens of spurious downloads due to minor mutations of the language tag.

Note: a deeper discussion, possibly in a joint teleconference, might be useful here.

aphillips commented 1 month ago

I'll add some additional color here as a personal comment.

Note that there is a tension between source and target language tags. Most translation systems can consume a variety of different orthographic variations of a language to produce a given target language. For example, a language arc such as en=>fr might be able to consume both en-US and en-GB flavo(u)rs of the language (as well as similar-yet-different varieties, such as en-CA, en-AU, etc.) to produce some form of French. In most cases, that language arc will produce a specific form of French, e.g. fr-FR. It might be important to describe very specifically the source and target varieties for users selecting which language arc/language model to download, but equally important not to discriminate between these varieties in the API at runtime (where the additional specificity does more harm than good, such as repeated requests to download additional models, which turn out to be identical to the one already installed).

Note that script and macrolanguage differences remain important here, even when the language tags don't always specify the script. For example, a zh=>en language arc is probably supporting zh-Hans=>en rather than "any" variety of Chinese, since Simplified Chinese is most common. Similarly, tags such as zh-CN, zh-TW, zh-SG, zh-HK or zh-MO each imply a script subtag (zh-Hans-CN, zh-Hant-TW, etc. [they also imply that the language in question is cmn and not, for example, yue]). Allowing implementations to do matching or best-fit matching in canTranslate is probably more helpful than making the API list all the potential variations of language tags (in practice, all of the zh tags are either zh-Hant or zh-Hans, with the region indicating locale differences).
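
To illustrate (a sketch, not part of the proposal): the likely-subtags data needed for this already ships with the platform, so an implementation can recover the implied script instead of enumerating every zh-* tag:

for (const tag of ["zh-CN", "zh-TW", "zh-SG", "zh-HK", "zh-MO"]) {
  console.log(tag, "=>", new Intl.Locale(tag).maximize().script);
}
// zh-CN => Hans, zh-TW => Hant, zh-SG => Hans, zh-HK => Hant, zh-MO => Hant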

domenic commented 2 weeks ago

Thanks very much for your comments here. I have learned many new things. Let me try to get more concrete and propose a solution, to see if I've understood correctly.

First, we have to recognize that the ground truth of what is supported is a per-user-agent set of machine learning translation models. These models could have more specific or less specific capabilities, depending on how they were trained. Some semi-realistic examples:

(Apologies for my lack of knowledge of Chinese... I hope it doesn't sidetrack the examples too badly.)

Given this sort of ground truth, we need an algorithm that takes in arbitrary language tags source and target supplied by the web developer, and then selects the right translation model to use (or download) from this predefined list.

Here is one guess at such an algorithm:

I think this algorithm works pretty well, although I'm still fuzzy on the best way to set up the list of supported language pairs. For example, if we set up ja-*-*-*-*-*-* => en-Latn-US for the second translation model, then I think the algorithm would return that translation model if given source = "ja-Brai". So probably my parenthetical note about having several script-specific entries is better?
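
For concreteness, here is a rough sketch of that kind of wildcard matching, reading the 7 segments positionally (purely illustrative; this is a guess at the semantics, not the proposal's algorithm):

function matchesPattern(pattern, tag) {
  const want = pattern.split("-");
  const have = tag.split("-");
  // "*" matches any subtag, including an absent one.
  return want.every((seg, i) => seg === "*" || seg === (have[i] ?? ""));
}

matchesPattern("ja-*-*-*-*-*-*", "ja-Brai"); // true: the wildcard entry
// happily claims Braille-script Japanese, hence the concern above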

aphillips commented 2 weeks ago

Lots to unpack here. I think it would help to get more involvement from the translation/localization community, who deal with these issues on a daily basis.

General notes to help the conversation along:

we have to recognize that the ground truth of what is supported is a per-user-agent set of machine learning translation models

There are two problems here: selection and description. Selection refers to the (sometimes human-involved) process of choosing which language arcs can be applied to a given input text and then employing the best one for the task. Description involves making clear the internal settings/limitations of a given language arc.

For example, an arc such as en => fr might support any English variety, including either US or UK/International orthographic variations (it doesn't need to care how you spelled jail/gaol or colo(u)r, and it can deal with you calling it the sidewalk or the pavement). The output, obviously, will be in French. But which French? It might use fr-FR as "Standard French". Does it also use the fr-FR locale to format dates and numbers included in the translation? What if the user prefers a regionally variant formatting, such as fr-SN (French/Senegal)? At the same time, in many cases, since the translation will not be exact, users might not care about the many options.

If there is a reverse language arc available, the output will not just be en, since it must at least choose between en-US and en-001 (aka en-GB) orthographic variations (just as the French one had to choose between, say, fr-FR and fr-CA). I'm simplifying here, so don't quote the examples against me πŸ˜„

The en-001 in my example points up something else about MT language arc models. Many use "artificial" languages to be more general (or because MT cannot be so precise). For example, es-419 (Latin American Spanish) is a language spoken by no one, but read/consumed by many Spanish speakers. Modern Standard Arabic (ar) is a language that is both written and read by very many Arabic speakers, but not exactly anyone's spoken language (it's complicated). Our technology isn't so good that translations produce idiomatic replacements ("pot calling a kettle black" => "λ˜₯ 묻은 κ°œκ°€ 겨 묻은 개 λ‚˜λ¬΄λž€λ‹€", a dog stained with poo laughing at a dog stained with chaff; I copied the Korean from elsewhere, so it might be horribly wrong).

You might support this by using lists of tags on either side of the arc description or by using language ranges. Users might prefer labels like en => fr, with some information that it's really en (en-001, en-US, en-GB, en-AU, en-AE, ...) => fr (fr-FR).
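
For instance, an arc description along those lines might look like the following (the record shape and field names are my own invention, purely illustrative):

const arc = {
  label: "en => fr",
  // Ranges the model is known to consume well on the source side.
  sourceRanges: ["en-001", "en-US", "en-GB", "en-AU", "en-AE"],
  // The specific variety the model actually produces.
  target: "fr-FR",
};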

domenic commented 1 day ago

Thanks again for your help. I appreciate your general notes and corrections. I used the expanded 7-segment format for the extended language tags because I otherwise found it confusing, but I appreciate that people who have more experience in the field don't need that.

I agree with your framing of selection vs. description. In terms of the API I think that comes down to:

Anyway, I think I got too ambitious trying to give hypothetical examples and a full algorithm. Let me try to be more concrete. I'll focus just on description for now to scope it down further.

Let's say I was going to ship a translation API representing the capabilities of Google Translate's Japanese to English mode. Here are some representative inputs and outputs:

Input => Output
元気ですか? => How are you?
γ’γ‚“γγ§γ™γ‹οΌŸ => How are you?
ゲンキデスカ => Are you well?
genkidesuka => how are you
β ›β ‘β β Šβ …β Šβ ™β ‘β Žβ ₯⠅⠁ => β ›β ‘β β Šβ …β Šβ ™β ‘β Žβ ₯⠅⠁
π˜π‡π€π†π—π†π”π‡ππŠπ—π€ => π˜π‡π€π†π—π†π”π‡ππŠπ—π€
γ„γ„£γ„Žγ„§ γ„‰γ„œγ„™γ„¨ γ„Žγ„šοΌŸ => γ„γ„£γ„Žγ„§ γ„‰γ„œγ„™γ„¨ γ„Žγ„šοΌŸ
硐び぀き => connection
2ドル => 2 dollars
携帯 => cell phone
色 => color
いろ => colour
iro => iro
irohaaoudesu => The color is blue.

What should the answers be to the following, in your opinion?

canTranslate("ja", "en");           // Presumably this should work

canTranslate("ja", "en-US");        // "color" (like 色)
canTranslate("ja", "en-GB");        // "colour" (like いろ); "mobile phone" instead of "cell phone"
canTranslate("ja", "en-SG");        // "2 dollar" instead of "2 dollars"
canTranslate("ja", "en-150");       // "mobile" instead of "cell phone"

canTranslate("ja", "en-GB-oed");    // I think this would require 硐び぀き => "connexion"

canTranslate("ja", "en-Latn");      // Should this work?
canTranslate("ja", "en-Brai");      // Presumably should not work
canTranslate("ja", "en-Dsrt");      // Presumably should not work

canTranslate("ja", "en-x-pirate");  // Presumably should not work, unless we blanket grant x-?
canTranslate("ja", "en-x-lolcat");  // Presumably should not work, unless we blanket grant x-?

// Various unknown subtags cases, how should these work?
canTranslate("ja", "en-asdf");
canTranslate("ja", "en-x-asdf");
canTranslate("ja", "en-US-asdf");
canTranslate("ja", "en-US-x-asdf");
canTranslate("ja", "en-asdf-asdf");

canTranslate("ja-JP", "en");        // Presumably this should work
canTranslate("ja-JP-Jpan", "en");   // Should this work, or is it bad because of the Suppress-Script?
canTranslate("ja-JP-Hrkt", "en");   // Should this work? It seems to.
canTranslate("ja-Kana", "en");      // Should this work? It seems to.
canTranslate("ja-Latn", "en");      // Should this work? It did for "genkidesuka"/"irohaaoudesu" but not for "iro".

canTranslate("ja-Braille", "en");   // Presumably shouldn't work ("β ›β ‘β β Šβ …β Šβ ™β ‘β Žβ ₯⠅⠁" example)
canTranslate("ja-Bopo", "en");      // Presumably shouldn't work ("γ„γ„£γ„Žγ„§ γ„‰γ„œγ„™γ„¨ γ„Žγ„šοΌŸ" example)
canTranslate("ja-Dsrt", "en");      // Presumably shouldn't work ("π˜π‡π€π†π—π†π”π‡ππŠπ—π€" example)

// Using the rarely-used jpx "collection" tag; should it work?
canTranslate("jpx-ja", "en");
canTranslate("jpx-Jpan", "en");

// Unusual/unknown subtag cases; how should they work?
canTranslate("ja-KR", "en");
canTranslate("ja-US", "en");
canTranslate("ja-asdf", "en");
canTranslate("ja-Jpan-JP-x-osaka", "en");
canTranslate("ja-JP-u-ca-japanese", "en");
canTranslate("ja-x-kansai", "en");
canTranslate("ja-JP-u-sd-jpjp", "en");

If you think there's a clear algorithm that resolves a lot of these cases, feel free to suggest that instead of answering each one.

aphillips commented 1 day ago

For the source languages in your examples, all of the ja tags match whatever ja-* tagged language arcs are installed.

The longer tags present some questions. (Note that ja-JP-Jpan etc. should be ja-Jpan-JP etc., and that ja-Braille is not valid; presumably ja-Brai is intended.)

If the user specifies a regional variation on the source side, they might want the tag to fall back when matching (that is, use BCP47 Lookup), because the source language is not visible in the output and because translation engines are usually less sensitive to linguistic variations. If the text is written in a non-default script, the translation engine might prefer that the text be transliterated, or might (as in the Deseret example) not know what to do with it and pass it through. In either case, there is no harm in "losing" the distinctions found on tags like ja-KR or ja-Jpan-JP-x-osaka in order to find the ja=>en engine.

Suppress-Script tags can interfere with matching when matching is done by strict string comparison of the tags. That is, the range ja-Jpan-JP does not match the tag ja-JP because the range is not a prefix of the tag. A range like ja-Latn provides valuable information to the translation engine, but the engine would have to decide whether to do something special with that information.
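
For illustration (an assumption about implementation strategy, not something the proposal specifies), likely-subtags data can neutralize Suppress-Script mismatches before the strings are compared:

new Intl.Locale("ja-Jpan-JP").minimize().toString(); // "ja" (Jpan is the suppressed script)
new Intl.Locale("ja-JP").minimize().toString();      // "ja"
new Intl.Locale("ja-Latn").minimize().toString();    // "ja-Latn" (kept: not the default script)

After minimizing both sides, the range ja-Jpan-JP and the tag ja-JP compare equal.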

Private use sequences (starting with -x-) are usually default-ignorable. Implementations could decide to support specific private use sequences, of course. The -u and -t extensions are also probably ignorable, although -t would give the translation engine a lot of information about the transformation (transliteration of the script) previously applied. On the source side, if you used Lookup, all of your tags would work, even the incorrect ones.
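
As a sketch of that default-ignorable handling (illustrative, not part of the proposal), Intl.Locale already separates the ignorable pieces from the base tag:

const loc = new Intl.Locale("ja-JP-u-ca-japanese");
loc.baseName; // "ja-JP"     (what the matcher would compare)
loc.calendar; // "japanese"  (still available if the engine cares)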

On the target side, there is some question in my mind about what canTranslate means. The language arc ja=>en produces output that a speaker of en-SG, en-GB-oxendict (en-GB-oed is grandfathered and deprecated) or en-asdf-asdf would understand. So, if installed, the result should be yes (er, readily πŸ˜ƒ).

On the other hand, as your examples point out, the additional subtags represent variations that the user might want: US vs. UK spelling variation, or UK vs. OED spelling variation (one variation that oxendict implies is that internationalisation is spelled with a z, i.e. internationalization).

This suggests that script or region subtags (and maybe variants) in the user's specified range should not be ignored. Even if the ja=>en arc can process a request like ja=>en-Brai, it might reasonably reject it, not being able to produce the required transliteration. Locale variation, such as your en-SG example (I do not agree with your expected output), might be applied to formattable values like times, dates, and numbers, but might or might not affect orthographic variation. The list of regional subtags is likely to exceed the range of available variation. Using ja as the example, ja-JP is almost certainly available, but ja-CR (Costa Rica) probably has no meaning?

From a standards perspective, could we say that it is implementation-defined how the matching takes place (implementation here meaning "of the translation engine", not the API)? Google Translate can decide whether it can "readily" handle a given tag as output or not, and the answer might vary depending on the specific language arc.