WICG / handwriting-recognition

Handwriting Recognition Web API Proposal
https://wicg.github.io/handwriting-recognition/

Use proper BCP 47 language tags for Chinese #1

Closed r12a closed 2 years ago

r12a commented 4 years ago

https://github.com/WICG/handwriting-recognition/blob/main/explainer.md

Languages are identified by IETF BCP 47 language tags (e.g. en, zh-CN). If there's no dedicated models for that language tag, the recognizer falls back to the macro language (zh-CN becomes zh).

zh-CN is presumably meant to indicate Simplified Chinese, which is also used in Singapore. That's why it is better to use zh-Hans as the language tag, rather than zh-CN (and zh-Hant, rather than zh-TW).

Please change the example.
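
The fallback described in the quoted explainer text (zh-CN becomes zh) can be sketched as progressive truncation of the tag, in the spirit of RFC 4647 lookup. This is only an illustration: `fallbackChain`, `resolveModel`, and the model list are hypothetical names, not part of the proposal, and a real recognizer may resolve tags differently.

```typescript
// Hypothetical list of tags for which a recognizer ships dedicated models.
const availableModels: string[] = ["en", "zh", "zh-Hant"];

// Returns the lowercased tag followed by progressively shorter prefixes,
// e.g. "zh-Hans-CN" -> ["zh-hans-cn", "zh-hans", "zh"].
function fallbackChain(tag: string): string[] {
  const subtags = tag.toLowerCase().split("-");
  const chain: string[] = [];
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags.pop();
    // A trailing single-character subtag (such as "x") is meaningless
    // on its own, so drop it together with the subtag it introduced.
    if (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
      subtags.pop();
    }
  }
  return chain;
}

// Picks the most specific available model for a tag, or undefined if none matches.
// Language tags are case-insensitive, so the comparison ignores case.
function resolveModel(tag: string, models: string[]): string | undefined {
  for (const candidate of fallbackChain(tag)) {
    const match = models.find((m) => m.toLowerCase() === candidate);
    if (match !== undefined) return match;
  }
  return undefined;
}

console.log(resolveModel("zh-CN", availableModels));
console.log(resolveModel("zh-Hant-TW", availableModels));
```

Note that plain truncation sends zh-CN to zh without ever considering zh-Hans, which is the gap this issue is about.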

wacky6 commented 3 years ago

I agree. zh-CN is not the technically correct way of unambiguously specifying a language, but it's arguably more commonly used on the Web (Accept-Language, navigator.languages).

So I'd keep it in the example. In the meantime, I've created a PR to include a script subtag in the example.

See https://github.com/WICG/handwriting-recognition/pull/8 (in conjunction with https://github.com/WICG/handwriting-recognition/issues/2)

r12a commented 3 years ago

I think the time when it was commonly used on the Web as a way to refer to Simplified Chinese was many years ago. Nowadays zh-Hans works fine pretty well everywhere. And by using zh-CN prominently in your example you only promote the incorrect usage. So i still think you should change it. I can refer this issue to the i18n WG if you like.

I think it's fine to mention zh-CN as something that a user may type in, but which should be interpreted to mean zh-Hans, which you do in #8. But i think it should be framed to look like the recogniser is correcting incorrect input (which it is, since the actual script/orthography is very important for handwriting). I see an implication in the quoted text above (esp. because it doesn't even mention zh-Hans) that zh-CN is an appropriate way of referring to SC. It's really not. It's only appropriate if the language tag ignores script information and actually focuses on the region – which it may do, for example, when what's important is the spoken language (although that's problematic wrt zh too unless there's an implicit association of zh with cmn), or the locale (eg. for location services, legal reasons, etc.)

wacky6 commented 3 years ago

I wouldn't say using "zh-CN" here is incorrect, given:

  1. The attribute is language, not script. Language is a broad term. "zh-CN" basically means "Chinese used in Mainland China".

    • In fact, Simplified Chinese, Traditional Chinese and the Latin alphabet are all used in Mainland China.
    • Assuming we promote "zh-Hans", would "Hans" exclude Latin-alphabet characters from the recognizer? I'm not sure.
    • We don't want the API to say "you need to unambiguously specify all the scripts". It's probably more confusing than just specifying the region.
  2. The script can be determined by using some established rules (e.g. Unicode likely subtags). "zh-CN" is interpreted as "zh-Hans-CN", though this precise interpretation may be undesirable (see the point above).

    • The recognizer may have to include more scripts (i.e. Latn + Hans / Hani).
  3. From an API ergonomics point of view, we don't want to give developers the impression that they need to (or should) convert "zh-CN" to "zh-Hans" in order to use the API correctly.

    • I don't know of a simple way to get the script for any language tag in the browser. My feeling is developers will use "zh-CN" (even if it's technically incorrect for a script), as long as it works (if the browser interprets it reasonably).
    • It's perfectly okay for a website to target users in a region and not worry about the exact script being used (and let the browser deal with it). The recognizer is free to (and should) find out the scripts appropriate for that region, and include all of them.
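
As an illustration of the likely-subtags rule mentioned in point 2, the ECMAScript `Intl.Locale` API exposes `maximize()`, which fills in missing subtags from CLDR data. The sketch below assumes a runtime with `Intl.Locale` and full locale data; `likelyScript` is a hypothetical helper name, not something the proposal defines.

```typescript
// Returns the script subtag that the likely-subtags data associates with a tag,
// or undefined if the runtime cannot infer one.
function likelyScript(tag: string): string | undefined {
  return new Intl.Locale(tag).maximize().script;
}

// Print the maximized form of a few Chinese tags discussed in this thread.
for (const tag of ["zh-CN", "zh-TW", "zh-Hans", "zh-Hant"]) {
  console.log(tag, "->", new Intl.Locale(tag).maximize().toString());
}

console.log(likelyScript("zh-CN")); // "Hans", matching the zh-Hans-CN reading above
```

This yields a single most likely script per tag, so it does not by itself answer whether a recognizer should also accept Latin characters.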

wacky6 commented 3 years ago

Hi @r12a , we have a question about language tag for non-standard "languages".

We have handwriting models for recognizing geometric shapes and/or user gestures (e.g. a square). What language tag could we use for this case?

I see there is a "zxx" primary tag for "No linguistic content; Not applicable". Is it suitable? For example, we could use "zxx-Shape" for the above recognizer. Or are private-use subtags more suitable?

r12a commented 3 years ago

I think it's best to avoid private subtags if at all possible, and zxx may indeed be what you need, but i refer this question to @aphillips, since he's a co-author of BCP-47.

aphillips commented 3 years ago

There are really two choices that occur to me here. One is to use zxx. The other would be und (Undetermined). The und tag is usually imputed to content with no language tag and it is used in CLDR and locale systems (such as JS's Intl.Locale) to mean the "root" locale. This might be more like what you intend. Regardless of what primary language subtag you choose, you should not use invalid tags such as zxx-shape. You might use a private-use tag, though, such as zxx-x-shape or und-x-symbols.
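
The shape of the private-use tags suggested above can be sketched as follows. In BCP 47, private-use subtags follow the singleton `x` and are each 1 to 8 ASCII alphanumerics. `withPrivateUse` is a hypothetical helper written for illustration only, and it does not validate the primary language subtag.

```typescript
// Each private-use subtag is 1 to 8 ASCII letters or digits.
const PRIVATE_USE_SUBTAG = /^[A-Za-z0-9]{1,8}$/;

// Hypothetical helper: appends private-use subtags to a primary language
// subtag, e.g. ("zxx", "shape") -> "zxx-x-shape".
function withPrivateUse(primary: string, ...subtags: string[]): string {
  if (subtags.length === 0) {
    throw new RangeError("at least one private-use subtag is required");
  }
  for (const subtag of subtags) {
    if (!PRIVATE_USE_SUBTAG.test(subtag)) {
      throw new RangeError(`not a well-formed private-use subtag: ${subtag}`);
    }
  }
  return [primary, "x", ...subtags].join("-");
}

const shapeTag = withPrivateUse("zxx", "shape");    // "zxx-x-shape"
const symbolTag = withPrivateUse("und", "symbols"); // "und-x-symbols"
console.log(shapeTag, symbolTag);
```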

wacky6 commented 2 years ago

Closing this issue.

zh-CN and zh-Hans convey different meanings: "zh-CN" means "Chinese as used in mainland China", while "zh-Hans" means "Simplified Chinese regardless of where it's used". Web applications should choose whichever is more suitable for their use cases.

We allow the browser implementation and the underlying recognizer to make reasonable assumptions about the script (considering that different handwriting recognizer implementations identify their models differently).


For shape / user gesture models, we will use a zxx private-use tag ("zxx-x-shape"), following this precedent: MLKit shape detection models.