Closed r12a closed 2 years ago
I agree. zh-CN is not the technically correct way of unambiguously specifying a language, but it's arguably more commonly used on the Web (Accept-Language, navigator.languages).
So I'd keep it in the example. In the meantime, I've created a PR to include a script subtag in the example.
See https://github.com/WICG/handwriting-recognition/pull/8 (in conjunction with https://github.com/WICG/handwriting-recognition/issues/2)
I think the time when it was commonly used on the Web as a way to refer to Simplified Chinese was many years ago. Nowadays zh-Hans works pretty well everywhere. And by using zh-CN prominently in your example you only promote the incorrect usage. So i still think you should change it. I can refer this issue to the i18n WG if you like.
I think it's fine to mention zh-CN as something that a user may type in, but which should be interpreted to mean zh-Hans, which you do in #8. But i think it should be framed to look like the recogniser is correcting incorrect input (which it is, since the actual script/orthography is very important for handwriting). I see an implication in the quoted text above (esp. because it doesn't even mention zh-Hans) that zh-CN is an appropriate way of referring to SC. It's really not. It's only appropriate if the language tag ignores script information and actually focuses on the region – which it may do, for example, when what's important is the spoken language (although that's problematic wrt zh too unless there's an implicit association of zh with cmn), or the locale (eg. for location services, legal reasons, etc.)
I wouldn't say using "zh-CN" here is incorrect, given:
The attribute is language, not script. Language is a broad term. "zh-CN" basically means "Chinese used in Mainland China".
The script can be determined by using some established rules (e.g. Unicode likely subtags). "zh-CN" is interpreted as "zh-Hans-CN", though this precise interpretation may be undesirable (see the point above).
From an API ergonomics point of view, we don't want to give developers the impression that they need to (or should) convert "zh-CN" to "zh-Hans" in order to use the API correctly.
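To make the likely-subtags point concrete, here is a minimal sketch using the standard Intl.Locale API. The helper name likelyScript is hypothetical and not part of the Handwriting Recognition API, and the results depend on the CLDR data shipped with the runtime.

```typescript
// Hypothetical helper (not part of any spec): resolve the script a
// region-based tag most likely implies.
function likelyScript(tag: string): string | undefined {
  // maximize() applies the Unicode "Add Likely Subtags" algorithm.
  return new Intl.Locale(tag).maximize().script;
}

for (const tag of ["zh-CN", "zh-TW", "zh-Hans"]) {
  const locale = new Intl.Locale(tag);
  // locale.script is undefined when the tag carries no explicit script subtag.
  console.log(tag, "explicit:", locale.script, "likely:", likelyScript(tag));
}
```

With this approach a developer can keep passing "zh-CN", and the script is only inferred at the point where the recognizer needs it.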
Hi @r12a, we have a question about language tags for non-standard "languages".
We have handwriting models for recognizing geometric shapes and/or user gestures (e.g. a square). What language tag could we use for this case?
I see there is a "zxx" primary tag for "No linguistic content; Not applicable". Is it suitable? For example, we could use "zxx-Shape" for the above recognizer. Or are private-use subtags more suitable?
I think it's best to avoid private subtags if at all possible, and zxx may indeed be what you need, but i refer this question to @aphillips, since he's a co-author of BCP-47.
There are really two choices that occur to me here. One is to use zxx. The other would be und (Undetermined). The und tag is usually imputed to content with no language tag and it is used in CLDR and locale systems (such as JS's Intl.Locale) to mean the "root" locale. This might be more like what you intend. Regardless of what primary language subtag you choose, you should not use invalid tags such as zxx-shape. You might use a private-use tag, though, such as zxx-x-shape or und-x-symbols.
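As an illustration of how a consumer might read such tags, the sketch below separates the private-use part (everything after the "x" singleton) from the rest of the tag. The helper splitPrivateUse is hypothetical, and it only splits the string; it does not validate the tag against the subtag registry.

```typescript
// Hypothetical helper: split a language tag into the part before the
// private-use singleton "x" and the private-use subtags after it.
// Tags are case-insensitive, so everything is lowercased first.
function splitPrivateUse(tag: string): { base: string; privateUse: string[] } {
  const subtags = tag.toLowerCase().split("-");
  const x = subtags.indexOf("x");
  if (x === -1) {
    // No private-use section at all.
    return { base: subtags.join("-"), privateUse: [] };
  }
  return {
    base: subtags.slice(0, x).join("-"),
    privateUse: subtags.slice(x + 1),
  };
}

console.log(splitPrivateUse("zxx-x-shape"));
console.log(splitPrivateUse("und-x-symbols"));
console.log(splitPrivateUse("zxx-shape"));
```

For zxx-shape the private-use list comes back empty: "shape" sits where a variant subtag would go, and since no such variant is registered, the tag is not valid.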
Closing this issue.
zh_CN and zh_Hans convey different meanings: "zh_CN" means "Chinese as used in mainland China", while "zh_Hans" means "Simplified Chinese regardless of where it's used". Web applications should choose whichever is more suitable for their use cases.
We allow the browser implementation and the underlying recognizer to make reasonable assumptions about the script (considering that different handwriting recognizer implementations identify their models differently).
For shape / user gesture models, we will use a zxx private-use tag ("zxx-x-shape"), following this precedent: MLKit shape detection models.
https://github.com/WICG/handwriting-recognition/blob/main/explainer.md
zh-CN is presumably meant to indicate Simplified Chinese, which is also used in Singapore. That's why it is better to use zh-Hans as the language tag, rather than zh-CN (and zh-Hant, rather than zh-TW).
Please change the example.