WICG / handwriting-recognition

Handwriting Recognition Web API Proposal
https://wicg.github.io/handwriting-recognition/

Language fallbacks #2

Closed r12a closed 2 years ago

r12a commented 4 years ago

https://github.com/WICG/handwriting-recognition/blob/main/explainer.md#recognition-hints

If there are no dedicated models for that language tag, the recognizer falls back to the macro language (zh-CN becomes zh). If the macro language is not supported, the recognizer falls back to the default language of the browser (i.e. navigator.language).

What if the browser default language is set to something that the recogniser cannot deal with?

I suspect that, like for hyphens in CSS, it must be required that a language be selected for this to work, and the selection would be from a list of languages supported by the recogniser.

wacky6 commented 3 years ago

If the fallback language isn't supported at all, I think the recognizer should return null as the prediction result.

This aside, I'd expect developers to check language support with queryHandwritingRecognizerSupport, decide whether the support meets their requirements, and provide a reasonable hint. Providing an unsupported language as a hint means the hint is invalid, and the recognizer can ignore it and do whatever it finds most appropriate.

Not providing a hint means the recognizer should choose an appropriate language from its supported list (it's okay if it picks the wrong one; that just makes the recognition result unusable). Considering language
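A minimal sketch of the developer-side check described above. It assumes the page has already obtained a set of supported language tags by some means (for example from a support query); `chooseLanguageHint` and `supportedTags` are hypothetical names for illustration, not part of the proposed API. Matching here is exact and case-sensitive.

```typescript
// Hypothetical helper, not part of the proposed API: returns the first
// preferred tag that appears in the supported set, or undefined if none does.
function chooseLanguageHint(
  preferred: readonly string[],
  supported: ReadonlySet<string>,
): string | undefined {
  return preferred.find((tag) => supported.has(tag));
}

// Assumed result of a support query; the values are made up for illustration.
const supportedTags = new Set(["en", "zh-Hans"]);
const hint = chooseLanguageHint(["zh-Hant", "zh-Hans"], supportedTags);
// When hint is undefined, the page would pass no language hint at all
// rather than an invalid one.
```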

Created a PR for fallback (if no language is supported): https://github.com/WICG/handwriting-recognition/pull/8

r12a commented 3 years ago
  • If there are no dedicated models for that script, the recognizer falls back to the macro language (zh-Hans falls back to zh).

Is there a terminology issue here? zh is indeed a macrolanguage, but not every language tag that has a script subtag starts with a macrolanguage. They do, however, all start with a 'language subtag'. Perhaps the sentence should say that it falls back to the language subtag, or, even better, that it removes subtags until a match is found. That way zh-Hant-HK could fall back to a generic Traditional Chinese recogniser, or a generic Chinese one.

However, any such fallback will probably work only if the recogniser implicitly associates the language tag with a given script, because the process of recognising is very much tied to the orthography used. This is ok for many languages, and indeed the BCP47 rules actively encourage association of a default script with bare language tags, but it does not hold for all. For example, if az-Arab falls back to az, this is of no help if the Azeri recogniser only works with Cyrillic.
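To make the az-Arab concern concrete, here is a rough sketch of a fallback that accepts a shorter tag only when the matching recogniser declares support for the requested script. The `RecognizerEntry` shape, its `scripts` list, and both function names are assumptions made for illustration; nothing here is taken from the explainer or from any real implementation.

```typescript
// Hypothetical description of one recogniser: its tag and the scripts it handles.
interface RecognizerEntry {
  tag: string;
  scripts: string[]; // ISO 15924 codes in title case, e.g. "Cyrl", "Arab"
}

// Extracts the script subtag (four letters) from a tag, or null if absent.
// Scanning stops at a singleton so extensions and private use are ignored.
function scriptOf(tag: string): string | null {
  for (const part of tag.split("-").slice(1)) {
    if (part.length === 1) break;
    if (/^[A-Za-z]{4}$/.test(part)) {
      return part.charAt(0).toUpperCase() + part.slice(1).toLowerCase();
    }
  }
  return null;
}

// Removes trailing subtags until a recogniser matches AND covers the script.
function resolveWithScript(
  requested: string,
  entries: readonly RecognizerEntry[],
): RecognizerEntry | null {
  const wantedScript = scriptOf(requested);
  const subtags = requested.split("-");
  while (subtags.length > 0) {
    const candidate = subtags.join("-").toLowerCase();
    const entry = entries.find((e) => e.tag.toLowerCase() === candidate);
    if (entry && (wantedScript === null || entry.scripts.includes(wantedScript))) {
      return entry;
    }
    subtags.pop();
  }
  return null;
}
```

With a Cyrillic-only `az` entry, a request for `az-Arab` resolves to `null` instead of silently falling back to a recogniser that cannot read the input.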

wacky6 commented 3 years ago

Indeed a terminology issue on my side. :)

Updated to "remove the last subtag until there is a match, ...", since it's a straightforward rule: https://github.com/WICG/handwriting-recognition/commit/0384cbd4c8f1f973fe54075e8861df32de3af9ab
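The "remove the last subtag until there is a match" rule can be sketched as follows. This is an illustration only: `resolveLanguageTag` and the `supported` list are hypothetical and are not defined by the proposal. Tags are compared case-insensitively, and the sketch does not special-case singleton subtags the way the full BCP 47 lookup algorithm does.

```typescript
// Hypothetical helper: returns the longest prefix of `requested` (by subtags)
// that appears in `supported`, or null when nothing matches.
function resolveLanguageTag(
  requested: string,
  supported: readonly string[],
): string | null {
  // Map lowercased tags back to their original spelling.
  const byLowercase = new Map(
    supported.map((tag): [string, string] => [tag.toLowerCase(), tag]),
  );
  const subtags = requested.split("-");
  while (subtags.length > 0) {
    const match = byLowercase.get(subtags.join("-").toLowerCase());
    if (match !== undefined) {
      return match;
    }
    subtags.pop(); // drop the last subtag and try again
  }
  return null;
}

// zh-Hant-HK is tried first, then zh-Hant, then zh.
const resolved = resolveLanguageTag("zh-Hant-HK", ["en", "zh-Hant"]);
```

Returning `null` when nothing matches mirrors the suggestion earlier in the thread that the recognizer should not produce a prediction for a language it cannot handle.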

As for "associates the language tag implicitly with a given script", I think this is the case for most recognizer implementations. I can't speak for all implementations, but the one we have at Google will attempt to recognize scripts that make sense to appear in that language.

wacky6 commented 2 years ago

Closing. The terminology has been updated.