WICG / handwriting-recognition

Handwriting Recognition Web API Proposal
https://wicg.github.io/handwriting-recognition/

Definition of grapheme cluster #3

Closed r12a closed 2 years ago

r12a commented 4 years ago

https://github.com/WICG/handwriting-recognition/blob/main/explainer.md#recognition-hints

each string represents a grapheme cluster (a user-visible character)

A user can see all visible characters, but the grapheme (to which the grapheme cluster attempts an approximation) is a user-perceived unit of the orthography, and usually specific to a given editing operation.

You may do better here to indicate that each string contains a 'grapheme' (a user-perceived unit of the orthography), which may also correspond to a Unicode 'grapheme cluster'.

Note that Unicode grapheme clusters don't cover all user-perceived graphemes, especially in many Brahmi-derived scripts.

It would certainly be useful to consider some character groupings as units, e.g. Tamil கு (ku), since it's hard to separate the constituents. Whether it's necessary or desirable to treat Balinese ᬓ᭄ᬱᭀ as a single unit I'm not so sure.

Note, however, that the Balinese, like many complex scripts, will require recognised glyphs to be paired and reordered to compose the actual character sequence (the first and last glyphs above are a single Unicode code point).

hth

r12a commented 4 years ago

In case it helps, you can see the examples at the following URLs, and by clicking on "Show codepoints" (just above the large text box) you can see the underlying sequence of characters.

Tamil: http://r12a.github.io/pickers/taml/?text=%E0%AE%95%E0%AF%81

Balinese: http://r12a.github.io/pickers/bali/?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80
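If the pickers are not to hand, the same code point sequences can be read straight out of the `text` parameter of the two URLs above. A minimal Python sketch using only the standard library; the helper name `codepoints_from_url` is made up for illustration, and the character names printed depend on the Unicode version bundled with your Python build:

```python
import unicodedata
from urllib.parse import parse_qs, urlsplit

TAMIL_URL = "http://r12a.github.io/pickers/taml/?text=%E0%AE%95%E0%AF%81"
BALINESE_URL = "http://r12a.github.io/pickers/bali/?text=%E1%AC%93%E1%AD%84%E1%AC%B1%E1%AD%80"


def codepoints_from_url(url):
    """Hypothetical helper: decode the `text` query parameter and
    return (code point, Unicode name) pairs in memory order."""
    query = parse_qs(urlsplit(url).query)  # percent-decoding defaults to UTF-8
    text = query["text"][0]
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


if __name__ == "__main__":
    for label, url in (("Tamil", TAMIL_URL), ("Balinese", BALINESE_URL)):
        print(label)
        for codepoint, name in codepoints_from_url(url):
            print(f"  {codepoint}  {name}")
```

The Tamil URL decodes to two code points (U+0B95, U+0BC1) and the Balinese one to four (U+1B13, U+1B44, U+1B31, U+1B40).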

wacky6 commented 3 years ago

Updated "grapheme cluster" to "grapheme".

As for whether ᬓ᭄ᬱᭀ should be a single unit, I think if it can't be broken down then it probably should be a single unit.

This aside, I am not sure whether recognizer models (for complex scripts) can handle these subtle differences. If they can't, the graphemeSet hint is essentially ignored.

r12a commented 3 years ago

I should have explained the point about the Balinese in a little more detail. (I'm hoping to create some permanent resources that describe these kinds of issues, but in the meantime I'll write something here.) The tool I pointed to for viewing the Balinese can help you understand this by analysing the text, but for clarity let me point out some of the issues here (and this is by no means a complicated scenario as complex scripts go).

The sequence of characters in memory is:

[Screenshot 2020-11-27 at 10 23 57: the code points in memory order]

When displayed, this results in the following. The black text indicates the first grapheme cluster (2 code points, including one that becomes invisible in this situation, though not in others). The third glyph from the left (SA) is shown as a special conjoined form (which indicates that there is no vowel sound between this and the previous letter). The brown text (all of it) indicates the glyphs associated with the 2nd grapheme cluster (2 code points, one of which - the vowel o~ɔ - is split around the whole consonant cluster).

[Screenshot 2020-11-27 at 10 21 42: the rendered glyphs, first grapheme cluster in black and second in brown]

Note, btw, that the 1st glyph on the left could also represent a different vowel (e~ɛ) were it not (eventually) followed by the final glyph on the right.
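The split vowel is also visible in the Unicode character data. As a hedged illustration (this uses Python's `unicodedata` module and is not part of the proposal), the canonical decomposition of the last code point in the Balinese sequence, U+1B40, can be inspected as follows; the constant and function names are illustrative only:

```python
import unicodedata

SPLIT_VOWEL = "\u1b40"  # last code point of the Balinese example sequence


def describe(text):
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# NFD applies canonical decomposition. For this character it yields two
# code points, corresponding to the left-hand and right-hand vowel glyphs.
decomposed = unicodedata.normalize("NFD", SPLIT_VOWEL)

if __name__ == "__main__":
    print("composed:  ", describe(SPLIT_VOWEL))
    print("decomposed:", describe(decomposed))
```

A recognizer working from glyphs would presumably see the two halves separately and have to pair them up, which is the reordering problem described earlier in the thread.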

Even though this is 2 grapheme clusters, the sequence cannot be broken in the middle at a line end, although other text operations, such as backspacing, do affect only part of the sequence.
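The two-cluster split can be approximated in a few lines of Python. This is only a rough sketch: it attaches every mark (general category M*) to the preceding character, which is an assumption standing in for the full Unicode segmentation rules, and the function name is invented for illustration. A real segmenter may give different results depending on its Unicode version.

```python
import unicodedata

BALINESE = "\u1b13\u1b44\u1b31\u1b40"  # the four code points of the example
TAMIL = "\u0b95\u0bc1"                 # Tamil ku


def approximate_clusters(text):
    """Start a new cluster at every non-mark character and attach any
    following marks to it. A simplification, NOT the UAX #29 algorithm."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # combining or spacing mark joins the current cluster
        else:
            clusters.append(ch)  # anything else opens a new cluster
    return clusters


if __name__ == "__main__":
    for label, text in (("Balinese", BALINESE), ("Tamil", TAMIL)):
        parts = approximate_clusters(text)
        print(label, [[f"U+{ord(c):04X}" for c in part] for part in parts])
```

Under this approximation the Balinese sequence comes out as two clusters of two code points each, and the Tamil one as a single cluster. Neither result says anything about where a line may break or what a user perceives as one unit, which is the point being made here.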

All this is to illustrate the kinds of things that crop up when trying to work out what is written by looking at the visual text of languages written in complex scripts. It's nowhere near as simple as for Latin: a good deal of contextual analysis is needed, multiple visual sequences need to be mapped to the same code points, the number of permutations of glyph combinations can be quite large, and the minimal units used for comparison may need to be smaller or larger than one grapheme cluster.

wacky6 commented 2 years ago

Closing. Terminology has been updated to "grapheme" / "user-perceived character".