Typeability of ZWJ in common ligatures

eggrobin commented 6 months ago

DUTR #‌56 suggests the use of ZWJ to hint ligaturing, see https://www.unicode.org/reports/tr56/#Discretionary_Ligatures.

@crzfub, who has been developing a font that supports those ligatures, pointed out that typing a ZWJ can be tricky. It could make sense to add support for compositions such as d+en for 𒀭‍𒂗 (U+1202D U+200D U+12097); and of course one should also have the various d+suen, d+ellil, etc.
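
To illustrate, such a composition could be modelled as a lookup table from the typed key to the encoded sequence. This is only a sketch: the names `COMPOSITIONS` and `code_points` are hypothetical and do not describe the IME's actual implementation. Only the key `d+en` and the code points U+1202D U+200D U+12097 come from the text above.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Hypothetical composition table (name and structure are assumptions).
COMPOSITIONS = {
    "d+en": "\U0001202D" + ZWJ + "\U00012097",
}

def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

print(code_points(COMPOSITIONS["d+en"]))
# → ['U+1202D', 'U+200D', 'U+12097']
```

Further entries such as d+suen or d+ellil would be added to the same table in the same way.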

crzfub commented 6 months ago

Perhaps including forms such as d+u200D+en as options in the candidates window, or even better: including u200D in any candidate that is a 'sequence' (e.g. 𒈦<u200D>𒄘<u200D>𒃼 ) would be helpful. This would give the user full control over when and where to place typographic ligatures, directly from 𒂗𒈨𒅕𒃸, without actually having to bother with the rather annoying u200D or u200C.
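
A minimal sketch of that second idea, assuming a candidate is represented as a list of signs: the helper `sequence_candidates` is hypothetical, and the choice to list the plain sequence first and the ZWJ-joined variant second is an assumption, not something the candidates window does today.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

def sequence_candidates(signs):
    """Return candidate strings for a list of signs.

    Hypothetical helper: the plain sequence always comes first; a ZWJ-joined
    variant is offered only when there is more than one sign.
    """
    plain = "".join(signs)
    if len(signs) < 2:
        return [plain]
    return [plain, ZWJ.join(signs)]

# The three-sign sequence used as the example above.
example = sequence_candidates(["𒈦", "𒄘", "𒃼"])
```

Here `example[0]` is the plain sequence and `example[1]` is the same sequence with a u200D between each pair of signs.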

I highly doubt that users of cuneiform fonts will (want to) learn about u200D and u200C, and I assume that most unwanted ligatures would otherwise be broken with a space instead. (And ligatures requiring the User to write a u200D wouldn't be used a lot.)

This solution would be significantly more intuitive and consistent across fonts than the fontmaker arbitrarily deciding which ligatures are discretionary and which aren't. [^1]

Further, if an update to Unicode, and subsequently to the font, encodes a former typographic ligature as a unique code point, the fontmaker can change typographic ligatures with the u200D to point to the new unique code point, making the font backwards compatible. Otherwise the font would have to keep a now unnecessary ligature to achieve the same effect.

[^1]: Especially as the typographical ligatures may not be exactly intuitive in the first place.

eggrobin commented 2 weeks ago

Apologies for the delayed response. I had partially responded to this on Discord back in January, but I should probably write something down here too (especially since interesting examples have recently been brought to my attention).

> including u200D in any candidate that is a 'sequence' (e.g. 𒈦<u200D>𒄘<u200D>𒃼 ) would be helpful

ZWJing up every diri is probably not a good idea, as it ends up working against the goals of the encoding model for cuneiform. One major underlying goal of the encoding model is to be compatible with common transliteration practices[^compatibility]. For many of these sequences, it is common practice for the transliteration to be given as a sequence, even if a ligature occurs.

To take a concrete example, the diri sign 𒋛𒀀 has a distinct shape in Hellenistic Uruk[^enrique], see, e.g., https://www.ebl.lmu.de/fragmentarium/MLC.1874 o 4.

However, that ligated 𒋛𒀀 is also used in cases where it is transliterated (and thus would be typed) si-a, such as https://cdli.mpiwg-berlin.mpg.de/artifacts/348467/reader/65783 o 1, or http://oracc.org/blms/P348565.28 r 13′. This last example (Examenstext A) is particularly interesting, as it is lemmatized with the morphology na.m:~;a, and is a witness to a composite which also has a Neo-Assyrian witness (where there is no ligature). The best way to accommodate both transliteration/input practice, and the handling of encoded composite text (such as that 𒅗𒋺𒅗𒍪𒉆𒋛𒀀 from Examenstext A), is for a Hellenistic Uruk font to have a default ligature for 𒋛𒀀 (with no ZWJ involved).
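
As a small illustration of the composite-text point (a sketch, not something taken from the corpora linked above): a ZWJ inside the sequence makes it a different string, so a naive search for the typed si-a no longer finds it unless the joiners are stripped first. The helper `strip_joiners` is hypothetical.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def strip_joiners(text):
    """Remove ZWJ and ZWNJ so marked-up text compares equal to plain text."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

composite = "𒅗𒋺𒅗𒍪𒉆𒋛𒀀"  # sequence quoted above from Examenstext A
plain = "𒋛𒀀"                 # typed as si-a, no ZWJ
joined = "𒋛" + ZWJ + "𒀀"     # same signs with a ZWJ hint

print(plain in composite)                  # True
print(joined in composite)                 # False
print(strip_joiners(joined) in composite)  # True
```

With a default ligature in the font, the plain sequence is all that ever needs to be stored or searched for.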

> Further, if an update to Unicode, and subsequently to the font, encodes a former typographic ligature as a unique code point, the fontmaker can change typographic ligatures with the u200D to point to the new unique code point, making the font backwards compatible. Otherwise the font would have to keep a now unnecessary ligature to achieve the same effect.

Any change to the encoding model is tremendously disruptive to users and implementers at all levels: encoded corpora would need to be updated (in some cases, transliterations may be invalidated, see above), fonts would need to be updated to support the new characters, and new text would fail to match old text in search. I think the UTC would not lightly make such additions; the Unicode 7.0 additions were a fairly special case[^na] as they included clear contrasts (𒈨 vs. 𒎌), and these were still somewhat disruptive; even today one still occasionally finds some bad pre-7.0 encodings.

> the fontmaker arbitrarily deciding which ligatures are discretionary and which aren't.

It’s not the fontmaker being arbitrary, it’s the second-millennium scribe! :-) More seriously, those ligatures are often a property of the style, not the text (which can exist independently of the style of a particular attestation, as in composite texts, words cited in reference works, etc.), and the style is inherently up to the font. The fontmaker therefore should add those ligatures that are almost always used in the target style as default ligatures. For instance, if in some cursive style 𒀭𒂗 is nearly always ligated, it could be appropriate to have a ligature for that sequence even without the ZWJ.

On the other hand, for an Ur III lapidary font, this would be best treated as a discretionary ligature (in which case it could also be controlled by the presence of a ZWJ).

> I highly doubt that users of cuneiform fonts will (want to) learn about u200D and u200C […] ligatures requiring the User to write a u200D wouldn't be used a lot.

Of course no user should have to know about the ZWJ, let alone type it (I am advising users to type zero-width spaces, and providing a way to do so, but at least those have a pretty tangible effect). These implementation details should be hidden from the user (which is what this issue is about, from the IME side), so that, for those relatively standard discretionary ligatures, such as that Ur III lapidary 𒀭𒂗, a d+en composition should be added.

[^compatibility]: While this was not clear from the text of the Unicode Standard until recently, it was well-understood in the proposal documents. The 16.0β review draft includes updated text in Chapter 11 clarifying this: https://unicode.org/versions/Unicode16.0.0/core-spec/chapter-11/#G26959.

[^enrique]: This was recently brought to my attention by Enrique Jiménez.

[^na]: Of course another aspect here is that Neo-Assyrian has a very special status in many reference materials, where it otherwise tends not to require ligatures. Major reference works sometimes shape the encoding model in unexpected ways; I am reminded of this old discussion about CJKV Extension B, which I came across recently while perusing the Unicode mailing list archives: https://www.unicode.org/mail-arch/unicode-ml/y2004-m06/0223.html.

eggrobin / Enmerkar

Typeability of ZWJ in common ligatures #4