adobe-fonts / source-han-sans

Source Han Sans | 思源黑体 | 思源黑體 | 思源黑體 香港 | 源ノ角ゴシック | 본고딕

U+30FB and U+2027 are not full-width #295

Open RuixiZhang42 opened 3 years ago

RuixiZhang42 commented 3 years ago

Description

Source Han Serif mirror issue: https://github.com/adobe-fonts/source-han-serif/issues/93

Using the language-specific OTFs (the full 65,535-glyph fonts, not the subset OTFs), the character ・ (U+30FB, Katakana Middle Dot) is sometimes rendered proportional-width, but it should always stay full-width. Here are the steps to reproduce this bug:

  1. Type ・ into any layout program.
  2. Use any one of the three Chinese-oriented OTFs (SC, TC, or HC) to render this character.
  3. Either language-tag it with ZHS, ZHT, or ZHH (under any of the scripts latn, grek, cyrl, kana, hang, or hani), or just use the font’s default script and language. The character is rendered full-width.
  4. Switch the language tag to JAN or KOR; the character becomes proportional-width.
  5. With the J or K versions of the OTFs, however, U+30FB stays full-width.

Similarly, the character ‧ (U+2027, Hyphenation Point) has exactly the same problem.

Bug analysis

By default, to render either U+30FB or U+2027, the 5 OTFs (SC, TC, HC, J, and K) all use cid1644 (full-width).

To render · (U+00B7, Middle Dot), SC, TC, and HC still use cid1644 (full-width). However, J and K use cid117 (proportional-width).

To render • (U+2022, Bullet), SC, TC, and HC still use cid1644 (full-width). However, J and K use cid733 (proportional-width, but a different glyph from cid117).

The lookup tables cn2jp, cn2kr, tw2jp, tw2kr, hk2jp, and hk2kr all contain the following line:

  substitute \1644 by \733;

So this is the source of the problem:

  1. This substitution is needed: when a user types • (U+2022), we want it to be full-width in ZHS, ZHT, and ZHH, but proportional-width in JAN and KOR.
  2. But this simple substitution causes two problems:
    1. When a user types either ・ (U+30FB) or ‧ (U+2027) using the SC, TC, or HC font and language-tags it with JAN or KOR, the substitution to proportional-width still happens, although both glyphs should stay full-width.
    2. When a user types · (U+00B7) using the SC, TC, or HC font but language-tags the character with JAN or KOR, the result is cid733. With the J or K font, the result is cid117. Both should be cid117.
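
The analysis above can be replayed with a small toy model. This is only a sketch: the glyph IDs and the `\1644` to `\733` substitution are taken from this issue, while the dictionaries and function names are made up for illustration and do not read any real font file.

```python
# Toy model of the behaviour described in this issue.
# CIDs come from the issue text; all names here are illustrative only.

# Default cmap entries of the SC font and of the J font.
CMAP_SC = {0x00B7: 1644, 0x2022: 1644, 0x2027: 1644, 0x30FB: 1644}
CMAP_J = {0x00B7: 117, 0x2022: 733, 0x2027: 1644, 0x30FB: 1644}

# The cn2jp lookup line: "substitute \1644 by \733;"
CN2JP = {1644: 733}


def render_sc_tagged_jan(codepoint):
    """CID produced by the SC font when the text is language-tagged JAN."""
    cid = CMAP_SC[codepoint]
    return CN2JP.get(cid, cid)


def mismatches():
    """Code points where SC tagged JAN disagrees with the J font.

    Returns {codepoint: (cid_from_sc_tagged_jan, cid_from_j_font)}.
    """
    result = {}
    for cp, want in CMAP_J.items():
        got = render_sc_tagged_jan(cp)
        if got != want:
            result[cp] = (got, want)
    return result


for cp, (got, want) in mismatches().items():
    print(f"U+{cp:04X}: SC tagged JAN gives cid{got}, J font gives cid{want}")
```

In this model only U+2022 comes out the same both ways; U+00B7, U+2027, and U+30FB all end up as cid733, which matches the two problems listed above.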

RuixiZhang42 commented 3 years ago

About U+00B7

BTW, Simplified Chinese (ZHS) should not use cid1644 (full-width) to render U+00B7. According to GB/T 15834-2011, U+00B7 is recommended to be used as the “separator mark” (§ 4.14 and ¶ 4.14.3.5) and it should be half-width (¶ 5.1.7). However, there are some caveats:

  1. None of the major foundries in mainland China (Founder Type, Hanyi, etc.) follow GB/T 15834-2011. They make U+00B7 full-width, likely for backward compatibility with Founder’s own layout software.
  2. GB/T 15834-2011 contradicts itself by using full-width U+00B7 everywhere in § 4.14.
  3. Also, the so-called “半角” (“half-width”) can sometimes be interpreted as “proportional-width”. For example, when GB/T says “半角数字” (“half-width figures”) it doesn’t always mean “the figures must be exactly half-width”. Sometimes it just means “single-byte figures that are encoded in the ASCII range, not double-byte figures that are encoded in the Halfwidth and Fullwidth Forms Unicode block”.

In view of these caveats, Source Han Serif SC actually maps a proportional glyph to U+00B7. Perhaps Source Han Sans SC can do the same, i.e.,

For UniSourceHanSansCN-UTF32-H, merge the following three lines:

line    68: <000000b7> 1644
line 11825: <000000ae> <000000b6> 108
line 11826: <000000b8> <000000ff> 118

into one line in the 100 begincidrange block:

<000000ae> <000000ff> 108 %% This maps U+00B7 to cid117 for Simplified Chinese,
                          %% so that Source Han Sans SC behaves the same as Source Han Serif SC.
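
The arithmetic behind the merged range can be checked with a short script. This is a minimal sketch that models each CMap entry as a `(first_code, last_code, first_cid)` tuple; the tuple layout and the `lookup` helper are assumptions made for this check, not part of any CMap tooling.

```python
# Each entry is (first_code, last_code, first_cid). Codes inside a range get
# consecutive CIDs starting at first_cid. The values are the lines quoted above.
BEFORE = [
    (0x00B7, 0x00B7, 1644),  # line 68: single code point mapped to cid1644
    (0x00AE, 0x00B6, 108),   # line 11825
    (0x00B8, 0x00FF, 118),   # line 11826
]
AFTER = [
    (0x00AE, 0x00FF, 108),   # proposed merged range
]


def lookup(ranges, code):
    """Return the CID for `code`, or None if no entry covers it."""
    for first, last, first_cid in ranges:
        if first <= code <= last:
            return first_cid + (code - first)
    return None


# Collect every code in U+00AE..U+00FF whose CID differs after the merge.
changed = {
    code: (lookup(BEFORE, code), lookup(AFTER, code))
    for code in range(0x00AE, 0x0100)
    if lookup(BEFORE, code) != lookup(AFTER, code)
}

for code, (old, new) in changed.items():
    print(f"U+{code:04X}: cid{old} -> cid{new}")
```

Under this model the only code point whose CID changes is U+00B7, which moves from cid1644 to cid117 (108 + 0xB7 − 0xAE); every other code in the range keeps its CID.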

In Taiwan and Hong Kong, U+00B7 is usually full-width. But I can’t find official standards that require it to be full-width.

About U+30FB and U+2027

AFAIK, Taiwan and Hong Kong users prefer U+2027 as the “separator mark”, but they will occasionally use U+30FB too. Japanese texts use U+30FB as the “separator mark”.

In any case, U+30FB and U+2027 should stay full-width when switching language tags.

About U+2022

Japanese and Korean texts don’t use U+2022 as the “separator mark”, and thus it makes sense to keep this character proportional for JAN and KOR.

I’m not aware of official standards from mainland China, Taiwan, or Hong Kong that require U+2022 to be full-width. But users from these regions may expect this character to be full-width, because of decades of exposure to local foundries’ practice.

RuixiZhang42 commented 3 years ago

Pictures are worth a thousand words:

  1. [Image: SourceHanSansBugs]

  2. [Image: SourceHanSansJOTF]

  3. [Image: SourceHanSansSCOTF]

RuixiZhang42 commented 3 years ago

This could well be a systematic error, which could potentially affect many more code points beyond just U+00B7, U+2022, U+2027, and U+30FB.

Well… You know there’s a saying “mathematicians love to generalize things”? So… Here is a sufficient condition for this bug to appear with other code points:

Definitions and goal

[Image: analysis1]

Sufficient condition for this bug

[Image: analysis2]
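
As a rough illustration of how other affected code points might be found, here is one condition that follows from the bug analysis earlier in this thread; it is not necessarily the condition stated in the images. If several code points share one glyph in the source-language font, but the target-language font maps them to different glyphs, then a single one-to-one glyph substitution cannot be right for all of them. The cmap data below is limited to the four code points discussed in this issue, and `find_conflicts` is a hypothetical helper, not part of any font tool.

```python
from collections import defaultdict

# Default cmap entries from this issue (SC font vs. J font).
CMAP_SC = {0x00B7: 1644, 0x2022: 1644, 0x2027: 1644, 0x30FB: 1644}
CMAP_J = {0x00B7: 117, 0x2022: 733, 0x2027: 1644, 0x30FB: 1644}


def find_conflicts(cmap_src, cmap_dst):
    """Find source glyphs that no single substitution can handle correctly.

    Returns {source_cid: sorted code points sharing that glyph} for every
    source glyph whose code points map to more than one glyph in cmap_dst.
    """
    by_glyph = defaultdict(list)
    for cp, cid in cmap_src.items():
        by_glyph[cid].append(cp)

    conflicts = {}
    for cid, cps in by_glyph.items():
        targets = {cmap_dst[cp] for cp in cps if cp in cmap_dst}
        if len(targets) > 1:
            conflicts[cid] = sorted(cps)
    return conflicts


for cid, cps in find_conflicts(CMAP_SC, CMAP_J).items():
    names = ", ".join(f"U+{cp:04X}" for cp in cps)
    print(f"cid{cid} is shared by {names} but needs different targets")
```

Running the same scan over the complete cmaps of each font pair (SC/J, SC/K, TC/J, and so on) would list candidate glyphs to inspect; whether each one is a real bug still depends on what the language-specific lookups do with it.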