CJK: "嘅(U+5605)" is not consistent with other words with the same phonetic component

GoogleCodeExporter commented 9 years ago

The word "嘅" (U+5605) is composed of the phonetic component "既" and the 
radical "口" (the mouth), i.e. 嘅 = 口 + 既.

Similar words are 溉, 概 and 慨, where 溉 = 氵 + 既, 概 = 木 + 既, and 
慨 = 忄 + 既.

There are at least 3 common writing forms for 既. They are represented as 
U+65E2(既), U+65E3(旣) and U+FA42(既) in Unicode. Among these three, the 
first one "既" is the most common. So it makes sense to see 溉 composed of 
氵 + 既, 概 composed of 木 + 既, and so on.
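
For anyone who wants to check the three code points, here is a minimal Python sketch using only the standard `unicodedata` module. It relies on the Unicode data bundled with Python; the helper name `describe` is just for illustration. U+FA42 is a CJK compatibility ideograph, so canonical normalization (NFC) folds it into U+65E2, while U+65E3 stays a separate character.

```python
import unicodedata

# The three encoded forms of the component mentioned above.
FORMS = ["\u65e2", "\u65e3", "\ufa42"]

def describe(ch):
    """Return 'U+XXXX -> U+YYYY', where the right side is the NFC form."""
    nfc = unicodedata.normalize("NFC", ch)
    return f"U+{ord(ch):04X} -> U+{ord(nfc):04X}"

for ch in FORMS:
    print(describe(ch))
```

So, in practice, only two of the three forms survive normalization as distinct characters.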

Historically "嘅" is a variant of "慨" (U+6168, 忄 + 既). Nowadays, "嘅" 
is widely used by the Cantonese community with a different meaning (it is used 
as a possessive or final particle).

Therefore, the writing form of "既" in "嘅" should be consistent with the 
other words with the same phonetic component.

However, in Noto CJK, it is composed of 口 + 旣 (U+65E3), which makes its 
form inconsistent with the other words.
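
The inconsistency can be stated as a small check. This is only a sketch: the table is entered by hand from the description in this report, not read from the font, and the names `COMPONENT_FORM` and `outliers` are made up for illustration.

```python
from collections import Counter

# Which encoded form of the phonetic component each TC glyph is drawn with,
# as described in this report (hand-entered, not extracted from the font).
COMPONENT_FORM = {
    "溉": "\u65e2",
    "概": "\u65e2",
    "慨": "\u65e2",
    "嘅": "\u65e3",
}

def outliers(forms):
    """Return the characters whose component differs from the majority form."""
    majority, _ = Counter(forms.values()).most_common(1)[0]
    return [ch for ch, form in forms.items() if form != majority]

print(outliers(COMPONENT_FORM))
```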

Attached is a sample of two fonts: one is Noto, the other is MHeiHK-Medium from 
Monotype Hong Kong.

Thank you!

Original issue reported on code.google.com by e.ta...@gmail.com on 16 Jul 2014 at 5:52

Attachments:

ge3.png

GoogleCodeExporter commented 9 years ago

Thanks for reporting the inconsistency. This actually works as intended due to 
technical limitations.

The OpenType format allows at most 65,535 glyphs in a font file. With that, we 
can only accommodate Taiwan Ministry of Education standards for characters in 
the Big5 character set in this CJK font. For characters outside of the Big5 
character set, a glyph common to Traditional Chinese, Simplified Chinese, 
Japanese and Korean (if applicable) is used to save space. Such characters may 
or may not conform to Taiwan MoE's standard. It is the latter in the case of 
"嘅" (U+5605).

Original comment by ping...@google.com on 16 Jul 2014 at 6:53
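
The glyph-selection policy described in the comment above can be sketched roughly as follows. This is an illustration, not the font's build logic: it assumes Python's built-in `big5` codec is a fair stand-in for "the Big5 character set", and the function names and the labels `"taiwan-moe"` / `"shared"` are invented here.

```python
MAX_GLYPHS = 65535  # glyph limit per font file, as stated in the comment above

def in_big5(ch):
    """True if Python's built-in 'big5' codec can encode the character."""
    try:
        ch.encode("big5")
        return True
    except UnicodeEncodeError:
        return False

def tc_glyph_source(ch):
    """Which design the TC font would use under the stated policy."""
    return "taiwan-moe" if in_big5(ch) else "shared"

for ch in "慨嘅":
    print(ch, tc_glyph_source(ch))
```

Under this reading, 慨 gets a Taiwan MoE design while 嘅 falls back to the shared glyph, which is how the two can end up with different forms of the same component.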

GoogleCodeExporter commented 9 years ago

Thank you for your quick response. I'm not sure if I understand you correctly.

1. I am now referring to the typeface dedicated to Traditional Chinese, where 
the Taiwan Ministry of Education standard is followed (as you have said). I am 
not asking for an alternate writing style for the same codepoint.

2. For the case of "嘅", its use as a variant of "慨" is now obsolete. 
Nowadays, 嘅 is used only in Cantonese-speaking regions, including Hong 
Kong, whose writing standard is similar to that of Taiwan. So, I think it makes 
sense to render 嘅 consistently in the TC version of the font.

3. I understand that 嘅 is not in the Big5 charset. That said, I am unaware of 
any region that uses 旣 as the standard form, which would cause the phonetic 
component of 溉概慨嘅 to be rendered as 旣 (白+匕) instead of 既 for the 
sake of consistency. I am no expert on this topic, so I did a little 
searching and used the Noto fonts for other regions to support my assumption: 

a) AFAIK both Traditional and Simplified Chinese use "既" as the standard form.
b) For Japanese, I searched on the net and found that "既" should also be 
their standard form. From the Noto Japanese font, 忄既 (but not 忄旣) is 
used as the writing form of "慨" (がい) . The writing form of "既""旣" and 
"概""槪" are not normalized so they have different codepoints, but there is 
only one codepoint for "慨". If "忄旣" is the standard form it should be 
used in the Japanese version of the font for this codepoint.
c) Korean is where I have least confidence. But from the Noto Korean font, 
"慨" is present but it is also not in the form of "忄旣". Instead it is 
rendered as "忄既". And there is no "口既" in the font.

So I still don't understand why "口旣" is chosen for the Traditional Chinese 
version of 嘅 even for the sake of a common glyph across different regions due 
to a technical limitation. On the contrary, this decision makes this word 
"stand out".

4. Perhaps the most interesting part is that, as you can see in the screenshot, 
"嘅" in Simplified Chinese version of Noto is exactly what I am asking for. So 
it seems that the glyph is already there, but the incorrect(?) one is chosen 
for the Traditional Chinese version. In this case my issue can be rephrased as 
follows: I believe that the glyph of "嘅" in the Traditional Chinese version of 
the font should follow that of the Simplified Chinese version for consistency.

Original comment by e.ta...@gmail.com on 16 Jul 2014 at 9:11

Attachments:

ge3b.png

GoogleCodeExporter commented 9 years ago

Original comment by roozbeh@google.com on 16 Jul 2014 at 6:06

GoogleCodeExporter commented 9 years ago

I went back to the standards and I found that the current glyph actually agrees 
with the standards.

U+5605 is listed in CNS-11643 at code point 3-4636, and here is its page: 
http://www.cns11643.gov.tw/MAIDB/query_general_view.do?page=3&code=4636.
It is written as 口+旣 in the Mingti/Kaiti/Songti samples, and its components 
include 白 and ⼔.

I'm attaching a screenshot of the unicode chart for U+5605. Taiwan, Japan and 
Korea write it in the same way, namely 口+旣, while China and Vietnam write 
it as 口+既.

My conclusion is that the current U+5605 glyph shown in Traditional Chinese 
conforms to standards.

Original comment by ping...@google.com on 16 Jul 2014 at 7:28

Attachments:

[Screen Shot 2014-07-16 at 12.04.44 PM.png](https://storage.googleapis.com/google-code-attachments/noto/issue-38/comment-4/Screen Shot 2014-07-16 at 12.04.44 PM.png)

GoogleCodeExporter commented 9 years ago

Thanks for your clarification. This news is quite astonishing to me because 
this decision sounds irrational:

1) 既 and 旣 are just variants of each other, not different words with 
different meanings.
2) Considering their origin (but not the standard), 溉概槩慨廐暨厩穊 
and 嘅 all use 既 as a component, so their writing style should be 
the same. Mixing different writing styles for the same component in one font 
gives the impression that things are not organized properly.
3) MoE adopts 既 as the standard form (既 is also the standard form in Hong 
Kong).
4) Yet the standardization body in Taiwan chose 口旣 rather than 口既 for the 
word 嘅.

The most ironic thing is that they do not actually use this word (it is a 
"rarely used character" as indicated on their website), yet they made such a 
strange decision. We people in Hong Kong use this word every minute of every 
day and we have no say in it (嘅 doesn't exist in Big5 but is in Big5-HKSCS, as 
it is a frequently used character in HK). 

I am not blaming you guys for this because it isn't your fault, but I still 
want to see if I can do something to clean this up:

1) I can't comment on how it is written in Japan/Korea (although I doubt that 
they use this word differently[1]). But for the TC version of the font, I heard 
that the writing style complies with the standard of the Ministry of Education. 
So I would like to ask: is "嘅" required to be written as "口旣" by the MoE? 
I highly doubt it because it is so rarely used in Taiwan. If it isn't, is it 
possible for you to give special treatment to this word so that its form is 
consistent with other glyphs with the same component? This would benefit 
Cantonese-speaking communities like Hong Kong and would have nearly zero impact 
on other Traditional Chinese communities (because, well, they seldom use it).

2) If you cannot give special treatment to this word because your policy is to 
follow the standard whether it is right or wrong, I would be grateful if 
you could give me any hint on how I can report this issue to the consortium, 
organization or any other body responsible for this matter.

Thanks again.

[1] "嘅" doesn't exist in a Japanese dictionary. AFAIK it is not a Kanji used 
in Japan:
http://dictionary.goo.ne.jp/search.php?IE=UTF-8&MT=%E5%98%85&kind=all&mode=0&SH=1&from=gootop

Korean treats 嘅 as a variant of 慨, according to the Korean Hanja dictionary:
http://hanja.naver.com/search?query=%E5%98%85

Original comment by e.ta...@gmail.com on 17 Jul 2014 at 2:52

GoogleCodeExporter commented 9 years ago

I think the point is that we decided to apply the TW standard to any character 
that is in Big5-HKSCS but outside the Big5 range, in order to make TC 
consistent. Usually that is not a big problem for HK people. However, 嘅 differs 
substantially at the component level, not just the stroke level, and it is a 
high-frequency character in HK (according to this bug report).

Given that 嘅 is outside Big5 and rarely used, we may need to reconsider and 
use the glyph 口既 as an exception for practical reasons.

Original comment by k...@google.com on 17 Jul 2014 at 11:19

GoogleCodeExporter commented 9 years ago

The MOE, in general, uses the KangXi form for words that exist in Plane 3, and 
refuses to correct inconsistencies in glyph components to align them with Planes 
1/2.

I think the standards body cannot be relied on here, especially given its 
indifferent attitude toward words it rarely uses (but which are used in 
other Traditional Chinese-using communities).  Arguably, the MOE has no right to 
decide how to write characters that it discourages using, and for the sake of 
visual consistency, 口既 should be used.

I look forward to Google overturning the decision for this particular character.

Original comment by henry.fa...@gmail.com on 17 Jul 2014 at 12:31

GoogleCodeExporter commented 9 years ago

To gauge the use of 嘅 in Hong Kong, Taiwan, Japan and Korea, albeit 
roughly and neither scientifically nor representatively, I did the following 
Google searches and noted the number of results.
- Hong Kong: https://www.google.com.hk/#q=%22%E5%98%85%22+site:.hk  2,110,000
- Taiwan: https://www.google.com.tw/#q=%22%E5%98%85%22+site:.tw  69,100
- Japan: https://www.google.co.jp/#q=%22%E5%98%85%22+site:.jp  32,300
- Korea: https://www.google.co.kr/#newwindow=1&q=%22%E5%98%85%22+site:.kr  62,600

Besides the sheer difference in order of magnitude in the counts, the top 
entries of the search results for Taiwan, Japan and Korea are mostly Cantonese 
texts. That's an indication that "嘅" is indeed mostly used as a Cantonese 
character and not much in other regions.
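
To put a number on "order of magnitude", here is a quick Python sketch that takes the four counts quoted above and expresses the Hong Kong count as a multiple of each of the others. The counts are copied from this comment; the dictionary keys and the function name `ratio_to` are only for illustration.

```python
# Result counts quoted above (site-restricted searches).
COUNTS = {"hk": 2_110_000, "tw": 69_100, "jp": 32_300, "kr": 62_600}

def ratio_to(base, counts):
    """How many times larger the base region's count is than each other one."""
    return {
        region: round(counts[base] / n, 1)
        for region, n in counts.items()
        if region != base
    }

print(ratio_to("hk", COUNTS))  # {'tw': 30.5, 'jp': 65.3, 'kr': 33.7}
```

These are raw hit counts, so the ratios are only as reliable as the searches themselves.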

With this data and the unicode glyph for Hong Kong (H-9DEF), now I tend to 
agree that it makes sense to use the Hong Kong glyph for 嘅.

Are there other characters in the same class, namely 
1. Traditional Chinese frequently used in Hong Kong but nowhere else, and
2. the glyph does not conform to Hong Kong standard, and
3. it is outside of Big5 character set?

I think it's worth listing them for consideration all together.
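
A rough sketch of how such a list could be screened, in Python. Only criterion 3 is checked mechanically, using Python's built-in `big5` codec as an approximation of the Big5 character set. Criteria 1 and 2 are hand-judged flags; the `Candidate` fields and the function names are hypothetical.

```python
from dataclasses import dataclass

def encodable(ch, codec):
    """True if the given codec can encode the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

@dataclass
class Candidate:
    char: str
    mostly_hk: bool        # criterion 1, judged by hand (hypothetical field)
    matches_hk_form: bool  # criterion 2, judged by hand (hypothetical field)

def needs_review(c):
    """True when all three criteria listed above hold."""
    outside_big5 = not encodable(c.char, "big5")  # criterion 3
    return c.mostly_hk and not c.matches_hk_form and outside_big5

print(needs_review(Candidate("嘅", mostly_hk=True, matches_hk_form=False)))
```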

Original comment by ping...@google.com on 17 Jul 2014 at 4:48

GoogleCodeExporter commented 9 years ago

For now I can only report when I find something doesn't seem right. It would be 
better if there is a checklist, but I don't have one.

That said, I just spotted another word with the same issue, which is U+7740 
(着).
1. Historically, 着 is a variant of 著 (U+8457, which means [a] famous or [b] 
to wear). 
2. In Taiwan, people don't use 着 as 著 is always the preferred form. 着 
isn't in the Big5 table.
3. In Hong Kong, we write 著 for "famous", and write 着 for "to wear". It is 
included in HKSCS.

The problem with 着 is that its upper component isn't consistent with other 
words like 差. 
It uses a component from Simplified Chinese, which has one stroke fewer than 
the Traditional Chinese component.
"目" is composed of 5 strokes, so the T.C. version of 着 adds up to 7+5 = 12 
strokes:
http://www.edbchinese.hk/lexlist_en/result.jsp?id=2757&sortBy=stroke&jpC=lshk

So it would be more appropriate for this word to use the same glyph as the JP/KR 
one. Thanks.

Original comment by e.ta...@gmail.com on 18 Jul 2014 at 3:43

Attachments:

GoogleCodeExporter commented 9 years ago

Re-attach the comparison image

Original comment by e.ta...@gmail.com on 18 Jul 2014 at 3:44

Attachments:

7740.png

GoogleCodeExporter commented 9 years ago

Attached is the unicode chart for U+7740. Indeed the Hong Kong glyph is the 
same as Japan or Korea, but different from China or Taiwan.

I did the same Google search count exercise for U+7740.

Hong Kong: https://www.google.com.hk/#q=%22%E7%9D%80%22+site:.hk  3,290,000
Taiwan: https://www.google.com.tw/#q=%22%E7%9D%80%22+site:.tw  4,030,000
Japan: https://www.google.co.jp/#q=%22%E7%9D%80%22+site:.jp  1,090,000,000
Korea: https://www.google.co.kr/#q=%22%E7%9D%80%22+site:.kr  68,300,000

This data seems to hint that the use of U+7740 isn't dominated by either Hong 
Kong or Taiwan. With that, it's hard to justify using one glyph or another in 
this font.

Original comment by ping...@google.com on 22 Jul 2014 at 8:28

Attachments:

U+7740.png

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

(Sorry, I kept using the wrong account, ignore previous two comments sent via 
email)

According to the principles of standard form from Taiwan MOE 
("國字標準字體研訂原則"), 「上『羊』之中筆分成兩筆」 
"The 羊 at the top should be broken into two strokes", the word 着 should 
have used the H-source / J-source / K-source form 
(http://www.edu.tw/files/site_content/M0001/biau/f61.htm?open). 

着 is deemed a variant of 著 by Taiwan MOE 
(http://140.111.1.40/yitia/fra/fra03506.htm #a03506-003) (in fact, the H-source 
/ J-source / K-source glyph is used, while the current T-source glyph is 
nowhere to be found), thus it is assigned to CNS11643 Third Plane 罕用字 
(Rarely-used Characters) 
(http://www.cns11643.gov.tw/AIDB/query_general_view.do?page=3&code=3757).  
However, CNS11643 refuses to correct representative glyphs for these 
"rarely-used characters" in Plane 3 that deviate from the Taiwan MOE rules on 
the basis that they are rarely used anyway.  Thus, the CNS11643 representative 
glyph itself is not always consistent, the same situation as 嘅.

According to the MOE standard, the word 著 must be used instead of 着. Yet in 
Hong Kong 著 and 着 are (at least in real life) nearly always used in 
different contexts. From a practical point of view I also see no reason to 
adhere to the T-source glyph when this character is actively discouraged, while 
the word is heavily used in Hong Kong.

Please note that in Big5, the character 着 was not typeable until HKSCS was 
introduced. However, on most Hong Kong-based websites, the uptake 
of HKSCS has not been very high. Only recently have sites shifted to 
UTF-8. Most of the older content will contain 著 instead of 着 due to the mapping 
rules of Big5. Meanwhile, many new infotainment sites in Taiwan directly copy 
content from mainland Chinese sites and fail to convert 着 to the MOE-mandated 
著. Not to mention that the population of Taiwan is several times that of Hong 
Kong. Thus the Google search statistics are basically useless for showing 
that Hong Kong's use and Taiwan's use are similar.

Original comment by henry.fa...@gmail.com on 23 Jul 2014 at 3:12

GoogleCodeExporter commented 9 years ago

Re #11:

Unlike 嘅, 着 is not exclusively used in Hong Kong because it is not only 
used in Cantonese.

But the underlying problem is the same: different glyphs are used for the same 
component in different words in the Traditional Chinese version of the font. The 
glyph of the upper component of 着 is used consistently across words like 差, 
羌, 羞 in the Japanese and Korean versions. There is no controversy in the J/K 
fonts. Simplified Chinese tries to save one stroke, so it modified how the 
component is written, but it is still consistent in the aforementioned words. 
So there is also no problem in the SC font.

However, for the Traditional Chinese version, the SC version of the glyph is 
used for 着. So it appears that the upper component of 差羌羞 is different 
from that of 着. But that isn't true. Proof:

a) Table of Basic Components for Song Style (Print Style) Chinese Font in Hong 
Kong [1]
b) Education Bureau of Hong Kong [2]

So, the glyph difference makes it fail to conform to HK standard.

And the underlying cause of this problem appears to be the same (as that of 
嘅): while 著 and 着 are treated as different words in HK, TW chose 著 as the 
standard form. Thus 着 isn't in the Big5 character set, and MoE didn't care 
about how it should be rendered in Unicode. 

Therefore, I believe that the glyph of 着 in TC should be modified to follow 
the glyph in J/K.

[1] 
http://www.ogcio.gov.hk/tc/business/tech_promotion/ccli/terms/doc/c_gsect4.pdf
[2] 
http://www.edbchinese.hk/lexlist_en/result.jsp?id=2757&sortBy=stroke&jpC=lshk

Original comment by e.ta...@gmail.com on 23 Jul 2014 at 3:48

Attachments:

GoogleCodeExporter commented 9 years ago

I would like to add that the Google search results for "着" on .tw websites may 
not reflect how frequently it is used in Taiwan. I checked the search results on 
the first page, and found that only 2 of them (out of 10) are really websites 
from Taiwan. Of these two, one result comes from a book store showing the title 
of a Simplified Chinese book. The only genuine Taiwan website using the word "着" 
is Apple Daily Taiwan [1]. And I think it is a very special case because the 
word 着 is used in the song name "你敢有聽着咱的歌", which is not a 
Mandarin song but a song in "台語" [2] (台語 has its own written forms, 
different from the standard). 

And even if people in Taiwan do use it, that does not change the fact that the 
component of 差羌羞着 should be the same. I doubt that people in Taiwan would 
agree to write 着 in the Simplified Chinese way.

[1] http://www.appledaily.com.tw/realtimenews/article/new/20130803/236328/

[2] http://en.wikipedia.org/wiki/Taiwanese_Hokkien

Original comment by e.ta...@gmail.com on 23 Jul 2014 at 4:03

GoogleCodeExporter commented 9 years ago

Regarding 嘅, the glyph for Traditional Chinese should be changed to Noto's 
Simplified Chinese glyph (i.e., 口 + 既).

Regarding 着, it sounds like a bug should be reported against Taiwan MoE 
instead. pinyeh, kcwu, please confirm.

Original comment by xian...@google.com on 7 Aug 2014 at 9:22

GoogleCodeExporter commented 9 years ago

With regard to 着 (U+7740), its Traditional Chinese (Taiwan) source is CNS 
11643 Plane 3 0x3757, and the representative glyph in the 1992 and 2007 
versions agree, and are what we are currently using for Traditional Chinese. 
When I consult the Taiwan MOE glyph standards, this character's index is 
408938, and is tucked away among the variants (異體字), and agrees with CNS 
11643. To me, its form seems intentional. But, because it is outside the scope 
of Big Five, but in Hong Kong SCS (0xFED3), the form that is being requested is 
the same as the Japanese form, so a remapping can easily take care of this.
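
As a rough illustration of what such a remapping amounts to, here is a Python sketch over plain dictionaries. It does not use the real font's tables: the glyph names are placeholders, the `CMAPS` / `OVERRIDES` structures and the function `remapped` are hypothetical, and the override for U+5605 reflects the change proposed earlier in this thread.

```python
# Per-language character-to-glyph maps with placeholder glyph names.
CMAPS = {
    "TC": {0x5605: "u5605-T", 0x7740: "u7740-T"},
    "SC": {0x5605: "u5605-G", 0x7740: "u7740-G"},
    "JP": {0x5605: "u5605-J", 0x7740: "u7740-J"},
}

# Code point -> language whose existing glyph the TC font should reuse.
OVERRIDES = {0x5605: "SC", 0x7740: "JP"}

def remapped(cmaps, target, overrides):
    """Return a copy of the target cmap with the overrides applied."""
    new_cmap = dict(cmaps[target])
    for codepoint, donor in overrides.items():
        new_cmap[codepoint] = cmaps[donor][codepoint]
    return new_cmap

tc = remapped(CMAPS, "TC", OVERRIDES)
print({f"U+{cp:04X}": name for cp, name in tc.items()})
```

No new glyph is drawn in this scheme; the TC map simply points at a glyph that already exists for another language, so the glyph count does not grow.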

Original comment by ken.lu...@gmail.com on 11 Aug 2014 at 10:36

GoogleCodeExporter commented 9 years ago

Original comment by xian...@google.com on 5 Sep 2014 at 9:14

Changed state: Fixed
