Consolidation of Additional Glyph & Character Suggestions (See Issue #180)

ShikiSuen commented 9 years ago

Currently, Source Han Sans TW does not include the simplified kanji glyphs used in the PRC. However, the MOE has shown in its 「全字庫正宋體」 and 「全字庫正楷體」 that its standards are applied to such glyphs.

The downloadable fonts of 「全字庫正宋體」 and 「全字庫正楷體」 are available at this website: http://data.gov.tw/node/5961

Meanwhile, other MOE-oriented fonts also adopt such MOE-standard glyphs for PRC Simplified Chinese characters: PingFang, the default Traditional Chinese fallback font since OS X El Capitan and iOS 9; DFKai-SB, a.k.a. 標楷體, which has supported Simplified Chinese kanji glyphs since Windows Vista; MOE Sung UN; and PMingLiU, the MOE CNS11643 standard font since Windows Vista.

I created this thread to follow Ken Lunde's slogan in issue #99, so that people can discuss this topic here (instead of in issue #99) before Ken Lunde makes his final decision about it.

ShikiSuen commented 9 years ago

Here are my opinions:

As a grown-up mainland PRC passport owner, I feel that the simplified kanji glyphs used in the PRC are better designed in 「全字庫正宋體」 and 「全字庫正楷體」, since they are easier and faster to write. This also makes each glyph look more distinctive.

Update: PingFang is manufactured by DynaComware only.

ShikiSuen commented 9 years ago

(carbon copy sent to @jimmymasaru .)

kenlunde commented 9 years ago

PingFang is a Pan-Chinese typeface family that does not use region-specific subsets, which means that the Simplified and Traditional Chinese fonts have the same Unicode coverage. The character in question, U+604B (恋), is in CNS 11643 Plane 3, but the scope of Traditional Chinese in SHS is capped at Big Five Levels 1 and 2, which are equivalent to CNS 11643 Planes 1 and 2.

ShikiSuen commented 9 years ago

Reference: https://zh.wikipedia.org/zh-hant/%E5%A4%A7%E4%BA%94%E7%A2%BC

Based on this reference, some glyphs are not included in Big5. Some of them are: 蟎綫綉滙栢峯頴邨着双啓. However, some of them are still used in Taiwan today even though they are not in Big5, such as "峯" (used in the name of the Taiwanese singer and songwriter "吳青峯"), "栢" (in the name of the Hong Kong movie star "張栢芝"), "邨", and "啓" (Traditional Chinese version of C&C Red Alert 2, "天啓坦克" = "Apocalypse tank").

I am not familiar with this, since I never use fonts that support only Big5 glyphs. Thus, I cannot tell whether SHS TW (Region-Specific Release) supports them or not.

tamcy commented 9 years ago

Unlike 恋, the characters listed in the table above (蟎綫綉滙栢峯頴邨着双啓) are all covered by HKSCS-2008, which should imply that they are also covered by Source Han Sans TW.

kenlunde commented 9 years ago

The scope of Source Han Sans TW, which is a subset of the 65,535-glyph glyph set, is Big Five + Hong Kong SCS (in terms of code points for hanzi). The best work-around is to simply use Source Han Sans TC, which includes all 65,535 glyphs, and thus has the maximum coverage of code points. (This suggestion is separate from having a glyph that is appropriate for TW.)

In terms of actually extending TW coverage, which means using glyphs that are appropriate for TW use (and thus follow MOE guidelines), the issue is of scope. Big Five is used because it represents the most common hanzi in use, and the problem that we will run into is the lack of available CIDs.

In any case, I am now working on the plan and scope for Version 2.000, and am taking all of this into consideration, though the highest priority is proper Hong Kong support.

kenlunde commented 9 years ago

I am consolidating Issue #118 (for adding a CN glyph for Extension E U+2C386, 𬎆, ⿰王莹) here.

kenlunde commented 9 years ago

I am consolidating Issue #121 (for adding glyphs for U+9FD1 through U+9FE9 and U+2B7F7) here. I am adding a note on 2017-02-21 to indicate that U+2B7F7 𫟷 is the Simplified Chinese name of Element 116, which is an outlier in terms of covering all of the elements.

extc commented 9 years ago

Adobe-CNS1-6 is the CMap standard for Traditional Chinese OpenType CID fonts. Its basis was BIG5, the extended characters from DynaLab and Monotype, GCCS, HKSCS-1999, HKSCS-2001, HKSCS-2004, and HKSCS-2008. Ken Lunde has released PDFs of CNS11643 Planes 1 to 7 and 15 (1992 standard) at ftp://ftp.oreilly.com/examples/cjkvinfo/AppG/

BIG5 is a very old standard (1984). CNS11643 now includes 107171 characters. I know it is not possible to include all of them, as only 86655 characters are mapped to Unicode. But CNS11643-1992 was a de facto standard. It was implemented as the EUC-TW encoding in UNIX terminals. Also, all EUC-TW characters have already been mapped to Unicode. The Adobe CMap should at least include all the characters in the CNS11643-1992 version so as to reflect its name. The development of Source Han Sans depends on CIDs. Therefore I think Adobe should update the Adobe-CNS1 CMap in parallel with the development of Source Han Sans.

kenlunde commented 9 years ago

@extc: I read and re-read your note above, and am still at a loss as to what you are requesting, but perhaps what I wrote below may help.

Adobe-CNS1-6 is cumulative, meaning that glyphs are added incrementally, so there is a diachronic effect. Supplement 0 supported only Big Five and the ETen extensions, but the /Ordering was set to CNS1 because of Big Five's relationship to CNS 11643, and such a name opened the possibility of extending the glyph set to cover additional CNS 11643 planes. Supplement 1 added support for Hong Kong GCCS and the Hong Kong extensions from DynaComware (Dynalab) and Monotype Imaging (Monotype). Supplement 2 simply added pre-rotated versions of non–full-width glyphs that are accessible via the (now-deprecated) 'vrt2' GSUB feature. Supplements 3 through 6 were for supporting the 1999, 2001, 2004, and 2008 versions of Hong Kong SCS, respectively.

The PDFs from CJKV Information Processing (First Edition) were made by using an experimental Adobe-CNS2-0 glyph set whose purpose was to simply show all characters in CNS 11643-1992, along with Plane 15.

Although CNS 11643 is large, and has expanded beyond the 1992 version, it is not nearly as interesting as the CJK Unified Ideographs in Unicode, meaning the URO and Extensions A through E. The latter has excellent interchange, but the former has very poor interchange. CNS standards are also quite messy, and provide little or no metadata, such as dictionary mappings or other ways to verify a character's meaning or shape.

fei0316 commented 9 years ago

Add the character 𧒽(U+274BD) #133

The character, although not in any of the supported standards, should be added. It is used in the name of a Guangzhou Metro station (𧒽岗站), in the name of a park near that station (𧒽岗公园), and as the name of a type of seafood produced in that area. The character is supported by the MingLiU_HKSCS-ExtB font, and it is displayed properly on OS X 10.10 Yosemite and Windows 10 by default. This character was proposed for inclusion in 通用规范汉字表, but was later removed. Documents, banners, and websites that need this character usually write it as 「虫雷」 or 「礌」. People have also reported problems finding that station in mobile phone apps, and maps showing the station or the park have to substitute other characters for the unsupported one. As the goal of this font is to maximize compatibility, adding this character would benefit a lot of people, considering that all Android devices running Android 5.0 or above use this font. Reference: https://zh.wikipedia.org/wiki/%F0%A7%92%BD%E5%B4%97%E7%AB%99 https://zh.wikipedia.org/wiki/%E9%BB%83%E6%B2%99%E8%9C%86 http://news.sina.com.cn/o/2014-07-24/142430572774.shtml http://baike.baidu.com/view/4731307.htm https://www.google.com/maps/place/%E7%A4%8C%E5%B2%97/@23.0442069,113.1465266,16z/data=!4m5!1m2!2m1!1z6Jmr6Zu35bKX5YWs5ZyS!3m1!1s0x0000000000000000:0x84aea54ce06ea2e9

kenlunde commented 8 years ago

Consider adding (KR) glyphs for U+200D7 𠃗, U+2042D 𠐭, U+224E1 𢓡, and U+23D18 𣴘. The last three are used in traditional Korean musical notation.

hfhchan commented 8 years ago

The HKSCS 2015 update is redefining some mappings from big5 to ucs. Would that affect character coverage, especially the full-width symbols?

kenlunde commented 8 years ago

@hfhchan: With regard to Hong Kong support, we sort of have a fresh slate, because to date the project does not include any Hong Kong font resources. This effectively means that accommodating mapping changes should not be problematic.

kenlunde commented 8 years ago

New CN glyphs for U+35F4 (㗴) and U+6D73 (浳), uni35F4-CN and uni6D73-CN, need to be added.

kenlunde commented 8 years ago

Consider adding KR glyphs for Extension B characters 𪓟 (U+2A4DF) and 𣖄 (U+23584) per Issue #136.

kenlunde commented 8 years ago

Per Issue #137, VN (Chữ nôm) glyphs will be supported when Extension B and beyond are supported in their entirety.

hfhchan commented 8 years ago

Is "𠻹" (H-9E77) supported? It doesn't show up correctly using Noto Sans TC (http://fonts.googleapis.com/earlyaccess/notosanstc.css) on hk01.com (the character is rendered with SimSun-ExtB instead in both MSEdge and Chrome).

Edit: Nor does 䮎 (H-92D7). On the other hand, 罉 (H-9DD1) displays correctly. 𦉘 (H-9DBC) doesn't.

kenlunde commented 8 years ago

@hfhchan: 𠻹 (Extension B U+20EF9; CID+59693) is supported by Source Han Sans / Noto Sans CJK, and is also included in the region-specific subset OTFs for Traditional Chinese (which are the fonts that are referenced in that CSS file). I inspected one of the OTFs that is referenced by the CSS file, and it has been further subsetted, and includes only 9,876 glyphs, and only three characters outside the BMP are supported: U+210C1, U+24A12, and U+25683. This is therefore a question to pose to Google.

kenlunde commented 8 years ago

罉 (URO U+7F49; CID+32230) is among the 9,876 glyphs in the OTFs that are referenced by that CSS file. 䮎 (Extension A U+4B8E; CID+9231) and 𦉘 (Extension B U+26258; CID+60806), on the other hand, are not. The glyphs for all three characters are in the official region-specific subset OTFs for Traditional Chinese. I recommend that you ask Google here.
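
The block names used in the two comments above (URO, Extension A, Extension B) follow directly from each character's code point. As a minimal sketch, assuming only the published Unicode block ranges (the helper names `cjk_block` and `is_outside_bmp` are made up for illustration and are not part of any Source Han Sans tooling), the classification could be checked like this:

```python
# Code point ranges of the CJK Unified Ideographs blocks named in this thread.
CJK_BLOCKS = [
    ("URO", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
]


def cjk_block(code_point: int) -> str:
    """Return the name of the CJK block containing code_point, or 'other'."""
    for name, first, last in CJK_BLOCKS:
        if first <= code_point <= last:
            return name
    return "other"


def is_outside_bmp(code_point: int) -> bool:
    """True for code points beyond the Basic Multilingual Plane (U+FFFF)."""
    return code_point > 0xFFFF


# The four characters discussed in the comments above.
for cp in (0x7F49, 0x4B8E, 0x26258, 0x20EF9):
    print(f"U+{cp:04X} {chr(cp)}: {cjk_block(cp)}, outside BMP: {is_outside_bmp(cp)}")
```

This only identifies the block; whether a given font actually contains a glyph for the code point has to be checked against the font itself.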

tamcy commented 8 years ago

As indicated at https://www.google.com/fonts/earlyaccess:

Noto Sans TC has been subsetted to the most frequent 7,800 Chinese characters in Traditional Chinese documents. 223 characters are added to cover all the characters in Taiwan's CNS 11643 P1 and 常用國字標準字體表 as well as Hong Kong's 常用字字形表 and IRG HB0 and HB1. In addition to Hanzi, Bopomofo, CJK Radicals, ASCII, punctuation marks and full-width characters are included.

Noto Sans TC does not include the full version of the font. As Google Fonts doesn't employ dynamic subsetting or similar technology, they have to trim the number of characters down to achieve a reasonable download size.

kenlunde commented 8 years ago

@tamcy: Thank you for finding the explanation. For those who want to use the region-specific subset Source Han Sans OTFs as webfonts, Typekit offers them via dynamic augmentation (these are referred to as "dynamic kits"). One significant advantage of dynamic augmentation over other webfont approaches that involve subsetting is that typographic features (GSUB and GPOS), along with UVSes (Unicode Variation Sequences), are fully supported. This 2015-06-16 article demonstrates such functionality (although Chrome is currently broken in terms of the 'ccmp' and ljmo+vjmo+tjmo features; the fix has not yet propagated to a released version).

A special version of Source Han Sans (with all 65,535 glyphs) has been served to the CJK Type Blog via Typekit since last June.

CrotchBurnt commented 8 years ago

Please add U+2B689 (𫚉, ⿰鱼工)

hfhchan commented 8 years ago

魟, a type of stingray worth up to NT$ 100,000 even when young. Quite common character in traditional Chinese. The baidu article uses the traditional character 魟 instead.

kenlunde commented 8 years ago

Noted. Thanks.

glll4678 commented 8 years ago

Please add Taiwanese Southern Min and Taiwanese Hakka characters.

https://bobbytung.github.io/TaigiHakkaIdeograph/

kenlunde commented 8 years ago

@glll4678: You need to be a bit more clear with regard to your request, and the data has serious problems.

About more clarity, is the request about code point coverage? What about regional style? Is Taiwan MOE acceptable? The representative glyphs aren't particularly helpful, as they come from a mixture of styles, one of which is Japanese, which is clearly not appropriate.

About the problems, the characters that are mapped from PUA code points are non-starters. The last thing that a Pan-CJK font needs are glyphs that are mapped from PUA code points. Looking at one of them, U+E707, it is in Extension E at U+2C44E. Someone, hopefully not me, needs to thoroughly check all PUA and unencoded glyphs against all extensions, including F (Unicode 10.0) and drafts of Extension G. Any character not found should be considered for a future extension. If TCA (aka Taiwan) is unwilling to submit them, the UTC could do so on behalf of the user community.
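
A first pass at the check described above is to flag which code points in a list are PUA at all, before looking each one up in the extensions. The following is a minimal sketch that relies on the Unicode general category "Co" (private use) from Python's standard `unicodedata` module; the function names are hypothetical, and finding the encoded equivalent of a flagged glyph (such as U+2C44E for U+E707) would still have to be done by hand or against a separate table:

```python
import unicodedata


def is_private_use(code_point: int) -> bool:
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"


def find_pua(text: str) -> list:
    """Return the PUA code points found in text, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text if is_private_use(ord(ch))]


# U+E707 is a PUA code point; U+2C44E is its encoded Extension E equivalent.
print(find_pua("\uE707\U0002C44E"))
```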

kenlunde commented 8 years ago

@glll4678: With regard to the four frequently-used characters in that list that lack Unicode values, two are actually in Extension D (Unicode Version 6.0) as U+2B75E with a T-Source (TB-732C) and U+2B7CA with a T-Source (TB-732D), and another is in Extension G (in process) with a U-Source (UTC-02663). The fourth one is not yet encoded, and not in any proposal.

glll4678 commented 8 years ago

https://docs.google.com/spreadsheets/d/1RPN_4sVAYlHaMchak1wwf2dpNKRcLil7WFHkB1ZGVuE

I've checked all PUA and unencoded glyphs (up to Extension G).

kenlunde commented 8 years ago

@glll4678 Thank you for the table. When it comes time to consider supporting these characters, this resource will be useful.

acuteaccent commented 8 years ago

Source Han Sans covers all the characters in the Bopomofo Extended block (U+31A0 – U+31BF), which are used in Southern Min (Min Nan) and Hakka. However, it does not support the tone marks used in those languages. The tone marks are:

˪ U+02EA MODIFIER LETTER YIN DEPARTING TONE MARK
˫ U+02EB MODIFIER LETTER YANG DEPARTING TONE MARK

I suggest adding glyphs for these two characters.

kenlunde commented 8 years ago

@acuteaccent: Noted.

acuteaccent commented 8 years ago

I suggest adding glyphs for the following four sequences:

ê̄ U+00EA U+0304
ê̌ U+00EA U+030C
Ê̄ U+00CA U+0304
Ê̌ U+00CA U+030C

ê/Ê is a vowel letter used in Hanyu Pinyin (corresponds to ㄝ in Bopomofo), and a tone mark can be attached above it like other vowel letters in Hanyu Pinyin (a/A, e/E, i/I, o/O, u/U, and ü/Ü; ā/Ā, á/Á, ǎ/Ǎ, à/À, ē/Ē, é/É, ě/Ě, è/È, ī/Ī, í/Í, ǐ/Ǐ, ì/Ì, ō/Ō, ó/Ó, ǒ/Ǒ, ò/Ò, ū/Ū, ú/Ú, ǔ/Ǔ, ù/Ù, ǖ/Ǖ, ǘ/Ǘ, ǚ/Ǚ, and ǜ/Ǜ).

Attaching tone marks to ê/Ê yields the following: ê̄/Ê̄, ế/Ế, ê̌/Ê̌, and ề/Ề. Source Han Sans already covers ế/Ế and ề/Ề, as they are also used in Vietnamese. The ones that are not covered by Source Han Sans are ê̄/Ê̄ and ê̌/Ê̌.

Here are some actual usages of those sequences.

In Standard Chinese, the character 欸 has the following pronunciations: āi, ǎi, ê̄, ế, ê̌, ề.
HKSCS has assigned separate code points for those four sequences (ê̄: 0x88A3, ê̌: 0x88A5, Ê̄: 0x8862, Ê̌: 0x8864).
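
The difference between the covered and uncovered tone-marked forms shows up under Unicode normalization: ế/Ế and ề/Ề have precomposed code points, whereas ê̄/Ê̄ and ê̌/Ê̌ exist only as two-code-point sequences. A minimal sketch using Python's standard `unicodedata` module (the helper name `nfc_code_points` is made up for illustration):

```python
import unicodedata


def nfc_code_points(sequence: str) -> list:
    """Normalize to NFC and return the resulting code points as U+XXXX."""
    composed = unicodedata.normalize("NFC", sequence)
    return [f"U+{ord(ch):04X}" for ch in composed]


# Combining macron, acute, caron, grave: the four Hanyu Pinyin tone marks.
TONE_MARKS = (0x0304, 0x0301, 0x030C, 0x0300)

for base in ("\u00EA", "\u00CA"):  # ê and Ê
    for mark in TONE_MARKS:
        result = nfc_code_points(base + chr(mark))
        kind = "precomposed" if len(result) == 1 else "sequence only"
        print(f"U+{ord(base):04X} U+{mark:04X} -> {' '.join(result)} ({kind})")
```

Because the macron and caron forms never compose to a single code point, a font has to handle them either through dedicated glyphs reached via a GSUB feature such as 'ccmp' or through mark positioning.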

kenlunde commented 8 years ago

@acuteaccent: Thank you.

drott commented 8 years ago

Please consider adding third-width and quarter-width glyphs for the digits 0-10, compare https://github.com/googlei18n/noto-cjk/issues/76

kenlunde commented 8 years ago

Noted, and thank you. (I think that you meant 0 through 9, not 0 through 10.)

acuteaccent commented 8 years ago

I suggest adding glyphs for the following two characters:

ʻ U+02BB MODIFIER LETTER TURNED COMMA
⁴ U+2074 SUPERSCRIPT FOUR

These are used in Wade-Giles. The first one is used for strongly aspirated consonants, and the second one is used for the fourth tone. For example, the syllable written "kàn" in Hanyu Pinyin is written "kʻan⁴" in Wade-Giles.

Wade-Giles used to be a popular Chinese romanization system in the 20th century (especially before the 1980s). Even though Wade-Giles is largely replaced by Hanyu Pinyin today, a lot of documents written in the 20th century are in Wade-Giles (and some scholars still use Wade-Giles). In order to display documents written in Wade-Giles properly (for example, when digitizing documents written in the 20th century), the characters used in Wade-Giles need to be supported.

For U+02BB, you don't need a new glyph; mapping the glyph for U+2018 ‘ would be good enough. (Some fonts need two separate glyphs for U+02BB and U+2018, and some only need a single glyph for both of them; Source Han Sans is the latter.)

acuteaccent commented 8 years ago

Are you planning to add glyphs for the following characters?

U+1F10B 🄋
U+1F10C 🄌
U+1F19B 🆛 – U+1F1AC 🆬
U+1F23B 🈻

frankrolf commented 8 years ago

U+24EA ⓪ (CIRCLED DIGIT ZERO) and U+24FF ⓿ (NEGATIVE CIRCLED DIGIT ZERO) are already supported.

U+1F10B 🄋 (DINGBAT CIRCLED SANS-SERIF DIGIT ZERO) U+1F10C 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) would likely just be duplicates of the original two – perhaps they could be double-mapped?

kenlunde commented 8 years ago

@acuteaccent: These have been on my Version 2.000 list for some time, and as Frank mentioned, that list specifies that U+1F10B and U+1F10C will be handled as double mappings. Also, U+312E, U+312F, U+9FD1 through U+9FEA, and U+1F12F are on the same list.

jungshik commented 7 years ago

from: googlei18n/noto-cjk#80

I compared the character repertoire of Noto Sans CJK 1.004 against the list of characters allowed for South Korean family registry and found that 47 characters are missing.

The list is kr_names_missing_in_noto_sans.txt

The 1st column is Korean reading in Hangul. The second column is a Unicode code point. The 3rd is a character.

kenlunde commented 7 years ago

@jungshik: Thank you. I count 48 characters in your list, not 47, but U+23343 𣍃 appears twice, making it actually 47.
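
The 48-versus-47 discrepancy is the difference between counting rows and counting distinct code points. A minimal sketch of that count, assuming the three-column layout described above (reading, code point, character); the sample rows are stand-ins invented for illustration, with "?" in place of the Hangul reading, and only U+23343 is taken from the actual list:

```python
def count_entries(text: str) -> tuple:
    """Return (row count, distinct code point count) for a 3-column listing."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    code_points = [row[1] for row in rows]  # second column, e.g. "U+23343"
    return len(code_points), len(set(code_points))


# Hypothetical sample: U+23343 appears twice, as it did in the uploaded file.
sample_code_points = [0x200D7, 0x23343, 0x2042D, 0x23343]
sample_text = "\n".join(f"? U+{cp:04X} {chr(cp)}" for cp in sample_code_points)

total, unique = count_entries(sample_text)
print(f"{total} rows, {unique} distinct code points")
```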

jungshik commented 7 years ago

@kenlunde Yes, that's why I said there are 47 characters :-) (I should have deleted the 2nd line with U+23343 before uploading).

acuteaccent commented 7 years ago

@kenlunde @jungshik Well, in fact, there are indeed 48 missing, as there is unencoded ⿰氵恩 (은). If Source Han Sans is targeting all the South Korean personal name hanja, one glyph needs to be reserved for ⿰氵恩.

Also, I think this suggestion (https://github.com/googlei18n/noto-cjk/issues/80#issuecomment-278151655) is a very good idea, as no one actually uses or needs halfwidth hangul jamo. To begin with, I wonder why they are encoded in Unicode.

acuteaccent commented 7 years ago

(This is in regard to https://github.com/adobe-fonts/source-han-sans/issues/115#issuecomment-229554080)

Oh, the suggestion about U+02EA and U+02EB was already made before (https://github.com/googlei18n/noto-cjk/issues/56). As I usually don't check the Noto Sans CJK side, I was not aware of it until now. FYI, I learned about those two characters from here: http://www.unicode.org/versions/Unicode9.0.0/ch18.pdf#page=27

acuteaccent commented 7 years ago

BTW, if you are running out of glyphs, you can get rid of Œ, œ, and ƒ, as they are not used in CJKV languages (including common romanization systems). (If Œ and œ are included to cover French, then Ÿ also needs to be included.)

justinrleung commented 7 years ago

œ might be used in IPA and its derivative romanizations, like S. L. Wong (phonetic symbols). It might be useful to keep it for people who need to use IPA (e.g. when dealing with a Chinese dialect that does not have a romanization system).

acuteaccent commented 7 years ago

Well, I don't think the IPA is the reason for the inclusion of œ though. Source Han Sans does not cover most letters used in the IPA and its derivative romanizations (ɐ, ɛ, ɔ, ŋ, etc.) anyway.

acuteaccent commented 7 years ago

U+2780 ➀ to U+2789 ➉ and U+278A ➊ to U+2793 ➓ can be covered by using the glyphs at U+2460 ① to U+2469 ⑩ and the ones at U+2776 ❶ to U+277F ❿ respectively, as Source Han Sans is a sans-serif font. As this can simply be done by inserting additional code point mappings to existing glyphs, no new glyphs are needed.
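
The two ranges line up one-to-one, so the additional mappings can be generated by offset. A minimal sketch of that table (the function name is hypothetical, and how the mappings would actually be added to the font's 'cmap' table is outside the scope of this illustration):

```python
def build_double_mappings() -> dict:
    """Map each sans-serif dingbat digit to the code point whose glyph it reuses."""
    mappings = {}
    for offset in range(10):
        # U+2780..U+2789 reuse the glyphs of U+2460..U+2469 (circled 1 to 10).
        mappings[0x2780 + offset] = 0x2460 + offset
        # U+278A..U+2793 reuse the glyphs of U+2776..U+277F (negative circled 1 to 10).
        mappings[0x278A + offset] = 0x2776 + offset
    return mappings


for new_cp, existing_cp in sorted(build_double_mappings().items()):
    print(f"U+{new_cp:04X} {chr(new_cp)} -> glyph of U+{existing_cp:04X} {chr(existing_cp)}")
```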

jimmymasaru commented 7 years ago

Well, œ and other Latin letters are probably included in Adobe-Japan1-6, which is why they are included in SHS.
