foliojs / fontkit

An advanced font engine for Node and the browser

encoding problems after dropping iconv-lite #284

Closed sohobloo closed 2 years ago

sohobloo commented 2 years ago

Since fontkit dropped iconv-lite, the hz-gb-2312 encoding no longer works. It is the last entry in these lines: https://github.com/foliojs/fontkit/blob/master/src/encodings.js#L93

Strangely, the W3C standard does list this encoding, and the Web API has it too. But this code raises the error "The "hz-gb-2312" encoding is not supported" for me:

new TextDecoder('hz-gb-2312')

Going by my own experience, I changed hz-gb-2312 to gb2312 instead, and it works as expected. I don't know whether other encodings have similar problems.
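A minimal sketch of how the behaviour described above can be checked for any single label. `resolveLabel` is a hypothetical helper, not part of fontkit; it assumes the runtime throws a `RangeError` for labels it rejects, which is what the Encoding Standard specifies for the `TextDecoder` constructor. Which labels are accepted depends on the runtime's encoding support.

```javascript
// Hypothetical helper (not a fontkit API): returns the canonical encoding
// name that TextDecoder resolves `label` to, or null if the label is rejected.
function resolveLabel(label) {
  try {
    return new TextDecoder(label).encoding;
  } catch (err) {
    // The constructor throws RangeError for unknown or "replacement" labels.
    if (err instanceof RangeError) return null;
    throw err;
  }
}

// hz-gb-2312 is rejected, as reported in this issue.
console.log(resolveLabel('hz-gb-2312'));

// On runtimes with full encoding support this should resolve to gbk;
// a runtime built with reduced encoding data may return null instead.
console.log(resolveLabel('gb2312'));
```

Returning `null` rather than rethrowing lets a caller fall back to another label without a try/catch at every call site.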

devongovett commented 2 years ago

Interesting. Looks like gb2312 is an alias for gbk, which is a superset of gb2312: https://en.wikipedia.org/wiki/GBK_(character_encoding). Feel free to open a PR to change it.

sohobloo commented 2 years ago

OK, I'll try all the encodings listed in this file and find the compatible TextDecoder encoding for each.
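The check proposed here could be automated roughly as follows. This is only a sketch: `partitionLabels` is a hypothetical name, and the label list is a hand-picked sample of encoding names mentioned in this thread, not the full contents of encodings.js. Results vary with the runtime's encoding support, so no output is shown.

```javascript
// Hypothetical helper: split a list of encoding names into those that
// TextDecoder accepts (with the canonical name it resolves to) and those
// it rejects.
function partitionLabels(labels) {
  const supported = new Map();
  const unsupported = [];
  for (const label of labels) {
    try {
      supported.set(label, new TextDecoder(label).encoding);
    } catch (err) {
      if (!(err instanceof RangeError)) throw err;
      unsupported.push(label);
    }
  }
  return { supported, unsupported };
}

// Sample of names from this thread (assumption: not the complete list).
const sample = [
  'utf16be', 'x-mac-roman', 'shift-jis', 'big5', 'euc-kr',
  'iso-8859-6', 'iso-8859-8', 'x-mac-cyrillic', 'iso-8859-11',
  'hz-gb-2312', 'gb18030', 'johab',
];

const { supported, unsupported } = partitionLabels(sample);
for (const [label, canonical] of supported) {
  console.log(`${label} -> ${canonical}`);
}
console.log('rejected:', unsupported.join(', '));
```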

sohobloo commented 2 years ago

I did some research and corrected the encodings I'm sure about.

PR:

https://github.com/foliojs/fontkit/pull/285

refs:

https://docs.microsoft.com/en-us/typography/opentype/spec/cmap

https://docs.microsoft.com/en-us/typography/opentype/spec/name

https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6name.html

https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/encoding

https://www.w3.org/International/docs/encoding

https://encoding.spec.whatwg.org/

http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/

https://wutils.com/encodings/

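Research table

| Platform ID | Encoding ID | Description (Apple) | Fontkit 2.0.2 encoding | TextDecoder.encoding (W3) | Remarks |
| --- | --- | --- | --- | --- | --- |
| 0 (Unicode) | 0 | Unicode 1.0 semantics (deprecated) | utf16be | utf-16be | |
| 0 (Unicode) | 1 | Unicode 1.1 semantics (deprecated) | utf16be | utf-16be | |
| 0 (Unicode) | 2 | ISO/IEC 10646 semantics (deprecated) | utf16be | utf-16be | |
| 0 (Unicode) | 3 | Unicode 2.0 and onwards semantics, Unicode BMP only | utf16be | utf-16be | |
| 0 (Unicode) | 4 | Unicode 2.0 and onwards semantics, Unicode full repertoire | utf16be | utf-16be | |
| 0 (Unicode) | 5 | Unicode Variation Sequences, for use with subtable format 14 | utf16be | utf-16be | |
| 0 (Unicode) | 6 | Unicode full repertoire, for use with subtable format 13 | NOT FOUND | utf-16be | |
| 1 (Macintosh) | 0 | Roman | x-mac-roman | x-mac-roman / macintosh | |
| 1 (Macintosh) | 1 | Japanese | shift-jis | shift-jis / shift_jis | |
| 1 (Macintosh) | 2 | Chinese (Traditional) | big5 | big5 | |
| 1 (Macintosh) | 3 | Korean | euc-kr | euc-kr | |
| 1 (Macintosh) | 4 | Arabic | iso-8859-6 | iso-8859-6 | |
| 1 (Macintosh) | 5 | Hebrew | iso-8859-8 | iso-8859-8 | |
| 1 (Macintosh) | 6 | Greek | x-mac-greek | UNSURE (mapping) | Seems iso-8859-7 is for Greek |
| 1 (Macintosh) | 7 | Russian | x-mac-cyrillic | x-mac-cyrillic | |
| 1 (Macintosh) | 8 | RSymbol | x-mac-symbol | UNSURE | |
| 1 (Macintosh) | 9 | Devanagari | x-mac-devanagari | UNSURE | IS 13194:1991 (ISCII-91); x-iscii-de |
| 1 (Macintosh) | 10 | Gurmukhi | x-mac-gurmukhi | UNSURE | IS 13194:1991 (ISCII-91) |
| 1 (Macintosh) | 11 | Gujarati | x-mac-gujarati | UNSURE | x-iscii-gu; IS 13194:1991 (ISCII-91) |
| 1 (Macintosh) | 12 | Oriya | Oriya | UNSURE | |
| 1 (Macintosh) | 13 | Bengali | Bengali | UNSURE | |
| 1 (Macintosh) | 14 | Tamil | Tamil | UNSURE | |
| 1 (Macintosh) | 15 | Telugu | Telugu | UNSURE | |
| 1 (Macintosh) | 16 | Kannada | Kannada | UNSURE | |
| 1 (Macintosh) | 17 | Malayalam | Malayalam | UNSURE | |
| 1 (Macintosh) | 18 | Sinhalese | Sinhalese | UNSURE | |
| 1 (Macintosh) | 19 | Burmese | Burmese | UNSURE | |
| 1 (Macintosh) | 20 | Khmer | Khmer | UNSURE | |
| 1 (Macintosh) | 21 | Thai | iso-8859-11 | iso-8859-11 / windows-874 | tis-620 |
| 1 (Macintosh) | 22 | Laotian | Laotian | UNSURE | |
| 1 (Macintosh) | 23 | Georgian | Georgian | UNSURE | |
| 1 (Macintosh) | 24 | Armenian | Armenian | UNSURE | |
| 1 (Macintosh) | 25 | Chinese (Simplified) | hz-gb-2312 | gbk / gb2312 | euc-cn (gb2312); the hz-gb-2312 label for TextDecoder is marked as replacement! |
| 1 (Macintosh) | 26 | Tibetan | Tibetan | UNSURE | Tibetan |
| 1 (Macintosh) | 27 | Mongolian | Mongolian | UNSURE | |
| 1 (Macintosh) | 28 | Geez | Geez | UNSURE | Inuit; there is an x-mac-inuit mapping in Fontkit. |
| 1 (Macintosh) | 29 | Slavic | x-mac-ce | UNSURE (mapping) | |
| 1 (Macintosh) | 30 | Vietnamese | Vietnamese | UNSURE | |
| 1 (Macintosh) | 31 | Sindhi | Sindhi | | |
| 1 (Macintosh) | 32 | Uninterpreted | NOT FOUND | UNSURE | |
| 2 (ISO) | 0 | 7-bit ASCII | ascii | ascii | |
| 2 (ISO) | 1 | ISO 10646 | NOT FOUND | UNSURE | |
| 2 (ISO) | 2 | ISO 8859-1 | NOT FOUND | iso-8859-1 / ascii / windows-1252 | |
| 3 (Windows) | 0 | Symbol | symbol | UNSURE | |
| 3 (Windows) | 1 | Unicode BMP | utf16be | utf-16be | |
| 3 (Windows) | 2 | ShiftJIS | shift-jis | shift-jis / shift_jis | |
| 3 (Windows) | 3 | PRC | gb18030 | gb18030 | |
| 3 (Windows) | 4 | Big5 | big5 | big5 | |
| 3 (Windows) | 5 | Wansung | x-cp20949 | euc-kr | KS X 1001 |
| 3 (Windows) | 6 | Johab | johab | UNSURE | Windows Code Page is 1361. Is euc-kr the only Korean encoding available in TextDecoder? |
| 3 (Windows) | 7 | Reserved | null | | |
| 3 (Windows) | 8 | Reserved | null | | |
| 3 (Windows) | 9 | Reserved | null | | |
| 3 (Windows) | 10 | Unicode full repertoire | utf16be | utf-16be | |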
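
For illustration, the rows of the research table that carry a definite TextDecoder label could be expressed as a lookup like the one below. This is a sketch under assumptions, not the code in fontkit or in the PR: `LABELS` and `textDecoderLabel` are hypothetical names, only the first label listed in each row is used, and every row marked UNSURE or left empty is deliberately omitted so that the lookup returns `null` for it.

```javascript
const UTF16BE = 'utf-16be';

// Hypothetical lookup: platform ID -> encoding ID -> TextDecoder label.
// Rows marked UNSURE (or with no label) in the research table are left out.
const LABELS = {
  // Unicode platform: every encoding ID maps to utf-16be.
  0: { 0: UTF16BE, 1: UTF16BE, 2: UTF16BE, 3: UTF16BE, 4: UTF16BE, 5: UTF16BE, 6: UTF16BE },
  // Macintosh platform.
  1: {
    0: 'x-mac-roman',
    1: 'shift-jis',
    2: 'big5',
    3: 'euc-kr',
    4: 'iso-8859-6',
    5: 'iso-8859-8',
    7: 'x-mac-cyrillic',
    21: 'iso-8859-11',
    25: 'gbk', // replaces hz-gb-2312, which TextDecoder rejects
  },
  // ISO platform.
  2: { 0: 'ascii', 2: 'iso-8859-1' },
  // Windows platform.
  3: { 1: UTF16BE, 2: 'shift-jis', 3: 'gb18030', 4: 'big5', 5: 'euc-kr', 10: UTF16BE },
};

// Returns the label for a (platformID, encodingID) pair, or null when the
// table has no confident mapping for it.
function textDecoderLabel(platformID, encodingID) {
  const platform = LABELS[platformID];
  if (!platform) return null;
  return platform[encodingID] ?? null;
}

console.log(textDecoderLabel(1, 25)); // 'gbk'
console.log(textDecoderLabel(3, 6));  // null (Johab is marked UNSURE)
```

Returning `null` for the unresolved rows keeps the open questions in the table visible to the caller instead of silently guessing an encoding.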