Open chazeon opened 6 years ago
@chazeon Could you please provide some examples? Currently, Windows `name` entries are encoded in UTF-8; other entries use Base64, since they may contain legacy encodings.
@be5invis I don't know if you can get hold of any FounderType font; I got one from their official website. For example, FZXSSK (方正新书宋_GBK) version 1.0 has this kind of problem. I dumped `FZXSSK.TTF` to `FZXSSK.json` using

```
.\otfccdump.exe .\FZXSSK.TTF -o .\FZXSSK.json
```

with otfcc release 0.9.6, then ran Python 3 and got errors like these:
```
>>> import json
>>> with open('FZXSSK.json', 'r') as f:
...     json.load(f)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\ProgramData\Anaconda3\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 5445: illegal multibyte sequence
>>> with open('FZXSSK.json', 'r', encoding='utf8') as f:
...     json.load(f)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\ProgramData\Anaconda3\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\ProgramData\Anaconda3\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 3718: invalid start byte
>>>
```
I did not dig in this time, but from my previous experience the copyright line, at least, could cause the problem. The 0xb1 byte is the start of the GBK encoding of "北大方正...", which is `\xb1\xb1\xb4\xf3\xb7\xbd\xd5\xfd...`
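As a quick check (a minimal sketch; these byte values are the standard GBK code points for the characters, not bytes taken from the font itself):

```python
# The GBK-encoded bytes quoted above decode to the expected CJK
# string, confirming the copyright field is GBK, not UTF-8.
raw = b"\xb1\xb1\xb4\xf3\xb7\xbd\xd5\xfd"

print(raw.decode("gbk"))  # 北大方正

# The same bytes are not valid UTF-8, which is why json.load() fails.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not UTF-8:", e.reason)
```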
They are encoding CJK characters in Mac Roman encoding?
I realized I had mistaken the Mac platform for Windows. I am not familiar with Mac Roman encoding, but it seems they are doing something like that.
The dumped JSON file contains encoded strings, which can be error-prone to decode in scripting languages like Python. These encoded strings, especially those in the `name` table, are dumped directly as bytes, and they are not uniformly encoded, because when the corresponding platform is Windows they are stored in the font as encoded bytes.

Take JSON decoding in Python 3 as an example: `json.load()` and `json.loads()` accept `str` instead of `bytes`, and when we try to decode the `bytes` ourselves, the problem occurs because the JSON file can contain bytes in mixed encodings. The same issue arises when we consider JSON generation in Python 3, and most third-party JSON parsing packages in Python face the same problem. Many C/C++ JSON parsing libraries also assume JSON uses a Unicode encoding (RapidJSON, for example), yet for many Chinese and Japanese fonts the dumped bytes contain Shift-JIS and GBK. The decision made by these libraries is reasonable, because according to RFC 7159, section 8.1, JSON text shall be encoded in UTF-8, UTF-16, or UTF-32. Similar wording appears in ECMA-404 as well.
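A minimal reproduction of the failure mode (a sketch; the key name `copyright` is made up for illustration, not otfcc's actual field name):

```python
import json

# Build a JSON document whose string value is raw GBK bytes,
# mimicking what otfccdump writes for legacy-encoded name entries.
payload = b'{"copyright": "' + "北大方正".encode("gbk") + b'"}'

# json.loads() (Python 3.6+) accepts bytes, but only in a Unicode
# encoding, so the embedded GBK bytes raise UnicodeDecodeError.
try:
    json.loads(payload)
except UnicodeDecodeError as e:
    print("decode failed:", e.reason)
```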
I could certainly write a JSON parsing package based on YAJL's tokenizer and its corresponding JSON parser, one that uses `bytes` as its sole data-exchange type even in Python 3, and I am actually working on one. However, I hope there could be a more elegant solution: for example, when dumping JSON, apply an additional Base64 encoding, or decode the strings and re-encode them as Unicode before dumping. Providing a manipulation API for other languages would also be preferable.

I hope this information is self-contained and helps you understand the problem.
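Until the dump format changes, one consumer-side workaround is to parse the file as Latin-1 (which maps every byte 0x00–0xFF to a code point one-to-one) and then re-decode the affected strings with the right legacy codec. A sketch under those assumptions; `recover_gbk` is a hypothetical helper, and in a real font the codec would have to be chosen from the record's platform and encoding IDs rather than hard-coded:

```python
import json

def recover_gbk(value: str) -> str:
    """Round-trip a Latin-1-smuggled string back to bytes and
    decode it as GBK; fall back to the original on failure."""
    raw = value.encode("latin-1")
    try:
        return raw.decode("gbk")
    except UnicodeDecodeError:
        return value

# Simulate a dumped file whose name string is raw GBK bytes.
dumped = b'{"nameString": "' + "北大方正".encode("gbk") + b'"}'

# Latin-1 decoding never fails, so the mixed-encoding file parses...
font = json.loads(dumped.decode("latin-1"))

# ...and the legacy bytes can then be repaired per string.
print(recover_gbk(font["nameString"]))  # 北大方正
```

This is still fragile: a legacy trail byte equal to `\` (0x5C) or a control byte would corrupt the JSON syntax itself, which is why Base64 or pre-decoding at dump time would be the cleaner fix.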