frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

Ensure dump.to_path|zip saves UTF-8 descriptor (ATM it is ascii) #209

Closed: AcckiyGerman closed this issue 6 years ago

AcckiyGerman commented 6 years ago

the problem

"title": "мамы" becomes "title": "\u043c\u0430\u043c\u044b" after dump.to_file

the reason

Python's json.dump[s] escapes all non-ASCII characters by default, so the saved JSON contains only ASCII: https://docs.python.org/3/library/json.html#basic-usage

ensure_ascii=True
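To illustrate the default, here is a minimal sketch using only the standard json module. The `descriptor` dict is a made-up example, not taken from the library:

```python
import json

# Hypothetical descriptor with a non-ASCII title.
descriptor = {"title": "мамы"}

# Default behaviour (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps(descriptor)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(descriptor, ensure_ascii=False)

print(escaped)   # {"title": "\u043c\u0430\u043c\u044b"}
print(readable)  # {"title": "мамы"}
```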

The DPP does the same: it saves descriptors as ASCII even though the file encoding is UTF-8:
https://github.com/frictionlessdata/datapackage-py/blob/0bc5276c2acd730fd765af99dbfdecbca9b1c46d/datapackage/package.py#L249
https://github.com/frictionlessdata/datapackage-py/blob/0bc5276c2acd730fd765af99dbfdecbca9b1c46d/datapackage/package.py#L269

how it should be:

Here is the JSON standard: https://tools.ietf.org/html/rfc7159#section-8.1

8.1. Character Encoding

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
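A possible shape for such a fix, as a rough sketch only: `save_descriptor` is a hypothetical helper and not the library's actual code, and the temporary path is just for demonstration. The idea is to open the file as UTF-8 and pass ensure_ascii=False to json.dump:

```python
import io
import json
import os
import tempfile


def save_descriptor(descriptor, path):
    """Hypothetical helper: write a descriptor as human-readable UTF-8 JSON."""
    with io.open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters unescaped.
        json.dump(descriptor, f, ensure_ascii=False, indent=4)


# Demonstration with a temporary file.
path = os.path.join(tempfile.mkdtemp(), "datapackage.json")
save_descriptor({"title": "мамы"}, path)

# Read the raw bytes back to check what was actually written.
with io.open(path, "rb") as f:
    raw = f.read()

print("мамы".encode("utf-8") in raw)  # True
```

The trade-off is that the resulting file is no longer pure ASCII, which matters for any consumer or transport that expects ASCII-only content.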

@akariv @roll Do you have any objections if I fix it?

akariv commented 6 years ago

@AcckiyGerman "title": "мамы" and "title": "\u043c\u0430\u043c\u044b" are equivalent, both decode to exactly the same thing and are both valid under the JSON standard (as ASCII is a subset of UTF-8). The latter is more compatible with transports which prefer ASCII only (e.g. emails...). Why is this important?
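The equivalence can be checked with a short sketch using the standard json module. The helper `is_ascii_only` is introduced here for illustration only:

```python
import json

# The same JSON document, once with escapes and once with literal characters.
escaped_text = '{"title": "\\u043c\\u0430\\u043c\\u044b"}'
literal_text = '{"title": "мамы"}'

# Both forms parse to the same Python value.
same_value = json.loads(escaped_text) == json.loads(literal_text)


def is_ascii_only(text):
    """Illustrative helper: True if the text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(same_value)                   # True
print(is_ascii_only(escaped_text))  # True
print(is_ascii_only(literal_text))  # False
```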

AcckiyGerman commented 6 years ago

@akariv Yes, you are right, thanks. I've tested loading such a descriptor, and package.descriptor displays the title correctly in the terminal.

So it only matters when humans read the raw datapackage.json.

AcckiyGerman commented 6 years ago

Wontfix, to preserve compatibility with transports that prefer ASCII.