frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Question: Which encoding is used when saving german umlauts to JSON #541

Closed The-Aniliner-Gerstung closed 3 years ago

The-Aniliner-Gerstung commented 3 years ago

Overview

Hello,

I'm trying to save data packages, resources and table schemas as JSON using the built-in .to_json() function. The problem is that I have German umlauts (Ä, Ö, Ü, ß) and superscript characters (e.g. m², m³) in my metadata.

When opening the resulting JSON file in PyCharm, it tries to open it with UTF-8 encoding. This results in unrecognised characters and a warning from PyCharm, because the file seems to be encoded in ISO 8859-1 (see first screenshot).

This is how the file should look (second screenshot).

Other editors (like Windows Notepad or Notepad++) recognise the encoding correctly.

My question is: when does frictionless use UTF-8, and when does it use other encodings? Why does it not save in UTF-8 at all times, since the JSON specification (RFC 7159, Section 8.1) specifies UTF-8 as the default encoding?
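For context, Python's `json` module escapes non-ASCII characters by default (`ensure_ascii=True`), which produces pure-ASCII output regardless of file encoding; the encoding of the output file only starts to matter once raw characters are written. A minimal illustration:

```python
import json

text = "öäü ÖÄÜ ß m² m³"

# Default behavior: every non-ASCII character becomes a \uXXXX escape,
# so the serialized string is pure ASCII
escaped = json.dumps(text)

# With ensure_ascii=False the umlauts and superscripts are kept as
# literal characters, so the encoding used to write the file matters
raw = json.dumps(text, ensure_ascii=False)
```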

Thanks in advance and keep up the good work!


Python-Code to reproduce this issue:

import frictionless

pack = frictionless.Package()
pack.name = "name-of-package"
pack.description = "öäü ÖÄÜ ß m² m³ and some other text"

pack.to_json("datapackage_test.json")

Please preserve this line to notify @roll (lead of this repository)

roll commented 3 years ago

Hi @The-Aniliner-Gerstung,

Thanks, I'll investigate. We use Python's json module, so it's not yet clear to me why there is an encoding problem.

The-Aniliner-Gerstung commented 3 years ago

Thank you for the very quick response, @roll! That probably has something to do with the NamedTemporaryFile call? You can specify the encoding there (line 115, metadata.py).
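A sketch of the suspected fix, assuming the temp file is opened in text mode (the variable names here are illustrative, not the actual frictionless internals):

```python
import json
import tempfile

metadata = {"description": "öäü ÖÄÜ ß m² m³"}

# Passing encoding explicitly avoids falling back to the locale's
# preferred encoding (e.g. cp1252 on Windows), which is what can
# produce an ISO 8859-1-looking file
with tempfile.NamedTemporaryFile(
    mode="wt", encoding="utf-8", suffix=".json", delete=False
) as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)
    path = f.name
```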

The second example from the original post was created using the usual

with open("file.json", "w", encoding="utf-8") as f:
    json.dump(artifact.to_dict(), f, indent=2, ensure_ascii=False)

which creates valid UTF-8 (the special characters are written unescaped, which RFC 7159 permits as long as the file encoding is UTF-8).

roll commented 3 years ago

Great! Thanks for the quick analysis.

I'm going to fix it this week. If you're interested, feel free to open a PR adding a (failing -> fixed) test.

The-Aniliner-Gerstung commented 3 years ago

That sounds great, thank you very much! I just tried contributing to your repository but unfortunately my company development environment won't let me install the dependencies.

I have prepared a test file for encoding for you, as well as a valid UTF-8 encoded data package (needed for the second test):

test_file_encoding.zip

The-Aniliner-Gerstung commented 3 years ago

Hello @roll, thank you for your very quick fix. I just downloaded the newest version 3.34.2 and tested the encoding. Files are now written correctly, but I noticed that reading a newly written UTF-8 JSON leads to problems. I guess the with open() call should also pass encoding="utf-8" as a parameter.

I've attached a little script I ran before and after updating the framework (see the version inside the JSON). After the update, the read() function fails.

fl-encoding-test.zip

roll commented 3 years ago

Thanks!

Sorry for these errors; they're really hard to catch because the behavior depends on the user's locale, so our CI doesn't surface them. I'm releasing a fix.

hoffch commented 3 years ago

Thanks for fixing this issue - works like a charm 👍 And BTW: your release rate is impressive - keep going!

farhadmaleki85 commented 1 year ago

Hey guys, I have the same issue with German umlauts, but I did not really understand the solution. Could someone please explain?

roll commented 1 year ago

Hi @farhadmaleki85, can you please create a new issue with your problem description?

tobbarth commented 1 year ago

@farhadmaleki85 this worked for me, because I didn't need the string but the direct output into a JSON file: https://stackoverflow.com/questions/18337407/saving-utf-8-texts-with-json-dumps-as-utf-8-not-as-a-u-escape-sequence
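The technique from that Stack Overflow answer, sketched as a self-contained round trip (filename illustrative): dump the dict straight into a UTF-8 text file with ensure_ascii=False.

```python
import json

data = {"description": "öäü ÖÄÜ ß m² m³"}

# Write straight to a UTF-8 file; ensure_ascii=False keeps the
# umlauts as literal characters instead of \uXXXX escapes
with open("out.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Read it back with the same explicit encoding
with open("out.json", encoding="utf-8") as f:
    assert json.load(f) == data
```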