frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

text/plain sources are not parsed as utf8 #210

Closed akariv closed 6 years ago

akariv commented 6 years ago

When loading a package from a text/plain source, its encoding is assumed to be 'iso-8859-1' (Latin-1) instead of the standard 'utf-8'.

$ curl -vv <redacted>.json | od -a -tx2
...
< HTTP/1.1 200 OK
...
< Content-Type: text/plain
...
0000000    {   "   n   a   m   e   "   :   "   t   e   s   t   3   "   ,
             227b    616e    656d    3a22    7422    7365    3374    2c22
0000020    "   t   i   t   l   e   "   :   "   R   �   �   u   n   i   o
             7422    7469    656c    3a22    5222    a9c3    6e75    6f69

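The bytes c3 a9 (printed by od as the 16-bit word a9c3) are the proper UTF-8 encoding of the letter “é”, so the file itself is valid UTF-8.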
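As a quick sanity check of the dump, here is a minimal Python sketch. It assumes the dump was taken on a little-endian machine, so that `od -tx2` shows each pair of bytes swapped (consistent with the first word `227b` standing for `{"`). The variable names are illustrative only.

```python
import struct

# The word od printed as "a9c3"; on an (assumed) little-endian host
# the underlying bytes in the file are c3 a9.
word = 0xA9C3
raw = struct.pack("<H", word)    # b'\xc3\xa9'

print(raw.decode("utf-8"))       # é
print(raw.decode("iso-8859-1"))  # Ã©
```

The second line of output is exactly the kind of garbling you get when UTF-8 bytes are read as iso-8859-1.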

On the other hand,

>>> from datapackage import Package
>>> p = Package('https://s3.amazonaws.com/rawstore.datahub.io/f539d44e5e88aa895f3dcff4a46d2ba6.json')
>>> p.descriptor
{'name': 'test3', 'title': 'RÃ©union', 'resources': [{'path': 'test.csv', 'pathType': 'local', 'name': 'test', 'format': 'csv', 'mediatype': 'text/csv', 'encoding': 'UTF-8', 'dialect': {'delimiter': ',', 'quoteChar': '"'}, 'schema': {'fields': [{'name': 'bar', 'type': 'integer', 'format': 'default'}, {'name': 'kel', 'type': 'integer', 'format': 'default'}, {'name': 'zhur', 'type': 'integer', 'format': 'default'}, {'name': 'tur', 'type': 'string', 'format': 'default'}], 'missingValues': ['']}, 'profile': 'data-resource'}], 'profile': 'data-package'}

is garbled: the title comes back as 'RÃ©union' instead of 'Réunion'.

akariv commented 6 years ago

The reason for this bug is that requests assumes iso-8859-1 for text/* responses that declare no charset, an assumption that is simply irrelevant for JSON. Since JSON documents are expected to be utf-8 encoded, it's safe to override the encoding.
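To illustrate the idea, here is a minimal standard-library sketch of the override. It is not necessarily the change made in datapackage-py; `parse_json_bytes` and the shortened `payload` are hypothetical and exist only for this example.

```python
import json


def parse_json_bytes(raw: bytes) -> dict:
    """Decode a JSON payload as UTF-8, ignoring any charset guessed from headers."""
    return json.loads(raw.decode("utf-8"))


# Bytes as they would arrive over the wire (a shortened stand-in for the descriptor).
payload = '{"name": "test3", "title": "Réunion"}'.encode("utf-8")

# Under the iso-8859-1 assumption the title is garbled.
garbled = json.loads(payload.decode("iso-8859-1"))
print(garbled["title"])  # RÃ©union

# Decoding explicitly as UTF-8 gives the intended text.
fixed = parse_json_bytes(payload)
print(fixed["title"])    # Réunion
```

With requests itself, the same effect should be achievable by setting `response.encoding = 'utf-8'` before reading `response.text`, or by decoding `response.content` directly as above.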