goinnn closed this issue 1 year ago
Thanks for the issue! Let me try to disentangle that a bit:
The byte string thing is a bit of a red herring, especially since you decode all the byte strings into Python strings anyway before passing them to json-stream. So a slightly simpler version to get the ValueError: No unicode character for code: 55356 at index 15 exception would be:
from io import StringIO
from json_stream import load
json_str = r'{"test": "\uD83C\uDFD4\uFE0F"}'
load(StringIO(json_str))["test"]
AFAICT the actual bug is improper handling of chains of multiple \u...
escape sequences in JSON. Single Unicode escape sequences, on the other hand, work fine:
json_str = r'{"test": "\u00C4"}'
print(load(StringIO(json_str))["test"]) # outputs 'Ä'
Why is that? The underlying issue is that the JSON standard mandates that if one decides to escape Unicode code points beyond U+FFFF, it must be done using UTF-16 surrogate pair escape sequences, but json-stream's Rust tokenizer naively tries to parse each \uXXXX escape into a Rust char separately, and Rust chars specifically can't represent Unicode surrogate code points. So that has to be fixed.
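For reference, the arithmetic for recombining a surrogate pair is straightforward. The following is a minimal sketch in Python; combine_surrogates is a hypothetical helper of mine, not part of json-stream or its Rust tokenizer:

```python
def combine_surrogates(high: int, low: int) -> str:
    """Combine a UTF-16 surrogate pair into the code point it encodes (sketch)."""
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    # Each surrogate carries 10 bits; the pair encodes an offset from U+10000.
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

print(combine_surrogates(0xD83C, 0xDFD4))  # U+1F3D4, the mountain character
```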
json-stream's pure-Python tokenizer appears to be able to parse them without raising an exception, but Python doesn't really seem to like the strings this produces:
from io import StringIO
from json_stream import load
from json_stream.tokenizer import tokenize

json_str = r'{"test": "\uD83C\uDFD4\uFE0F"}'
result = load(StringIO(json_str), tokenizer=tokenize)['test']
print(result)
This will raise UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed in the print(result) line. I'm not sure whether these strings are technically correct or not, because the JSON standard does say:
However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
But Python strings can't be encoded to UTF-8 while they contain unpaired surrogates, so I guess we should combine such pairs into the actual code points they represent anyway. So that should probably be fixed as well.
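For what it's worth, a string that already contains raw surrogate code points can be repaired after the fact with a UTF-16 round trip using the surrogatepass error handler. This is just a sketch of a workaround, not anything json-stream does internally:

```python
# Two raw surrogate code points, as the pure-Python tokenizer produced them:
broken = '\ud83c\udfd4'
# Round-tripping through UTF-16 with 'surrogatepass' merges the pair:
fixed = broken.encode('utf-16', 'surrogatepass').decode('utf-16')
print(fixed == '\U0001F3D4')  # True -- now a single non-BMP code point
```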
Finally, @goinnn, for your last point about quotation marks, that seems to be an issue with that wild chain of decoding and encoding you're doing rather than with json-stream:
>>> json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft \\"Neugebauer\\" zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnersta bis S \\uD83D\\uDE99"}'
>>> json_bytes.decode('utf-8').encode('latin-1').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')
'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft "Neugebauer" zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnersta bis S 🚙"}'
The double quotes inside the JSON string lost the backslash that escaped them, so the JSON string does in fact end before Neugebauer. Since json-stream yields that string before parsing further, it will only raise an error due to the now-malformed JSON if you request the whole dictionary, e.g. by doing list(load(...).items()) instead of load(...)["test"].
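To see where the backslashes went: the unicode-escape codec processes every escape sequence that Python string literals support, including \" (not just \uXXXX), so it silently drops the backslash that JSON needed. A minimal illustration:

```python
# 'unicode-escape' interprets \" as an escaped quote and removes the backslash,
# which is what mangled the embedded JSON quotes in the decoding chain above.
raw = rb'\"Neugebauer\"'
print(raw.decode('unicode-escape'))  # "Neugebauer"
```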
Thanks @smheidrich, so are there three bugs?
@goinnn Just one that's relevant for you, which should be fixed in json-stream-rs-tokenizer >= 0.4.17
, so you could try updating and seeing if it works now.
Hi @smheidrich,
Now works in our cases. Thanks!! 🥳🥳
import json
import json_stream
from io import StringIO
json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnerstag bis S \\uD83D\\uDE99"}'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test']
Result: 'Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnerstag bis S 🚙'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test'] == json.loads(json_bytes)['test']
Result: True
import json
import json_stream
from io import StringIO
json_bytes = b'{"test": "\xc4\x8d \\uD83D\\uDE99"}'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test']
Result: 'č 🚙'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test'] == json.loads(json_bytes)['test']
Result: True
import json
import json_stream
from io import StringIO
json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft \\"Neugebauer\\" zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnersta bis S \\uD83D\\uDE99"}'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test']
Result: 'Direkt auf dem Parkplatz der Gastwirtschaft "Neugebauer" zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnersta bis S 🚙'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test'] == json.loads(json_bytes)['test']
Result: True
This seems to be an issue in the python tokenizer too...
First, some background, at least as I understand it!
In Python, the str type is not in any encoding (UTF-8/16/latin-1/whatever). It is a "unicode string", which means a simple sequence of Unicode code points, where a code point is just a number such as U+00C4 (Ä) or U+1D11E (𝄞).
When outputting unicode to a file/socket or the terminal, which are natively byte-oriented, unicode strings must be converted into an 8-bit byte format. Typically this format is UTF-8.
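A quick illustration of the code point vs. byte distinction:

```python
# Two code points (one BMP, one beyond), no encoding involved yet:
s = '\u00c4\U0001D11E'  # Ä and the musical symbol G clef
b = s.encode('utf-8')   # a concrete byte representation for output
print(len(s), len(b))   # 2 code points, 6 UTF-8 bytes
```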
JSON is natively unicode, in that a JSON text is a unicode string, and when transmitted or written it is typically encoded as UTF-8, so unicode code points don't have to be escaped at all; they can be represented literally.
However, within a JSON string, code points may be escaped (replaced with an escape sequence such as \uXXXX). This only works for characters in the Basic Multilingual Plane (up to U+FFFF). Beyond that, the JSON standard requires that if you are escaping, you must encode the code point in UTF-16, as a UTF-16 surrogate pair which has then been escaped into \uXXXX\uXXXX.
Therefore, it is perfectly valid to have JSON containing either a single escaped non-surrogate unicode code point (i.e. \uXXXX) or a pair of escaped surrogate code points (\uXXXX\uXXXX), and we should handle this in the tokenizer!
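The standard library's json module is a handy reference point for the expected behaviour here, since it accepts both escape forms:

```python
import json

# A single escaped BMP code point and an escaped surrogate pair are both valid:
assert json.loads(r'"\u00c4"') == '\u00c4'              # Ä
assert json.loads(r'"\ud834\udd1e"') == '\U0001D11E'    # the G clef, U+1D11E
print("both forms parsed correctly")
```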
I'm currently working on getting the following tests working:
def test_unicode(self):
    data = json.dumps('Ä')  # data is the JSON string: "\u00c4"
    result = load(StringIO(data))
    self.assertEqual(result, 'Ä')

def test_non_bmp_unicode(self):
    data = json.dumps('𝄞')  # data is the JSON string: "\ud834\udd1e"
    result = load(StringIO(data))
    self.assertEqual(result, '𝄞')
These tests take a python string containing a single unicode code point, encode it into JSON using the standard json
library, and decode the resulting JSON string back to a python string. This operation should result in the output string matching the input string exactly.
In test_unicode()
, the character is within the BMP, and the data
that is produced contains a single \uXXXX
escape sequence: "\u00c4"
. This is handled in all cases (by both the python and rust tokenizers).
In test_non_bmp_unicode(), the character is from the Supplementary Multilingual Plane, and data is therefore in the UTF-16 surrogate pair form \uXXXX\uXXXX: "\ud834\udd1e".
Using the python tokenizer, decoding this JSON string results in a unicode string containing the two separate surrogate code points themselves, rather than the single code point they encode together.
Versions of the rust tokenizer prior to 0.4.17 also fail this test, but with an error, as the rust char
type specifically excludes surrogates. Version 0.4.17 correctly decodes the surrogate pair to a single unicode code point.
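For contrast, Python's str type happily holds a lone surrogate; the failure only surfaces when you try to encode it. A small demonstration:

```python
lone = chr(0xD834)  # a high surrogate with no partner; a valid Python str
print(len(lone))    # 1
try:
    lone.encode('utf-8')  # this is where Python draws the line
except UnicodeEncodeError:
    print("lone surrogates cannot be UTF-8 encoded")
```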
@smheidrich perhaps there are things you could take from #42 that will improve the rust tokenizer?
@daggaz I'll probably do just that, as I also thought that needing 4 different states, one for each digit of a Unicode codepoint, seems excessive, but decided against refactoring it to maintain some similarity to the Python tokenizer. With that barrier gone, I'll probably copy your approach :) Thanks.
This change to the python tokenizer has now been merged, and a new release, 2.3.1, will be available shortly.
json-stream should accept byte strings, because there is some JSON we cannot parse with json-stream but can parse with the json library.
Using the json library works:
Using the json_stream library does not work:
We can do something like this:
But this does not work with other byte strings:
But this works perfectly with the json library:
Also, there is another related problem. The original json_bytes variable contained quotation marks:
If I run the workaround, json-stream cuts the result short: