daggaz / json-stream

Simple streaming JSON parser and encoder.
MIT License

json-stream tokenizer unicode surrogate handling #39

Closed goinnn closed 1 year ago

goinnn commented 1 year ago

json-stream should accept byte strings, because there is some JSON we cannot parse with json-stream but can parse with the json library.

Using json library works:

import json
json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnerstag bis S \\uD83D\\uDE99"}'
json.loads(json_bytes)['test']

Result: 'Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnerstag bis S 🚙'

Using the json_stream library does not work:

 import json_stream
 from io import StringIO
 json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnerstag bis S \\uD83D\\uDE99"}'
 json_stream.load(StringIO(json_bytes.decode('utf-8')))['test']

This raises an exception (ValueError: No unicode character for code: 55357 at index 162).

We can do something like this:

 import json_stream
 from io import StringIO
 json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnerstag bis S \\uD83D\\uDE99"}'
 json_stream.load(StringIO(json_bytes.decode('utf-8').encode('latin-1').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')))['test']

 Result: 'Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnerstag bis S 🚙'

But this does not work with other byte strings:

 import json_stream
 from io import StringIO
 json_bytes = b'{"test": "\xc4\x8d \\uD83D\\uDE99"}'
 json_stream.load(StringIO(json_bytes.decode('utf-8').encode('latin-1').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')))['test']

This raises an exception: UnicodeEncodeError: 'latin-1' codec can't encode character '\u010d' in position 10: ordinal not in range(256). (The latin-1 round trip only works when every code point in the decoded string is below U+0100, and 'č' is U+010D.)

But this works perfectly with the json library:

import json
json_bytes = b'{"test": "\xc4\x8d \\uD83D\\uDE99"}'
json.loads(json_bytes)['test']

Result: 'č 🚙'

There is also another, related problem. The original json_bytes variable contained escaped quotation marks:

json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft \\"Neugebauer\\" zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnersta bis S \\uD83D\\uDE99"}'

If I run the workaround, json-stream cuts the result short:

import json_stream
from io import StringIO
json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft \\"Neugebauer\\" zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnersta bis S \\uD83D\\uDE99"}'
json_stream.load(StringIO(json_bytes.decode('utf-8').encode('latin-1').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')))['test']

Result: 'Direkt auf dem Parkplatz der Gastwirtschaft '
smheidrich commented 1 year ago

Thanks for the issue! Let me try to disentangle that a bit:

The byte string thing is a bit of a red herring, especially since you decode all the byte strings into Python strings anyway before passing them to json-stream. So a slightly simpler version to get the ValueError: No unicode character for code: 55356 at index 15 exception would be:

from io import StringIO
from json_stream import load

json_str = r'{"test": "\uD83C\uDFD4\uFE0F"}'
load(StringIO(json_str))["test"]

AFAICT the actual bug is improper handling of chains of multiple \u... escape sequences in JSON. Single Unicode escape sequences, on the other hand, work fine:

json_str = r'{"test": "\u00C4"}'
print(load(StringIO(json_str))["test"])  # outputs 'Ä'

Why is that? The underlying issue is that the JSON standard mandates that if one decides to escape Unicode code points beyond U+FFFF, it should be done using UTF-16 surrogate pair sequences, but json-stream's Rust tokenizer naively tries to parse each \u... escape into a Rust char separately, and Rust chars specifically can't represent Unicode surrogate code points. So that has to be fixed.
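For illustration, the arithmetic for combining a UTF-16 surrogate pair into a single code point is straightforward (a minimal Python sketch of the standard algorithm, not json-stream's actual code):

# Standard UTF-16 surrogate pair combination (illustrative sketch only).
def combine_surrogates(high, low):
    # high is in 0xD800-0xDBFF, low is in 0xDC00-0xDFFF
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

print(combine_surrogates(0xD83D, 0xDE99))  # '🚙' (U+1F699)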

json-stream's pure-Python tokenizer appears to be able to parse them without raising an exception, but Python doesn't really seem to like the strings this produces:

from io import StringIO
from json_stream import load
from json_stream.tokenizer import tokenize

json_str = r'{"test": "\uD83C\uDFD4\uFE0F"}'
result = load(StringIO(json_str), tokenizer=tokenize)['test']
print(result)

This will raise UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed in the print(result) line. I'm not sure whether these are technically correct or not, because the JSON standard does say

However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.

But I guess if Python strings containing lone surrogates can't be encoded to UTF-8 (which is exactly where the print above fails), then we should combine the surrogate pairs into the single code points they represent anyway. So that should probably be fixed as well.
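For reference, once you have a string containing the two lone surrogates, the surrogatepass error handler from your workaround chain is enough to combine them (a minimal sketch, not library code):

# Combine lone surrogates into the code point they represent (sketch).
s = '\ud83c\udfd4'  # two surrogate code points, as the pure-Python tokenizer produces
fixed = s.encode('utf-16', 'surrogatepass').decode('utf-16')
print(fixed)  # '🏔' (U+1F3D4), and len(fixed) == 1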

Finally, @goinnn, for your last point about quotation marks, that seems to be an issue with that wild chain of decoding and encoding you're doing rather than with json-stream:

>>> json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft \\"Neugebauer\\" zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnersta bis S \\uD83D\\uDE99"}'
>>> json_bytes.decode('utf-8').encode('latin-1').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')
'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft "Neugebauer" zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnersta bis S 🚙"}'

The double quotes inside the JSON string lost the backslash that escaped them, so the JSON string does in fact end before Neugebauer then. Since json-stream yields that string before parsing further, it will only raise an error due to the now-malformed JSON if you request the whole dictionary, e.g. by doing list(load(...).items()) instead of load(...)["test"].
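To illustrate (a sketch of the behavior described above, using the same transient access pattern as in your examples; the JSON literal here is hypothetical):

from io import StringIO
import json_stream

# The inner quotes lost their escaping, so the JSON string ends early:
broken = '{"test": "Gastwirtschaft "Neugebauer" zu finden"}'
print(json_stream.load(StringIO(broken))["test"])  # 'Gastwirtschaft ' (no error yet)
# list(json_stream.load(StringIO(broken)).items()) would raise, because the
# remainder after the prematurely-closed string is no longer valid JSON.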

goinnn commented 1 year ago

Thanks @smheidrich, so are there three bugs?

smheidrich commented 1 year ago

@goinnn Just one that's relevant for you, which should be fixed in json-stream-rs-tokenizer >= 0.4.17, so you could try updating and seeing if it works now.

goinnn commented 1 year ago

Hi @smheidrich,

It now works in our cases. Thanks!! 🥳🥳

import json
import json_stream
from io import StringIO
json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnerstag bis S \\uD83D\\uDE99"}'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test']

Result: 'Direkt auf dem Parkplatz der Gastwirtschaft Neugebauer zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnerstag bis S 🚙'

json_stream.load(StringIO(json_bytes.decode('utf-8')))['test'] == json.loads(json_bytes)['test']

Result: True

import json
import json_stream
from io import StringIO
json_bytes = b'{"test": "\xc4\x8d \\uD83D\\uDE99"}'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test']

Result: 'č 🚙'

json_stream.load(StringIO(json_bytes.decode('utf-8')))['test'] == json.loads(json_bytes)['test']

Result: True

import json
import json_stream
from io import StringIO
json_bytes = b'{"test": "Direkt auf dem Parkplatz der Gastwirtschaft \\"Neugebauer\\" zu finden, 24/7 \xc3\xb6ffentlich Zug\xc3\xa4nglich,\\r\\nLeckere Spei\xc3\x9fen und Getr\xc3\xa4nke von Donnersta bis S \\uD83D\\uDE99"}'
json_stream.load(StringIO(json_bytes.decode('utf-8')))['test']

Result: 'Direkt auf dem Parkplatz der Gastwirtschaft "Neugebauer" zu finden, 24/7 öffentlich Zugänglich,\r\nLeckere Speißen und Getränke von Donnersta bis S 🚙'

json_stream.load(StringIO(json_bytes.decode('utf-8')))['test'] == json.loads(json_bytes)['test']

Result: True
daggaz commented 1 year ago

This seems to be an issue in the python tokenizer too...

First, some background, at least as I understand it!

In python, the str type is not in any encoding (UTF-8/16/latin-1/whatever). It is a "unicode string", which means a simple sequence of unicode code points, where a unicode code point is just a number such as U+00C4 (Ä) or U+1D11E (𝄞).

When outputting unicode to a file/socket or the terminal, which are natively byte-oriented, unicode strings must be converted into an 8-bit byte format. Typically this format is UTF-8.
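A quick illustration of the distinction (plain Python, nothing json-stream specific):

s = 'Ä𝄞'
print([hex(ord(c)) for c in s])  # ['0xc4', '0x1d11e'], the code points
print(s.encode('utf-8'))         # b'\xc3\x84\xf0\x9d\x84\x9e', the UTF-8 bytes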

JSON is natively unicode, in that a JSON text is a unicode string, and when transmitted/written it is typically encoded as UTF-8, so unicode code points don't have to be escaped at all; they can be represented literally.

However, within a JSON string, code points may be escaped (replaced with an escape sequence such as \uXXXX). This only works for characters in the Basic Multilingual Plane (up to U+FFFF). Beyond that, the JSON standard requires that if you escape a code point, you must encode it in UTF-16, as a UTF-16 surrogate pair which is then escaped as \uXXXX\uXXXX.

Therefore, it is perfectly valid to have JSON containing either a single escaped non-surrogate unicode code point (i.e. \uXXXX) or a pair of escaped surrogate code points (\uXXXX\uXXXX), and we should handle this in the tokenizer!
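Both forms can be checked against the stdlib json module as a reference:

import json
print(json.loads(r'"\u00c4"'))        # 'Ä', a single BMP escape
print(json.loads(r'"\ud834\udd1e"'))  # '𝄞', an escaped surrogate pair for U+1D11E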

I'm currently working on getting the following tests working:


    def test_unicode(self):
        data = json.dumps('Ä')  # data is a JSON: "\u00c4"
        result = load(StringIO(data))
        self.assertEqual(result, 'Ä')

    def test_non_bmp_unicode(self):
        data = json.dumps('𝄞')  # data is JSON: "\ud834\udd1e"
        result = load(StringIO(data))
        self.assertEqual(result, '𝄞')

These tests take a python string containing a single unicode code point, encode it into JSON using the standard json library, and decode the resulting JSON string back to a python string. This operation should result in the output string matching the input string exactly.

In test_unicode(), the character is within the BMP, and the data that is produced contains a single \uXXXX escape sequence: "\u00c4". This is handled in all cases (by both the python and rust tokenizers).

In test_non_bmp_unicode(), the character is from the Supplementary Multilingual Plane, and the data is therefore in the UTF-16 surrogate pair form \uXXXX\uXXXX: "\ud834\udd1e".

Using the python tokenizer, decoding this JSON string results in a unicode string containing the two separate individual code points for the surrogates themselves.
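For example (inspecting such a string directly in Python):

s = '\ud834\udd1e'  # what the pre-fix pure-Python tokenizer produced for "\ud834\udd1e"
print(len(s), [hex(ord(c)) for c in s])  # 2 ['0xd834', '0xdd1e'], two lone surrogates
# s.encode('utf-8') raises UnicodeEncodeError: surrogates not allowed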

Versions of the rust tokenizer prior to 0.4.17 also fail this test, but with an error, as the rust char type specifically excludes surrogates. Version 0.4.17 correctly decodes the surrogate pair to a single unicode code point.

daggaz commented 1 year ago

@smheidrich perhaps there are things you could take from #42 that will improve the rust tokenizer?

smheidrich commented 1 year ago

@daggaz I'll probably do just that, as I also thought that needing 4 different states, one for each digit of a Unicode codepoint, seems excessive, but decided against refactoring it to maintain some similarity to the Python tokenizer. With that barrier gone, I'll probably copy your approach :) Thanks.

daggaz commented 1 year ago

This change to the python tokenizer has now been merged, and a new release, 2.3.1, will be available shortly.