Tencent / rapidjson

A fast JSON parser/generator for C++ with both SAX/DOM style API
http://rapidjson.org/

Deserialization fails on invalid unicode code point #2259

Open cmanallen opened 5 months ago

Version: python-rapidjson==1.14

To reproduce:

>>> import rapidjson
>>> rapidjson.loads('"\ud83c"')

Error message:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed

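\ud83c is a lone high surrogate, not a valid Unicode scalar value, so it cannot be encoded as strict UTF-8. Currently deserialization fails with an exception. This is unusual compared to other JSON parsers, such as Python's built-in json module, which return the string with the surrogate left in place instead of raising.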

Consider Python's built-in json module, which returns the following for a valid code point and for a lone surrogate:

>>> json.loads('"\u266a"')
'♪'
>>> json.loads('"\ud83c"')
'\ud83c'

By contrast, rapidjson handles the first input but raises on the second:

>>> rapidjson.loads('"\u266a"')
'♪'
>>> rapidjson.loads('"\ud83c"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 1: surrogates not allowed
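
A note on the reproduction: because the string literal is not a raw string, Python itself converts \ud83c into a single lone-surrogate character before any JSON parser sees it. The sketch below uses only the standard library to show the difference between the two spellings and why a strict UTF-8 encode of such a string raises the same error as in the traceback. It is only an illustration; that rapidjson encodes its input to UTF-8 before parsing is an assumption inferred from the error message, not something confirmed here, and the names `literal`, `escaped`, and `is_utf8_encodable` are hypothetical.

```python
import json

# Non-raw literal: Python's own parser turns \ud83c into ONE lone-surrogate character.
literal = '"\ud83c"'
# Raw literal: the six characters \ud83c reach the JSON parser as a JSON escape.
escaped = r'"\ud83c"'


def is_utf8_encodable(s: str) -> bool:
    """Return True if s can be encoded with the strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The stdlib json module accepts both spellings and keeps the lone surrogate.
from_literal = json.loads(literal)
from_escaped = json.loads(escaped)

print(len(literal), len(escaped))                  # lengths of the two inputs
print(from_literal == from_escaped == "\ud83c")    # same result either way
print(is_utf8_encodable(literal))                  # strict UTF-8 rejects surrogates

# The "surrogatepass" error handler encodes the surrogate instead of raising.
print(from_literal.encode("utf-8", "surrogatepass"))
```

If the assumption holds, passing the raw-string form to rapidjson would exercise the JSON escape parser rather than the input encoding step, which may be worth checking separately when triaging this issue.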