Closed mbrock closed 1 year ago
Thanks. Merged after some cosmetic changes. Weird that people allow \uXXXX-encoded
UTF-16 inside UTF-8 encoded data ... This surely was not in the original JSON spec when I wrote this ...
Weird indeed! I ran into this with the JSON APIs of both Telegram and Readwise when dealing with emojis.
Thanks for merging, and for everything!
The JSON string "\ud83d\udc95" has one codepoint, not two.
This is because the spec allows extended characters (those outside the Basic Multilingual Plane) to be encoded as a pair of 16-bit \uXXXX escapes, called a "surrogate pair".
From RFC 4627:
This commit fixes the JSON parser to handle such surrogate pairs.
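For illustration, here is a minimal Python sketch of how a surrogate pair can be combined into a single codepoint. It is not the code from this commit; the function names `combine_surrogates` and `decode_units` are hypothetical, and the input is assumed to be the list of 16-bit values already parsed from the \uXXXX escapes.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_units(units: list[int]) -> str:
    """Turn 16-bit values from \\uXXXX escapes into a string.

    Hypothetical helper: a high surrogate followed by a low surrogate
    becomes one character; every other value is passed through as-is
    (including lone surrogates, which this sketch does not reject).
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            out.append(chr(combine_surrogates(unit, units[i + 1])))
            i += 2
        else:
            out.append(chr(unit))
            i += 1
    return "".join(out)


# "\ud83d\udc95" from the commit message decodes to the single codepoint U+1F495.
print(hex(combine_surrogates(0xD83D, 0xDC95)))
print(len(decode_units([0xD83D, 0xDC95])))
```

How to treat a lone surrogate (reject it, pass it through, or substitute U+FFFD) is a design choice the sketch leaves open.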