SWI-Prolog / packages-http

The SWI-Prolog HTTP server and client libraries
Fix #158: Handle surrogate pairs in http/json #159

Closed mbrock closed 1 year ago

mbrock commented 1 year ago

The JSON string "\ud83d\udc95" decodes to one code point (U+1F495), not two.

This is because the spec allows characters outside the Basic Multilingual Plane to be escaped as a pair of 16-bit UTF-16 code units, called a "surrogate pair".

From RFC 4627:

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

This commit fixes the JSON parser to handle such surrogate pairs.
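
For illustration only, here is a minimal Python sketch of the arithmetic that turns a surrogate pair into a single code point. It is not the Prolog code from this commit; the function name `combine_surrogates` is hypothetical, and the range checks follow the UTF-16 definition of high (U+D800..U+DBFF) and low (U+DC00..U+DFFF) surrogates.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000
    # because surrogate pairs only encode code points above the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The G clef example from RFC 4627 and the emoji from this report.
g_clef = combine_surrogates(0xD834, 0xDD1E)
two_hearts = combine_surrogates(0xD83D, 0xDC95)

# Python's own JSON decoder already joins the pair into one character,
# which is the behaviour this change brings to http/json.
decoded = json.loads('"\\ud83d\\udc95"')
```

A parser that reads a `\uXXXX` escape in the high-surrogate range would look ahead for a second `\uXXXX` escape in the low-surrogate range and emit the combined code point instead of two separate ones.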

JanWielemaker commented 1 year ago

Thanks. Merged after some cosmetic changes. Weird that people allow \uXXXX-encoded UTF-16 inside UTF-8 encoded data ... This surely was not in the original JSON spec when I wrote this ...

mbrock commented 1 year ago

Weird indeed! I ran into this with the JSON APIs of both Telegram and Readwise when dealing with emojis.

Thanks for merging, and for everything!