SWI-Prolog / packages-http

The SWI-Prolog HTTP server and client libraries

Incorrect parsing of JSON strings with surrogate pair escape sequences #158

Open mbrock opened 1 year ago

The JSON string "\ud83d\udc95" encodes one code point (U+1F495), not two.

The JSON spec allows a character outside the Basic Multilingual Plane to be escaped as two consecutive \uXXXX escapes, each holding one 16-bit half of a UTF-16 "surrogate pair".

From RFC 4627:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair.  So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
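
For reference, the arithmetic behind the RFC's example can be sketched in a few lines of Python. This is only an illustration of the standard UTF-16 formula, not the code in SWI-Prolog or in my fork; the function name `combine_surrogates` is made up for this example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD834, 0xDD1E)))  # 0x1d11e (G clef)
print(hex(combine_surrogates(0xD83D, 0xDC95)))  # 0x1f495
```

A parser that sees a `\uXXXX` escape in the range D800-DBFF would need to read the following escape and apply this formula before emitting a character.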

SWI-Prolog's JSON parser, however, reads such a string as two separate characters, each an unpaired surrogate and therefore invalid on its own.
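
To make the difference concrete, here is a rough Python comparison. The `naive_unescape` function is a hypothetical stand-in that mimics the behaviour described above by decoding each escape independently; it is not taken from the SWI-Prolog sources. Python's own `json.loads` is used as an example of a parser that pairs the surrogates.

```python
import json
import re


def naive_unescape(body: str) -> str:
    """Hypothetical buggy decoder: every \\uXXXX escape becomes its own character."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        body,
    )


escaped = r"\ud83d\udc95"  # the string body from the report, without the quotes

naive = naive_unescape(escaped)
correct = json.loads('"' + escaped + '"')

print(len(naive), [hex(ord(c)) for c in naive])      # 2 ['0xd83d', '0xdc95']
print(len(correct), [hex(ord(c)) for c in correct])  # 1 ['0x1f495']
```

The first result is what the report describes: two lone surrogates. The second is the single code point the string should contain.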

I have fixed this in my fork and will submit a pull request.