MarcosLopezC / LightJson

A simple JSON library for C#.
MIT License
60 stars 21 forks source link

Incorrect `JsonParseException` for some control characters #43

Closed ElteHupkes closed 2 years ago

ElteHupkes commented 5 years ago

From the official JSON specification, section 9 (emphasis mine):

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All code points may be placed within the quotation marks except for the code points that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.

JsonReader checks for invalid control characters using char.IsControl(), and from its documentation:

The Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters.

In other words, there is a range of characters which are designated Unicode control characters (and thus return a positive result from char.IsControl()) which are not deemed control characters according to the JSON standard. These characters will therefore not be escaped by many JSON serializers, which may cause an invalid JsonParseException to be thrown when the JSON contains them.

Of course this is all very unlikely, and of course therefore it just happened to me while consuming a JSON API and I spent the last few hours debugging the problem (turns out, \u007f doesn't quite show up in string output!).

EDIT: From the top of my head, as a suggested fix, I think replacing char.IsControl(c) with c <= '\u001f' should do the trick.

nbcks commented 2 years ago

Hey Elte Hupkes!

I have the fix for this but I do it with not equal to backslash or double quote. I believe your suggestion would introduce a bug where json such as { "hello" : "\" } would be allowed!

Looking at the spec: https://datatracker.ietf.org/doc/html/rfc8259#section-7

   Plane, the character is represented as a 12-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

      string = quotation-mark *char quotation-mark

      char = unescaped /
          escape (
              %x22 /          ; "    quotation mark  U+0022
              %x5C /          ; \    reverse solidus U+005C
              %x2F /          ; /    solidus         U+002F
              %x62 /          ; b    backspace       U+0008
              %x66 /          ; f    form feed       U+000C
              %x6E /          ; n    line feed       U+000A
              %x72 /          ; r    carriage return U+000D
              %x74 /          ; t    tab             U+0009
              %x75 4HEXDIG )  ; uXXXX                U+XXXX

      escape = %x5C              ; \

      quotation-mark = %x22      ; "

      unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

I do not check for > 0x10FFFF as this is out of the range of the character (However this may also be a bug - I believe I need to peek ahead to get two bytes due to UTF-16).

Let me make a pull request

nbcks commented 2 years ago

Sorry @ElteHupkes you are correct as blackslash and double quotes are already handled above. You only need to check for < 0x20 or <= 0x1F as you say!

nbcks commented 2 years ago

https://github.com/MarcosLopezC/LightJson/pull/44

ElteHupkes commented 2 years ago

I would like to say I can still comment on the specifics, but after two and a half years they seem to have slipped my working memory 😉. Great to see this fixed though!