JuliaIO / JSON.jl

JSON parsing and printing

JSON.parse fails for cweb/url-testing/master/urls.json #232

Open · samoconnor opened this issue 6 years ago

samoconnor commented 6 years ago

Input: https://raw.githubusercontent.com/cweb/url-testing/master/urls.json

julia> r = HTTP.get("https://raw.githubusercontent.com/cweb/url-testing/master/urls.json")
julia> JSON.parse(HTTP.payload(r, String))
ERROR: Expected 'u' here
Line: 36
Around: ...\\\ud800\\\u597D",          "e...
                    ^

Stacktrace:
 [1] _error(::String, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:135
 [2] skip! at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:76 [inlined]
 [3] read_unicode_escape!(::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:257
 [4] predict_string(::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/specialized.jl:60
 [5] parse_string(::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/specialized.jl:5
 [6] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:157
 [7] parse_object(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:218
 [8] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:161
 [9] parse_array(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:191
 [10] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:163
 [11] parse_object(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:218
 [12] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:161
 [13] parse_array(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:191
 [14] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:163
 [15] parse_object(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:218
 [16] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:161
 [17] parse_object(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:218
 [18] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64}, ::JSON.Parser.MemoryParserState) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:161
 [19] #parse#1(::Type{Dict{String,Any}}, ::Type{Int64}, ::Function, ::String) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:387
 [20] parse(::String) at /Users/sam/.julia/v0.6/JSON/src/Parser.jl:385
TotalVerb commented 6 years ago

Closing based on the following paragraph of the JSON standard:

8.2. Unicode Characters

When all the strings represented in a JSON text are composed entirely of Unicode characters [UNICODE] (however escaped), then that JSON text is interoperable in the sense that all software implementations that parse it will agree on the contents of names and of string values in objects and arrays.

However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair. The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

There is nothing we can sensibly return in this case. A fatal runtime exception is reasonable.

TotalVerb commented 6 years ago

This is of course a consequence of using UTF-8, as opposed to the UTF-16 that WebKit (or whoever these tests are intended for) uses. The Unicode FAQ clarifies the status of code points like \ud800: they are not characters, and are in fact even more restricted than noncharacters, being reserved for UTF-16 use only.

That correction results in the current situation for Unicode, where noncharacters are valid Unicode scalar values, are valid in Unicode strings, and must be mapped through UTFs, whereas surrogate code points are not valid Unicode scalar values, are not valid in UTFs, and cannot be mapped through UTFs.

samoconnor commented 6 years ago

Node handles this file without issues:

$ node
> var fs = require("fs");
> var content = fs.readFileSync("urls.json.txt");
> var jsonContent = JSON.parse(content);
> jsonContent.tests.group[0].test[4]
{ id: '5',
  name: 'Illegal Unicode half-surrogate U+D800',
  url: 'http://www.example.com/#\\�\\好',
  expect_url: 'http://www.example.com/#\\�\\好' }

I don't think the underlying encoding (UTF-8 vs UTF-16) is relevant when dealing with '\u' escape sequences. \ud800 is a 16-bit number, but it can be encoded using the UTF-8 scheme: Vector{UInt8}("\ud800") -> [0xed, 0xa0, 0x80]

My understanding is that there are no "invalid" Strings in Julia v0.7. @StefanKarpinski? I.e. String(::Vector{UInt8}) produces a String for any byte sequence:

while true
    s = String(rand(UInt8, 100))
    println(s)
    println([c for c in s])
end
julia> VERSION
v"0.7.0-DEV.3429"

julia> x = "foo\uDEADbar"
"foo\udeadbar"

julia> Vector{Char}(x)
7-element Array{Char,1}:
 'f'
 'o'
 'o'
 '\udead'
 'b'
 'a'
 'r'

With the patch below, the file parses as expected:

julia> x = JSON.parse(String(read("urls.json.txt")))
julia> x["tests"]["group"][1]["test"][5]
Dict{String,Any} with 4 entries:
  "name"       => "Illegal Unicode half-surrogate U+D800"
  "expect_url" => "http://www.example.com/#\\�\\好"
  "id"         => "5"
  "url"        => "http://www.example.com/#\\\ud800\\好"
diff --git a/src/Parser.jl b/src/Parser.jl
index de04aa0..fd66ef3 100644
--- a/src/Parser.jl
+++ b/src/Parser.jl
@@ -265,6 +265,7 @@ end

 function parse_string(ps::ParserState)
     b = UInt8[]
+    surrogate = UInt16(0x0000)
     incr!(ps)  # skip opening quote
     while true
         c = advance!(ps)
@@ -272,19 +273,29 @@ function parse_string(ps::ParserState)
         if c == BACKSLASH
             c = advance!(ps)
             if c == LATIN_U  # Unicode escape
-                append!(b, Vector{UInt8}(string(read_unicode_escape!(ps))))
+                u1 = read_four_hex_digits!(ps)
+                if surrogate > 0
+                    append!(b, Vector{UInt8}(string(utf16_get_supplementary(surrogate, u1))))
+                elseif utf16_is_surrogate(u1)
+                    surrogate = u1
+                else
+                    append!(b, Vector{UInt8}(string(Char(u1))))
+                end
+                continue
             else
                 c = get(ESCAPES, c, 0x00)
                 c == 0x00 && _error(E_BAD_ESCAPE, ps)
-                push!(b, c)
             end
-            continue
         elseif c < SPACE
             _error(E_BAD_CONTROL, ps)
         elseif c == STRING_DELIM
             return String(b)
         end

+        if surrogate > 0
+            append!(b, Vector{UInt8}(string(Char(surrogate))))
+            surrogate = UInt16(0x0000)
+        end
         push!(b, c)
     end
 end
diff --git a/src/specialized.jl b/src/specialized.jl
index e139c80..d8a05fd 100644
--- a/src/specialized.jl
+++ b/src/specialized.jl
@@ -1,4 +1,5 @@
 # Specialized functions for increased performance when JSON is in-memory
+#=
 function parse_string(ps::MemoryParserState)
     # "Dry Run": find length of string so we can allocate the right amount of
     # memory from the start. Does not do full error checking.
@@ -20,6 +21,7 @@ function parse_string(ps::MemoryParserState)

     String(b)
 end
+=#

 """
 Scan through a string at the current parser state and return a tuple containing
@@ -43,6 +45,7 @@ This function will throw an error if:

 No error is thrown when other invalid backslash escapes are encountered.
 """
+#=
 function predict_string(ps::MemoryParserState)
     e = endof(ps.utf8data)
     fastpath = true  # true if no escapes in this string, so it can be copied
@@ -123,6 +126,7 @@ function parse_string(ps::MemoryParserState, b)
     ps.s = s + 1
     b
 end
+=#

 function parse_number(ps::MemoryParserState)
     s = p = ps.s
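For reference, the UTF-16 surrogate-pair arithmetic the patch above leans on (via utf16_is_surrogate and utf16_get_supplementary, which were internal Base helpers at the time) can be sketched roughly like this — a hedged reimplementation, not the actual Base code:

```julia
# Hedged sketch of the UTF-16 helpers the patch assumes.
utf16_is_surrogate(u::UInt16) = 0xd800 <= u <= 0xdfff

# Combine a lead (0xD800-0xDBFF) and trail (0xDC00-0xDFFF) surrogate
# into the supplementary-plane code point they encode.
utf16_get_supplementary(lead::UInt16, trail::UInt16) =
    Char(0x10000 + ((UInt32(lead) - 0xd800) << 10) + (UInt32(trail) - 0xdc00))

utf16_get_supplementary(0xd801, 0xdc37)  # '𐐷' (U+10437)
```

Note that the patch calls utf16_get_supplementary whenever a \u escape follows a pending lead surrogate, even when the second code unit is not a trail surrogate (as in the \ud800\u597D input above), which is one reason it is only a proof of concept.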
samoconnor commented 6 years ago

the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD"

To me, this says that the specification allows "invalid" Unicode.

The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.

To me, "or even" says that runtime exceptions are the least desirable behaviour. Julia has well defined rules for length and indexing of Strings with "invalid" unicode.

TotalVerb commented 6 years ago

As far as I know, Node encodes strings in UTF-16, where these code points do make "sense".

I am aware that Julia v0.7 does support such codepoints, but whether we want to create them is another matter entirely. Even in Julia v0.7 we have the following:

julia> isvalid("\ud800")
false

(If we make this change, we would also want to discontinue support for v0.6, which treats such things as "invalid UTF8 data".) My interpretation of "surrogate code points are not valid Unicode scalar values, are not valid in UTFs, and cannot be mapped through UTFs" is that mapping these is a violation of the Unicode standard.

TotalVerb commented 6 years ago

Out of curiosity, what is your use case for parsing these half-surrogates? I can't imagine any well-formed "real" data will have them. If the use case is to parse non-well-formed data, I would argue that this is a non-standard use case and should be supported via non-standard string parsing contexts, which can be implemented like the existing non-standard integer contexts.

samoconnor commented 6 years ago

I think the fact that Julia separates isvalid from the representation of String is an argument that a de-serialisation parser should produce a String whenever it can, and let the user call isvalid if that is what they want.

what is your use case for parsing these half-surrogates?

The urls.json file that I'm parsing is a test suite for URL parsing. The "invalid" backslash-u sequences are part of test URLs that are intended to test that URL parsers handle "invalid" escape sequences. It seems to me that decisions about code points being meaningful characters do not belong in the parser; they should be deferred to a layer that cares about the meaning of the characters. However, I guess JSON.jl could include a kw arg "ensure_valid_unicode=".

JSON only comes into this because the URL test suite originates in JavaScript and JSON is JavaScript's serialisation format. It seems to me that the simplest way to think about this is that if Julia is trying to be compatible with JavaScript's serialisation format, then it should handle anything that JavaScript can handle.

I can't imagine any well-formed "real" data will have them.

A serialisation/deserialisation mechanism should be tested for corner cases as well as "expected likely real data". Anything that can be deserialised to Julia, then reserialised to JSON without changing it, should be handled. If Julia had no consistent way of representing U+D800 in a String, then there would be an argument for throwing an exception. However, it seems that @StefanKarpinski has gone to quite some effort to handle this kind of thing in the least-worst way, so we should take advantage of that.

See http://seriot.ch/parsing_json.php, under "Escaped Invalid Characters":

"In other words, parsers MUST parse u-escaped invalid codepoints, but the result is undefined,"

See also https://github.com/nst/JSONTestSuite Note that for the surrogate_U+D800 test:

TotalVerb commented 6 years ago

I will reopen this for additional discussion. I am still not convinced that creating invalid data by default is a safe policy. Perhaps we can default to replacing them with U+FFFD?

StefanKarpinski commented 6 years ago

What we've learned the hard way – i.e. by trying it and finding that it causes rampant problems – is that you should not have APIs where errors are unavoidably thrown due to bad data beyond your control. In such cases, the only solution is to use try/catch which is generally not considered good Julia code, especially in contexts where performance matters. So your other options here are:

  1. Accept invalid data without throwing an error and allow the user to check for it after producing a result whether the data was valid or not.
  2. Allow the user to check for invalid data in advance, handle invalid data somehow, and then get a result only once they're sure everything is valid.

The current situation in the JSON library seems to be neither of these and to fall very much into the "failing due to bad data beyond the user's control" camp, which is not good. In this case I can't see any reason not to treat invalid data the same way the new String code does: allow it, let the user check for it if they want to and for parsing purposes, simply consider invalid data to be "not equal" to any of the characters that JSON cares about. The most obvious way to encode unpaired UTF-16 surrogates as "UTF-8" in a String value is as WTF-8. I.e. encode the lone surrogate code unit according to the UTF-8 scheme; this results in invalid UTF-8, of course, but you got invalid UTF-16 in the first place. If the end-user asks if the string or character is valid, they'll get false (as they should). After I address https://github.com/JuliaLang/julia/issues/25452, the user will get an InvalidCharError if they try to convert the corresponding character to a code point value.

TotalVerb commented 6 years ago

After thinking it over, WTF-8 is not too terrible, although I would still eventually like this to be customizable through a StringContext (certainly I would like the ability to enforce validity without having to recursively descend through the parsed object).

@samoconnor would you be able to rebase the patch above without commenting out the performance optimization specialization, and submit it as a PR?

StefanKarpinski commented 6 years ago

certainly I would like the ability to enforce validity without having to recursively descend through the parsed object

What's so bad about recursively descending through the parsed object? If you want to find out if anything is invalid, that's easy and efficient to implement as a recursive function. If you want to replace invalid data with replacement characters or replace it with nothing, that's also easy to implement and the returned structure can share all objects that are fully valid without any copying, so unless you're expecting many objects to be invalid, it'll be efficient enough.
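Such a recursive check is indeed short to write. A hedged sketch, assuming the parsed value consists of the usual Dict/Vector/String/number types that JSON.parse returns (allvalid is a hypothetical name):

```julia
# Post-parse recursive validity check over a parsed JSON value tree.
allvalid(s::AbstractString) = isvalid(s)
allvalid(d::AbstractDict)   = all(allvalid(k) && allvalid(v) for (k, v) in d)
allvalid(a::AbstractVector) = all(allvalid, a)
allvalid(x)                 = true   # numbers, booleans, nothing

allvalid(Dict("ok" => Any[1, "fine"], "bad" => "foo\ud800bar"))  # false
```

Replacing invalid strings with U+FFFD-substituted copies could be written the same way, returning new containers only along paths that contain invalid data and sharing everything else.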

samoconnor commented 6 years ago

Hi @TotalVerb, that patch is just a proof-of-concept that shows that a simple hack can make the parser accept the test input. I don't know if it is suitable for production. e.g. it comments out most of specialized.jl. You'll need someone who understands the internals of JSON.jl to implement a real fix.

Something to consider: There might be advantages to adding a JSONString <: AbstractString struct that stores a raw JSON string (without decoding the escape sequences). This would defer the issue of validity to the point where the string is accessed. It might also have performance benefits for large input where only a few items are accessed.
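A hypothetical skeleton of that idea, delegating the AbstractString interface to the raw bytes (real lazy unescaping of \uXXXX sequences would go in iterate, omitted here):

```julia
# Sketch only: wrap the raw JSON string bytes, escape sequences intact.
struct JSONString <: AbstractString
    raw::String  # bytes between the quotes, escapes not yet decoded
end

Base.ncodeunits(s::JSONString)            = ncodeunits(s.raw)
Base.codeunit(s::JSONString)              = codeunit(s.raw)
Base.codeunit(s::JSONString, i::Integer)  = codeunit(s.raw, i)
Base.isvalid(s::JSONString, i::Integer)   = isvalid(s.raw, i)
Base.iterate(s::JSONString, i::Integer=1) = iterate(s.raw, i)  # TODO: decode \uXXXX lazily
```

With those methods defined, generic AbstractString operations (length, ==, collect, etc.) work on the wrapper without copying or decoding up front.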

TotalVerb commented 6 years ago

@StefanKarpinski The direction that JSON.jl has moved in lately is to allow greater customization of the parser through the ParserContext object, see https://github.com/JuliaIO/JSON.jl/pull/224 and https://github.com/JuliaIO/JSON.jl/pull/228. Any checks that make more sense to do at parse-time (such as whether an escape sequence is valid UTF-8) can be moved there to avoid needing to re-process the entire JSON object, as well as to be able to "short circuit" if bad data for a particular application is encountered early. I'm fine with making JSON.jl accepting by default, with ParserContext (probably through an owned StringContext) object controlling how the escape sequence is read.

@samoconnor I may have time later this week to look into it, and perhaps get started on ParserContext expansion at the same time. Thanks for the issue report and discussion!

samoconnor commented 6 years ago

Note also https://tools.ietf.org/html/rfc8259#section-9: "A JSON parser MUST accept all texts that conform to the JSON grammar." And the grammar is:

string = quotation-mark *char quotation-mark
char = unescaped / escape ( ... / %x75 4HEXDIG )  ;  uXXXX 
escape = %x5C              ; \

So there is really no choice for a compliant parser. You MUST accept the text irrespective of the higher-level grouping or meaning of \uXXXX sequences.

samoconnor commented 6 years ago

Something to consider: There might be advantages to adding a JSONString <: AbstractString struct that stores a raw JSON string (without decoding the escape sequences). This would defer the issue of validity to the point where the string is accessed. It might also have performance benefits for large input where only a few items are accessed.

FYI, this got me thinking enough to put together some proof of concept code: https://github.com/samoconnor/jsonhack

quinnj commented 6 years ago

I think that kind of implementation would be pretty easy to integrate w/ JSON2.jl

samoconnor commented 6 years ago

Implementation of JSON.String <: AbstractString: https://discourse.julialang.org/t/what-is-the-interface-of-abstractstring/8937

samoconnor commented 6 years ago

This file implements Julia's AbstractString interface for JSON strings in a way that: 1) avoids dealing with unescaping until characters are needed, 2) is accepting of "invalid" UTF-16 sequences: https://github.com/samoconnor/LazyJSON.jl/blob/master/src/AbstractString.jl

Feel free to borrow ideas/snippets/whole-files for use in JSON.jl.

Background info: https://discourse.julialang.org/t/announce-a-different-way-to-read-json-data-lazyjson-jl/9046