JuliaIO / JSON.jl

JSON parsing and printing
Other
313 stars 101 forks source link

Reading past end of JSON data in a stream #239

Open ScottPJones opened 6 years ago

ScottPJones commented 6 years ago

While investigating an issue with my changes for better performance and to fix numeric parsing issues, I realized that parsing a number can "eat" the next character in a stream. The way that the async.jl unit test works, by writing out a stream of JSON values, and then reading them on another process, breaks if there is not another character to separate the values, and just a number is being read. I attempted to fix this using Base.peek, however on v0.6.2, it doesn't work for the type TCPStream. I'm not sure what the behavior of a similar test in JavaScript or Python would be.

TotalVerb commented 6 years ago

We should put back any characters we are eating. Is there a reason peek doesn't work on TCPStream?

ScottPJones commented 6 years ago

For one thing, peek isn't even exported, and it is not implemented for TCPStream in v0.6.2. I haven't tried yet on master.

(The peek implementation looked like it might hurt performance a lot, also, at least one of the methods, that does a try / finally and saves / restores the position in the stream, after reading a single byte) A better approach in that case would be to only restore the position in the case of reading a JSON number when finding something that isn't a digit (after a -, ., e, E or + (in an exponent), one or more digits are always required).

TotalVerb commented 6 years ago

What I mean is, is there a reason why peek is not implemented (in Base), as it seems this would be a really useful feature overall?

ScottPJones commented 6 years ago

I'm not sure why it was left unexported in master, but at least there it is better supported (instead of just 2 methods, it has 7, including peek(s::Base.LibuvStream) in Base at stream.jl:1182, which should handle TCPSocket. (I misremembered the name before, it's TCPSocket instead of TCPStream)

samoconnor commented 6 years ago

doesn't work for the type TCPStream.

You'll likely also run into trouble with wrappers like SSLContext (almost every HTTP connection is a HTTPS connection nowdays).

To me having a peek that only returns one character is a bit arbitrary. I think it would be more useful to have a peek that returns a view of however many bytes are available in whatever buffer the IO implementation has. e.g. it would be handy to peek in the buffer for presence of e.g. \r\n\r\n before reading.

ScottPJones commented 6 years ago

One character (really, 1 byte) look-ahead is all that the JSON format needs, and that is only for numbers (everything else is terminated by a byte that is known in advance, i.e. '}', ']', ':'or '"'). Having a function that could read the entire buffer of ready to be read bytes would also be quite useful, for other things.

peteristhegreat commented 3 years ago

I hit this bug today while accidentally reading some SOAP responses. Having the response wiped while doing a JSON read feels like a bug. In this example the original string is preserved, but in my debugging session, I was dealing with the array output initially not a string.

julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"

julia> array = Vector{UInt8}(body)
28-element Array{UInt8,1}:
 0x3c
 0x58
 0x4d
 0x4c
 0x3e
 0x54
 0x68
 0x69
 0x73
 0x20
 0x69
 0x73
 0x20
 0x6e
 0x6f
 0x74
 0x20
 0x4a
 0x53
 0x4f
 0x4e
 0x3c
 0x2f
 0x58
 0x4d
 0x4c
 0x3e
 0x0a

julia> JSON.parse(String(array))
ERROR: Unexpected character
Line: 0
Around: ...<XML>This is not JSON<...
            ^

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] _error(::String, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:140
 [3] parse_jsconstant(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:193
 [4] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:170
 [5] #parse#1(::Type, ::Type{Int64}, ::Bool, ::Nothing, ::typeof(JSON.Parser.parse), ::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:460
 [6] parse(::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:458
 [7] top-level scope at REPL[19]:1

julia> array
0-element Array{UInt8,1}

julia> body
"<XML>This is not JSON</XML>\n"

Basically, reading off the end of the array causes the array to get dropped. I would rather it not drop the array.

It looks like this thread here is related.

https://github.com/JuliaLang/julia/issues/24388

KristofferC commented 3 years ago

How is this related to JSON.jl?

julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"

julia> array = Vector{UInt8}(body);

julia> String(array);

julia> array
UInt8[]