JuliaIO / JSON.jl

JSON parsing and printing

Allow for parsing multiple JSON objects in a single string/stream #344

Open mcognetta opened 2 years ago

mcognetta commented 2 years ago

Some APIs that accept batch requests return a sequence of separate JSON objects with no delimiter between them. The boundaries can still be recovered by parsing: once one complete JSON object has been parsed, the next non-whitespace character starts the next object.

For example, you might see a string like {"name":"Marco"} {"name":"Julia"}, representing two distinct JSON objects.

Currently, JSON.jl does not parse this correctly. It errors for the string case, and only parses the first object in the streaming case (without any indication that the stream was not exhausted).

julia> s = "{\"name\":\"Marco\"} {\"name\":\"Julia\"}"
"{\"name\":\"Marco\"} {\"name\":\"Julia\"}"

julia> JSON.parse(s)
ERROR: Expected end of input
Line: 0
Around: ...{"name":"Marco"} {"name":"Julia"}...
                            ^

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] _error(message::String, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/packages/JSON/QXB8U/src/Parser.jl:140
 [3] parse(str::String; dicttype::Type, inttype::Type{Int64}, allownan::Bool, null::Nothing)
   @ JSON.Parser ~/.julia/packages/JSON/QXB8U/src/Parser.jl:453
 [4] parse(str::String)
   @ JSON.Parser ~/.julia/packages/JSON/QXB8U/src/Parser.jl:448
 [5] top-level scope
   @ REPL[8]:1

julia> JSON.parse(IOBuffer(s))
Dict{String, Any} with 1 entry:
  "name" => "Marco"

Under the assumption that all JSON objects in the string have the same dicttype, I believe this can be extended to return a list of parsed objects. My first attempt is:

function parsemany(str::AbstractString;
               dicttype=Dict{String,Any},
               inttype::Type{<:Real}=Int64,
               allownan::Bool=true,
               null=nothing)
    out = Vector{dicttype}()
    pc = _get_parsercontext(dicttype, inttype, allownan, null)
    ps = MemoryParserState(str, 1)
    v = parse_value(pc, ps)
    push!(out, v)
    chomp_space!(ps)
    while hasmore(ps)
        pc = _get_parsercontext(dicttype, inttype, allownan, null)
        v = parse_value(pc, ps)
        push!(out, v)
        chomp_space!(ps)
    end
    out
end

Example:

julia> JSON.parsemany(s)
2-element Vector{Dict{String, Any}}:
 Dict("name" => "Marco")
 Dict("name" => "Julia")

# correctly errors on a malformed JSON object
julia> JSON.parsemany(s[1:end-1])
ERROR: Unexpected end of input
Line: 0
Around: ...":"Julia"...
                    ^

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] _error(message::String, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:140
 [3] byteat
   @ ~/.julia/dev/JSON/src/Parser.jl:49 [inlined]
 [4] parse_object(pc::JSON.Parser.ParserContext{Dict{String, Any}, Int64, true, nothing}, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:233
 [5] parse_value(pc::JSON.Parser.ParserContext{Dict{String, Any}, Int64, true, nothing}, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:166
 [6] parsemany(str::String; dicttype::Type, inttype::Type{Int64}, allownan::Bool, null::Nothing)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:472
 [7] parsemany(str::String)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:464
 [8] top-level scope
   @ REPL[10]:1

# notice the first object is not properly closed
julia> s = "{\"name\":\"Marco\" {\"name\":\"Julia\"}"
"{\"name\":\"Marco\" {\"name\":\"Julia\"}"

# fails to parse
julia> JSON.parsemany(s)
ERROR: Expected ',' here
Line: 0
Around: ...{"name":"Marco" {"name":"Julia"}...
                           ^

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] _error(message::String, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:140
 [3] _error_expected_char(c::UInt8, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:83
 [4] skip!
   @ ~/.julia/dev/JSON/src/Parser.jl:80 [inlined]
 [5] parse_object(pc::JSON.Parser.ParserContext{Dict{String, Any}, Int64, true, nothing}, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:234
 [6] parse_value(pc::JSON.Parser.ParserContext{Dict{String, Any}, Int64, true, nothing}, ps::JSON.Parser.MemoryParserState)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:166
 [7] parsemany(str::String; dicttype::Type, inttype::Type{Int64}, allownan::Bool, null::Nothing)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:467
 [8] parsemany(str::String)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:464
 [9] top-level scope
   @ REPL[16]:1

# note the second one is not properly opened
julia> s = "{\"name\":\"Marco\"} \"name\":\"Julia\"}"
"{\"name\":\"Marco\"} \"name\":\"Julia\"}"

# fails, though this case should have a better error message in the final version
julia> JSON.parsemany(s)
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Dict{String, Any}
Closest candidates are:
  convert(::Type{T}, ::T) where T<:AbstractDict at abstractdict.jl:520
  convert(::Type{T}, ::AbstractDict) where T<:AbstractDict at abstractdict.jl:522
  convert(::Type{T}, ::T) where T at essentials.jl:205
  ...
Stacktrace:
 [1] push!(a::Vector{Dict{String, Any}}, item::String)
   @ Base ./array.jl:932
 [2] parsemany(str::String; dicttype::Type, inttype::Type{Int64}, allownan::Bool, null::Nothing)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:473
 [3] parsemany(str::String)
   @ JSON.Parser ~/.julia/dev/JSON/src/Parser.jl:464
 [4] top-level scope
   @ REPL[18]:1
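
One way to get a clearer message for this case would be to check the type of each parsed value before pushing it. The following is only a sketch: `push_checked!` is a hypothetical helper name, not part of JSON.jl, and it uses nothing beyond Base.

```julia
# Hypothetical helper (not part of JSON.jl): push `v` onto `out`, but fail with
# a readable error when `v` is not of the expected element type.
function push_checked!(out::Vector{D}, v) where {D}
    v isa D || throw(ArgumentError(
        "expected a top-level JSON object of type $D, got a value of type $(typeof(v))"))
    push!(out, v)
    return out
end

out = Vector{Dict{String,Any}}()
push_checked!(out, Dict{String,Any}("name" => "Marco"))  # accepted

# A bare string, as produced by the malformed input above, is rejected with an
# ArgumentError instead of a MethodError from `convert`.
err = try
    push_checked!(out, "name")
    nothing
catch e
    e
end
```

In `parsemany`, the two `push!(out, v)` calls could be replaced by such a check, so the user sees which kind of value was found where an object was expected.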

# works even with no space
julia> s = "{\"name\":\"Marco\"}{\"name\":\"Julia\"}"
"{\"name\":\"Marco\"}{\"name\":\"Julia\"}"

julia> JSON.parsemany(s)
2-element Vector{Dict{String, Any}}:
 Dict("name" => "Marco")
 Dict("name" => "Julia")

Is this an acceptable addition to JSON.jl? One argument in its favor is that, while users could split the string themselves, doing so is basically the same as writing a JSON parser: they have to correctly handle all of the edge cases (nesting, etc.) in order to determine where the outermost opening and closing brackets are. Without access to the internal helper methods of JSON.jl, this is a big ask.

mcognetta commented 2 years ago

I have noticed that wrapping multiple JSON objects in [ ] with a comma separator causes this to parse correctly:

julia> s = "[{\"name\":\"Marco\"}, {\"name\":\"Julia\"}]"
"[{\"name\":\"Marco\"}, {\"name\":\"Julia\"}]"

julia> JSON.parse(s)
2-element Vector{Any}:
 Dict{String, Any}("name" => "Marco")
 Dict{String, Any}("name" => "Julia")

This still leaves the problem of converting a non-delimited, multiple-object JSON string into a comma-separated one that can be wrapped in brackets.
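
To illustrate what that conversion involves, here is a rough sketch of a splitter written with Base only. `split_toplevel` is a hypothetical name, not part of JSON.jl. It assumes every top-level value is an object or array, skips anything else at the top level, and does not check that `{` is closed by `}` rather than `]`, so it is far from a full parser.

```julia
# Hypothetical helper (not part of JSON.jl): return the substrings of `s` that
# form top-level JSON objects or arrays, by tracking bracket depth and strings.
function split_toplevel(s::AbstractString)
    chunks = String[]
    depth = 0          # current nesting depth of {} and []
    in_string = false  # are we inside a JSON string literal?
    escaped = false    # was the previous character a backslash inside a string?
    start = 0          # index where the current top-level value began
    for (i, c) in pairs(s)
        if in_string
            if escaped
                escaped = false
            elseif c == '\\'
                escaped = true
            elseif c == '"'
                in_string = false
            end
        elseif c == '"'
            in_string = true
        elseif c == '{' || c == '['
            depth == 0 && (start = i)
            depth += 1
        elseif c == '}' || c == ']'
            depth -= 1
            depth < 0 && throw(ArgumentError("unbalanced closing bracket at index $i"))
            depth == 0 && push!(chunks, s[start:i])
        end
    end
    (depth != 0 || in_string) &&
        throw(ArgumentError("input ends inside an unfinished value"))
    return chunks
end

s = "{\"name\":\"Marco\"} {\"name\":\"Julia\"}"
# Join the pieces with commas and wrap them in brackets to get one JSON array.
wrapped = "[" * join(split_toplevel(s), ",") * "]"
```

Even this small version has to handle braces inside strings and escaped quotes, which is the point made earlier about users ending up writing part of a parser themselves.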

mcognetta commented 2 years ago

Sorry, one more thing. It works if you repeatedly read from a stream:

julia> s = "{\"name\":\"Marco\"} {\"name\":\"Julia\"}"
"{\"name\":\"Marco\"} {\"name\":\"Julia\"}"

julia> stream = IOBuffer(s)
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=33, maxsize=Inf, ptr=1, mark=-1)

julia> JSON.parse(stream)
Dict{String, Any} with 1 entry:
  "name" => "Marco"

julia> JSON.parse(stream)
Dict{String, Any} with 1 entry:
  "name" => "Julia"

I will open a PR to add an example like this to the docs.
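
The repeated reads above can be wrapped in a loop. This is a minimal sketch, assuming the behavior shown in this thread, where `JSON.parse(io)` stops right after one complete object and leaves the rest of the stream unread; other JSON.jl versions may behave differently. `parse_stream_all` is a hypothetical name, not part of JSON.jl, and the sketch assumes the top-level values are objects, as in the examples here.

```julia
using JSON

# Hypothetical helper (not part of JSON.jl): parse values from `io` one at a
# time until only whitespace (or nothing) is left in the stream.
function parse_stream_all(io::IO; kwargs...)
    out = Any[]
    skipchars(isspace, io)      # skip leading whitespace
    while !eof(io)
        push!(out, JSON.parse(io; kwargs...))
        skipchars(isspace, io)  # skip whitespace between objects
    end
    return out
end

objs = parse_stream_all(IOBuffer("{\"name\":\"Marco\"} {\"name\":\"Julia\"}"))
```

Skipping whitespace before each `eof` check keeps trailing spaces or newlines from triggering one extra parse attempt on an effectively empty stream.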