JuliaData / JSONTables.jl

JSON3.jl + Tables.jl
MIT License

Unexpected result on heterogeneous data #17

Closed laborg closed 3 years ago

laborg commented 4 years ago

I think it would be good to have a better story for heterogeneous data. Both of the following results (which are generated from the same data but where entries are ordered differently) are surprising and can cause problems.

julia> using JSONTables, DataFrames

julia> json_a = """[
       {"timea": 1585154193000, "troublemaker": 97},
       {"timea": 1310044361000}
       ]""";

julia> json_b = """[
       {"timea": 1310044361000},
       {"timea": 1585154193000,"troublemaker": 97}
       ]""";

julia> DataFrame(jsontable(json_a)) # throws error
ERROR: KeyError: key :troublemaker not found
Stacktrace:
 [1] get(::JSON3.Object{Base.CodeUnits{UInt8,String},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}, ::Symbol) at /home/gerhard/.julia/packages/JSON3/YGLA7/src/JSON3.jl:53
...

julia> DataFrame(jsontable(json_b)) # loses troublemaker silently
2×1 DataFrame
│ Row │ timea         │
│     │ Int64         │
├─────┼───────────────┤
│ 1   │ 1310044361000 │
│ 2   │ 1585154193000 │

What I would have expected jsontable to produce:

julia> using JSON3

julia> reduce((x, y) -> append!(x, y;cols=:union), JSON3.read(json_a);init=DataFrame())
2×2 DataFrame
│ Row │ timea         │ troublemaker │
│     │ Int64         │ Int64?       │
├─────┼───────────────┼──────────────┤
│ 1   │ 1585154193000 │ 97           │
│ 2   │ 1310044361000 │ missing      │

julia> reduce((x, y) -> append!(x, y;cols=:union), JSON3.read(json_b);init=DataFrame())
2×2 DataFrame
│ Row │ timea         │ troublemaker │
│     │ Int64         │ Int64?       │
├─────┼───────────────┼──────────────┤
│ 1   │ 1310044361000 │ missing      │
│ 2   │ 1585154193000 │ 97           │

If this is not possible or desired, the documentation should at least include a clear warning about what to expect.

Thx!

bkamins commented 4 years ago

This is exactly why we introduced the cols keyword argument to push! in DataFrames.jl. But as @laborg commented on Slack, using it is a bit cumbersome (you have to read the JSON row by row and push! each row). Using append! or vcat from DataFrames.jl does not help.

It would be good to have a better solution here. I think making the cols=:union approach the default is what users typically expect (the other values allowed for cols in DataFrames.jl are rarely needed).

quinnj commented 3 years ago

OK, I've been thinking about this on and off for a while now, along with the best way to approach a solution (in Tables.jl, maybe TableOperations.jl, or in this package). Take a look at what I came up with here: https://github.com/JuliaData/JSONTables.jl/pull/18. In short, we implement the cols=:union behavior from DataFrames by first doing a pass over the JSON data to accurately determine all the column names and types to expect when treating the JSON as a "table".
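
For illustration, the two-pass idea can be sketched in plain Julia without any packages. This is only a rough sketch under stated assumptions, not the code from the linked PR: `union_columns` is a hypothetical helper name, and rows are modelled as NamedTuples rather than parsed JSON3 objects.

```julia
# Hypothetical helper (not part of JSONTables.jl or DataFrames.jl):
# build a column table from rows that may not all share the same keys.
function union_columns(rows)
    # Pass 1: collect the union of all column names, in first-seen order.
    colnames = Symbol[]
    seen = Set{Symbol}()
    for row in rows, key in keys(row)
        if !(key in seen)
            push!(seen, key)
            push!(colnames, key)
        end
    end

    # Pass 2: build each column, inserting `missing` where a row lacks the key.
    # The comprehension widens the element type to Union{Missing, T} only
    # when a missing value actually occurs.
    columns = [[get(row, name, missing) for row in rows] for name in colnames]

    # Return a NamedTuple of vectors, one vector per column.
    return NamedTuple{Tuple(colnames)}(Tuple(columns))
end

# The two inputs from the report, modelled as NamedTuples (an assumption).
rows_a = [(timea = 1585154193000, troublemaker = 97), (timea = 1310044361000,)]
rows_b = reverse(rows_a)

table_a = union_columns(rows_a)
table_b = union_columns(rows_b)

println(table_a.troublemaker)  # 97 first, then missing
println(table_b.troublemaker)  # missing first, then 97
```

With this approach both orderings yield the same two columns, and the row order no longer decides whether `troublemaker` errors or is dropped. The cost is an extra pass over the data before any column is built.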