[Python] Allow parsing more general JSON formats

apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

Apache License 2.0

I have JSON data where the columnar (line-delimited) part is in a data subkey:


{
  "metadata": {"name": "block1"},
  "data" : [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}

It would be good if the Arrow JSON parser allowed specifying where the columnar data is stored.

Since the metadata is also important to me, it would be even better if the rest of the JSON could be returned as a Python dict, with only the specified keys parsed as Arrow tables, e.g.:


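>>> from pyarrow import json
>>> # `tables=` is the argument proposed in this issue; it does not exist today
>>> block1 = json.read_json(fn, tables=['data'])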
>>> block1['data']
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> block1['metadata']
{'name': 'block1'}
>>> block1
{
  "metadata": {"name": "block1"},
  "data" : pyarrow.Table
}
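
Until something like `tables=` exists, roughly the same result can be had in user code. The following is only a minimal sketch, not a pyarrow API: `read_json_with_tables` and `table_factory` are hypothetical names made up for illustration. The whole file is parsed with the standard-library `json` module, so it does not benefit from Arrow's own JSON reader, and the default factory assumes a pyarrow version that provides `pa.Table.from_pylist`.

```python
import json
from typing import Any, Callable, Iterable, Optional


def _to_arrow_table(records: list) -> Any:
    # Imported lazily so the helper still works with a custom factory
    # when pyarrow is not installed.
    import pyarrow as pa

    # Assumption: a pyarrow version that provides Table.from_pylist.
    return pa.Table.from_pylist(records)


def read_json_with_tables(
    path: str,
    tables: Iterable[str],
    table_factory: Optional[Callable[[list], Any]] = None,
) -> dict:
    """Hypothetical helper: load a JSON object and turn the selected
    top-level keys into tables, leaving all other keys as plain Python."""
    if table_factory is None:
        table_factory = _to_arrow_table
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the top level")
    for key in tables:
        # Each selected key must hold a list of records (dicts).
        doc[key] = table_factory(doc[key])
    return doc
```

With the example file above, `read_json_with_tables(fn, ['data'])` would return a dict whose `'metadata'` entry is a plain dict and whose `'data'` entry is whatever the factory returns (a `pyarrow.Table` by default).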

Reporter: Dave Hirschfeld / @dhirschfeld

Note: This issue was originally created as ARROW-5568. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:

> I have JSON data where the columnar (line-delimited) part is in a data subkey:

Note that the data subpart is not line-delimited, but a comma-delimited JSON array. So that's the first thing that would be good to support.
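
To make the difference concrete: a comma-delimited array can be re-serialized as line-delimited JSON, the layout the existing reader expects. This is a sketch using only the standard library; `json_array_to_ndjson` is a hypothetical helper name, and it adds a full extra parse/serialize round trip, so it is a workaround rather than a substitute for native support.

```python
import io
import json


def json_array_to_ndjson(records):
    """Serialize a list of records as newline-delimited JSON bytes.

    json.dumps without `indent` never emits a raw newline (newlines inside
    strings are escaped), so each record ends up on exactly one line.
    """
    return "".join(json.dumps(record) + "\n" for record in records).encode("utf-8")


doc = json.loads("""
{
  "metadata": {"name": "block1"},
  "data": [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}
""")

# File-like buffer holding one JSON object per line.
buf = io.BytesIO(json_array_to_ndjson(doc["data"]))
```

The resulting buffer could then be handed to `pyarrow.json.read_json(buf)`, which accepts a file-like object as well as a path.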

Some additional resources that might be useful: pandas supports many formats, called "orients"; see the overview table at http://pandas.pydata.org/pandas-docs/version/0.24/user_guide/io.html#reading-json (disclaimer: I don't know how common the different formats are, so it doesn't necessarily make sense to copy them all from pandas).
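
For reference, here is a rough sketch of how three of those orients (`records`, `split`, `columns`) map onto a single column-oriented dict, the shape a columnar table is usually built from. `to_columns` is a hypothetical helper, not part of pandas or pyarrow, and for `columns` it assumes every column lists the same index keys in the same order.

```python
def to_columns(payload, orient):
    """Normalize a parsed JSON payload into {column_name: [values, ...]}."""
    if orient == "records":
        # [{"a": 1, "b": 2.0}, ...]: collect column names in first-seen order.
        names = []
        for row in payload:
            for name in row:
                if name not in names:
                    names.append(name)
        # Missing keys become None.
        return {name: [row.get(name) for row in payload] for name in names}
    if orient == "split":
        # {"columns": [...], "index": [...], "data": [[...], ...]}
        return {
            name: [row[i] for row in payload["data"]]
            for i, name in enumerate(payload["columns"])
        }
    if orient == "columns":
        # {"a": {"0": 1, "1": 4}, ...}: assumes aligned index keys per column.
        return {name: list(by_index.values()) for name, by_index in payload.items()}
    raise ValueError(f"unsupported orient: {orient!r}")
```

A dict in this shape can be passed to a constructor such as `pyarrow.Table.from_pydict`.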

One of the formats is the JSON Table Schema (https://frictionlessdata.io/specs/table-schema/), which is a JSON file with 'schema' (the metadata) and 'data' top-level keys, where 'data' consists of comma-delimited records (so very similar in structure to what @dhirschfeld showed above).

[Python] Allow parsing more general JSON formats #22011