apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.52k stars 3.54k forks source link

[Python] Allow parsing more general JSON formats #22011

Open asfimport opened 5 years ago

asfimport commented 5 years ago

I have JSON data where the columnar (line-delimited) part is in a data subkey:


{
  "metadata": {"name": "block1"},
  "data" : [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}

 

 

It would be good if the arrow JSON parser could allow specifying where the columnar data is stored.

Since the metadata is also important to me it would be even better if the rest of the JSON could be returned as a Python dict with the only the specified keys parsed as arrow tables - e.g.

 


>>> block1 = json.read_json(fn, tables=['data'])
>>> block1['data']
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> block1['metadata']
{'name': 'block1'}
>>> block1
{
  "metadata": {"name": "block1"},
  "data" : pyarrow.Table
}

 

 

Reporter: Dave Hirschfeld / @dhirschfeld

Note: This issue was originally created as ARROW-5568. Please see the migration documentation for further details.

asfimport commented 5 years ago

Joris Van den Bossche / @jorisvandenbossche:

I have JSON data where the columnar (line-delimited) part is in a data subkey:

Note that the data subpart is not line delimited, but a comma-delimited JSON array. So that's a first thing that would be good to support.

Some additional resources that might be useful: in pandas there are many formats supported, called "orients", see the overview table at http://pandas.pydata.org/pandas-docs/version/0.24/user_guide/io.html#reading-json (disclaimer: I don't know how common the different formats are, so it doesn't necessarily makes sense to copy them all from pandas).

One of the formats is the JSON Table Schema (https://frictionlessdata.io/specs/table-schema/), which is a json file with a 'metadata' and 'data' top-level keys, where the 'data' then consists of comma-delimited records (so very similar in structure as what @dhirschfeld showed above).