apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] JSON table writer #21529

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Users who need to emit JSON in line-delimited format currently cannot do so using Arrow. It should be straightforward to implement this efficiently, and it would be very helpful for testing and benchmarking.

Reporter: Ben Kietzman / @bkietz

Note: This issue was originally created as ARROW-5033. Please see the migration documentation for further details.

asfimport commented 5 years ago

Wes McKinney / @wesm: As something to keep in mind, we will need to implement a "Sink" node type to be the flip side of "Scan" in a query engine context. The user may wish to output the results of a query directly to CSV, JSON, Parquet, or some other dataset format, so we need to develop a common API that writers like this one can hook into.

asfimport commented 2 years ago

Nicola Crane / @thisisnic: User request on StackOverflow for this feature to be implemented: https://stackoverflow.com/questions/71047976/fast-ldjson-writing-with-arrow

asfimport commented 2 years ago

Weston Pace / @westonpace: I came across a helpful GitHub issue today that explains that there are actually several standards for line-delimited JSON and goes over the differences a bit. This might be a helpful reference when this gets implemented: https://github.com/ndjson/ndjson.github.io/issues/1

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

asfimport commented 2 years ago

Steve M. Kim: As part of this feature request, do we contemplate generating a JSON Schema from an Arrow table schema? Given an Arrow schema and record batches, it would be useful to get a JSON Schema and a sequence of JSON objects that conform to that schema. This would also facilitate testing the correctness of the Arrow JSON writer.
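As a rough illustration of the idea, the sketch below maps a simplified table schema to a JSON Schema for one row object. It is not an Arrow API: `FieldType`, `Field`, `JsonSchemaTypeName`, and `ToJsonSchema` are hypothetical names, only four primitive types are covered, and field names are assumed to need no JSON escaping. It also assumes the writer emits every key on every row, with nulls written as JSON `null`.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified field types; a real Arrow schema has many more.
enum class FieldType { kInt64, kDouble, kString, kBool };

struct Field {
  std::string name;  // assumed to need no JSON escaping
  FieldType type;
  bool nullable;
};

// Map a field type to the name of a JSON Schema primitive type.
std::string JsonSchemaTypeName(FieldType type) {
  switch (type) {
    case FieldType::kInt64: return "integer";
    case FieldType::kDouble: return "number";
    case FieldType::kString: return "string";
    case FieldType::kBool: return "boolean";
  }
  return "null";  // not reached for valid enum values
}

// Build a JSON Schema describing one row object.
std::string ToJsonSchema(const std::vector<Field>& fields) {
  std::string props;
  std::string required;
  for (const auto& f : fields) {
    if (!props.empty()) {
      props += ',';
      required += ',';
    }
    std::string type = "\"" + JsonSchemaTypeName(f.type) + "\"";
    // Nullable fields accept either the primitive type or null.
    if (f.nullable) type = "[" + type + ",\"null\"]";
    props += "\"" + f.name + "\":{\"type\":" + type + "}";
    required += "\"" + f.name + "\"";
  }
  std::string schema = "{\"type\":\"object\",\"properties\":{" + props + "}";
  // Omit "required" entirely when there are no fields.
  if (!required.empty()) schema += ",\"required\":[" + required + "]";
  return schema + "}";
}
```

Nested, temporal, decimal, and dictionary types have no obvious single mapping, which is where most of the design questions would sit.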

asfimport commented 2 years ago

David Li / @lidavidm: That's a new can of worms :) There's been some discussion about a way to represent Arrow schemas in JSON. See https://github.com/apache/arrow/issues/13803 and https://github.com/apache/arrow/pull/7110 and ARROW-8952.