apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] pyarrow.json.read_json reports an error when reading an indented JSON file #40912

Open FlyTOmeLight opened 5 months ago

FlyTOmeLight commented 5 months ago

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow version: 14.0.2

Code to reproduce:

    import pyarrow.json as pajson
    pajson.read_json("indent.json")

Error:

    pajson.read_json("indent.json")
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to string in row 0

I wrote indent.json with json.dump(raw_data, fp, ensure_ascii=False, indent=4). When I then read it with pajson.read_json, the error above is raised. I would like to know how to fix it. Here is the JSON file that fails: wrong.json

Component(s)

Python

martsec commented 4 months ago

As far as I am aware, Arrow only supports reading line-delimited JSON files (see the docs and the note there).
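
If the file can be regenerated, one way around this is to write each record on its own line instead of using indent=4. The following is a minimal sketch using only the standard library. The helper name to_json_lines and the output file name are made up for illustration, and it assumes the indented file holds either a single object or a list of objects.

```python
import json


def to_json_lines(src_path, dst_path):
    """Rewrite a pretty-printed JSON file as line-delimited JSON.

    Hypothetical helper, not part of pyarrow. Returns the number of
    records written.
    """
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)

    # Assumption: a top-level list means many records, anything else is one.
    records = data if isinstance(data, list) else [data]

    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # No indent here, so each record stays on a single line.
            dst.write(json.dumps(record, ensure_ascii=False))
            dst.write("\n")
    return len(records)
```

After calling to_json_lines("indent.json", "lines.jsonl"), the second file is line-delimited and can be passed to pajson.read_json in the same way as in the original report.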

There also seem to be a couple of options that could help with reading your JSON: https://arrow.apache.org/docs/python/generated/pyarrow.json.ParseOptions.html#pyarrow.json.ParseOptions

newlines_in_values : bool, optional (default False)

Whether objects may be printed across multiple lines (for example pretty printed). If false, input must end with an empty line.
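
A rough sketch of how that option could be passed is shown below. The wrapper name read_pretty_json is made up for illustration; ParseOptions and the parse_options argument come from the pyarrow.json documentation linked above. It assumes the file holds one or more top-level JSON objects. A file whose top level is a list may still be rejected, so treat this as something to try rather than a confirmed fix.

```python
def read_pretty_json(path):
    """Read a pretty-printed JSON file into a pyarrow Table.

    Hypothetical wrapper around pyarrow.json.read_json.
    """
    # Imported inside the function so the helper can be defined
    # even where pyarrow is not installed.
    import pyarrow.json as pajson

    # newlines_in_values=True allows one object to span several lines.
    options = pajson.ParseOptions(newlines_in_values=True)
    # str() so that pathlib.Path inputs are accepted as well.
    return pajson.read_json(str(path), parse_options=options)
```

With the file from the report this would be called as read_pretty_json("indent.json").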