asfimport opened this issue 3 years ago (status: Open)
Alessandro Molina / @amol-: I was able to reproduce the issue locally. Note that I get the abort/segfault only when Arrow is built in debug mode; otherwise the process seems to freeze, waiting on some thread.
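For reference, a minimal reconstruction of the `read.py` script visible in the traceback below; only the file name and the use of `read_options` are taken from the traceback, and the `block_size` value is a guess:

```python
from pyarrow import json

# Reconstructed repro: the file name comes from the traceback below,
# the block_size value is an assumption.
ro = json.ReadOptions(block_size=1 << 20)  # 1 MiB blocks (assumed)
json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
```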
This is the exception in question:
```
Traceback (most recent call last):
  File "/home/amol/ARROW/tries/read.py", line 5, in <module>
    json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
  File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)
```
In debug mode I also get these two extra errors:

```
pure virtual method called
terminate called without an active exception
```
and the backtrace I could get from gdb looks like this:
```
#4  0x00007ffff39a5567 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff39a62e5 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff5ed13f0 in arrow::json::ChunkedStructArrayBuilder::InsertChildren (this=0xb89ae0, block_index=0,
    unconverted=...) at src/arrow/json/chunked_builder.cc:396
#7  0x00007ffff5ed0321 in arrow::json::ChunkedStructArrayBuilder::Insert (this=0xb89ae0, block_index=0,
    unconverted=std::shared_ptr<arrow::Array> (use count 1, weak count 0) = {...})
    at src/arrow/json/chunked_builder.cc:320
#8  0x00007ffff5f2ba61 in arrow::json::TableReaderImpl::ParseAndInsert (this=0xc489b0,
    partial=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...},
    completion=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...},
    whole=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, block_index=0)
    at src/arrow/json/reader.cc:158
#9  0x00007ffff5f2a331 in arrow::json::TableReaderImpl::Read()::{lambda()#1}::operator()() const (__closure=0xca6cb8)
    at src/arrow/json/reader.cc:104
...
```
Jacob Wujciak / @assignUser: The same exception still happens in pyarrow 7.0.0.
Hello,
I have a big JSON file (~300MB) with complex records (nested JSON objects, nested lists of JSON objects). When I try to read it with pyarrow I get a segmentation fault. I then tried a couple of things with the read options; please see the code below (I developed this code against an example file that was attached here: https://github.com/apache/arrow/issues/25674):
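A sketch of the kind of call described, assuming an NDJSON file read with an explicit `block_size` (the file name and the size value below are placeholders):

```python
import pyarrow.json as pj

# Placeholder file name and block size; the real file is ~300MB
# with nested structs and lists of structs.
ro = pj.ReadOptions(block_size=1 << 20)  # 1 MiB blocks
table = pj.read_json('big_file.ndjson', read_options=ro)
```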
For both the example file and my file, this code raises the straddling-object exception (or segfaults) once reading crosses the block_size boundary. Increasing the block_size only makes the code fail later.
I then tried passing an explicit schema for my file:
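A sketch of that approach; `explicit_schema` is the actual `pyarrow.json.ParseOptions` parameter, but the field names and nested types below are made up to stand in for the real ones:

```python
import pyarrow as pa
import pyarrow.json as pj

# Placeholder schema; the real file has nested JSON objects
# and nested lists of objects.
schema = pa.schema([
    ('id', pa.int64()),
    ('payload', pa.struct([
        ('name', pa.string()),
        ('values', pa.list_(pa.float64())),
    ])),
])
po = pj.ParseOptions(explicit_schema=schema)
table = pj.read_json('big_file.ndjson', parse_options=po)
```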
This works, which may suggest that this issue, and the one in the linked JIRA ticket, only appear when an explicit schema is not provided. Additionally, the following code works as well:
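A sketch of that variant, with a placeholder block size chosen to exceed the whole ~300MB file:

```python
import pyarrow.json as pj

# Placeholder block size larger than the entire file, so the reader
# never has to split the input into multiple blocks.
ro = pj.ReadOptions(block_size=512 * 1024 * 1024)  # 512 MiB
table = pj.read_json('big_file.ndjson', read_options=ro)
```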
The block_size is bigger than my file in this case. Is it possible that the schema is inferred from the first block, and that I get a segfault when the schema changes in a later block?
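If that hypothesis is right, something like the following synthetic file might trigger it. This is an untested sketch; the field names, record counts, and block size are all assumptions:

```python
import pyarrow.json as pj

# Two "shapes" of record; the second shape only appears later in the
# file, i.e. in a later block under a small block_size.
with open('schema_change.ndjson', 'w') as f:
    for i in range(1000):
        f.write('{"a": %d}\n' % i)
    for i in range(1000):
        f.write('{"a": %d, "b": {"c": %d}}\n' % (i, i))

# A small block_size forces multiple blocks, so inference from the
# first block never sees the later "b" field.
ro = pj.ReadOptions(block_size=4096)
table = pj.read_json('schema_change.ndjson', read_options=ro)
```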
I cannot share my JSON file; however, I hope someone can shed some light on what I am seeing and maybe suggest a workaround.
Thank you, Guido
Reporter: Guido Muscioni
Note: This issue was originally created as ARROW-13314. Please see the migration documentation for further details.