apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] JSON parsing segmentation fault on long records (block_size dependent) #28990

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Hello,

I have a big JSON file (~300 MB) with complex records (nested JSON objects, nested lists of JSON objects). When I try to read it with pyarrow I get a segmentation fault. I then tried a couple of things with the read options; please see the code below (I developed this code against the example file attached to https://github.com/apache/arrow/issues/25674):

    from pyarrow import json
    from pyarrow.json import ReadOptions
    import tqdm

    if __name__ == '__main__':
        source = 'wiki_04.jsonl'
        ro = ReadOptions(block_size=2**20)  # 1 MiB blocks

        # Append the input line by line and re-read after every append,
        # to find the point at which parsing starts to fail.
        with open(source, 'r') as file:
            for i, line in tqdm.tqdm(enumerate(file)):
                with open('temp_file_arrow_3.ndjson', 'a') as file2:
                    file2.write(line)
                json.read_json('temp_file_arrow_3.ndjson', read_options=ro)

For both the example file and my file, this code raises the straddling-object exception (or segfaults) once the file reaches the block_size; increasing the block_size only makes it fail later.
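
To make the block-size dependence concrete, here is a minimal synthetic sketch (my own toy data, not the files above; `io.BytesIO` stands in for a file on disk):

    import io
    from pyarrow import json
    from pyarrow.json import ReadOptions

    # One ~2 KiB record, but blocks of only 1 KiB: the record cannot
    # fit inside a single parse block, so the reader rejects it.
    record = '{"text": "' + 'x' * 2048 + '"}\n'
    ro = ReadOptions(block_size=1024)

    try:
        json.read_json(io.BytesIO(record.encode()), read_options=ro)
    except Exception as exc:  # ArrowInvalid: straddling object ...
        print(type(exc).__name__, exc)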

Then I tried providing an explicit schema for my file:


    from pyarrow import json
    from pyarrow.json import ParseOptions
    import pandas as pd
    import pyarrow as pa

    if __name__ == '__main__':
        source = 'my_file.jsonl'

        # Infer the full schema up front via pandas.
        df = pd.read_json(source, lines=True)
        table_schema = pa.Table.from_pandas(df).schema

        # Note: explicit_schema lives on ParseOptions, not ReadOptions.
        po = ParseOptions(explicit_schema=table_schema)
        table = json.read_json(source, parse_options=po)

This works, which may suggest that this issue, and the one in the linked JIRA ticket, only appear when an explicit schema is not provided (a hand-written schema should behave the same; see the sketch below).
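
For illustration, a minimal hand-written schema (hypothetical field names; the real ones must match the file):

    import pyarrow as pa
    from pyarrow import json
    from pyarrow.json import ParseOptions

    # Hypothetical fields for illustration. With an explicit schema,
    # the per-block type inference (which appears to be the crashing
    # code path) is skipped for the listed fields.
    schema = pa.schema([
        ('title', pa.string()),
        ('sections', pa.list_(pa.struct([('text', pa.string())]))),
    ])

    po = ParseOptions(explicit_schema=schema)
    table = json.read_json('my_file.jsonl', parse_options=po)

Additionally, the following code works as well: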


    from pyarrow import json
    from pyarrow.json import ReadOptions

    if __name__ == '__main__':
        source = 'my_file.jsonl'

        # A block larger than the whole file: everything is parsed in
        # one block, so no schema unification across blocks is needed.
        ro = ReadOptions(block_size=2**30)  # 1 GiB
        table = json.read_json(source, read_options=ro)

In this case the block_size is bigger than my whole file. Is it possible that the schema is inferred from the first block, and that I get a segfault when the schema changes in a later block?
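
Here is a synthetic sketch of that hypothesis (assumed toy data, since I cannot share the real file; the exact outcome, error or crash, may depend on the pyarrow version and build):

    import io
    from pyarrow import json
    from pyarrow.json import ReadOptions

    rec1 = '{"a": 1}\n' * 100        # block 1: "a" inferred as int64
    rec2 = '{"a": "one"}\n'          # next block: "a" is now a string
    data = (rec1 + rec2).encode()

    # Place the block boundary exactly after the integer records, so
    # the inferred type of "a" changes between blocks.
    ro = ReadOptions(block_size=len(rec1))

    try:
        print(json.read_json(io.BytesIO(data), read_options=ro))
    except Exception as exc:  # outcome varies by version/build
        print(type(exc).__name__, exc)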

I cannot share my JSON file; however, I hope someone can shed some light on what I am seeing and perhaps suggest a workaround.

Thank you, Guido

Reporter: Guido Muscioni


Note: This issue was originally created as ARROW-13314. Please see the migration documentation for further details.

asfimport commented 3 years ago

Alessandro Molina / @amol-: I was able to reproduce the issue locally. By the way, I seem to get the abort/segfault only when Arrow is built in debug mode; otherwise it seems to freeze waiting on some thread.

This is the mentioned exception:


    Traceback (most recent call last):
      File "/home/amol/ARROW/tries/read.py", line 5, in <module>
        json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
      File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
      File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

In debug mode I also get these two extra errors:


    pure virtual method called
    terminate called without an active exception

and the traceback I could get from gdb looks like:


    #4  0x00007ffff39a5567 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
    #5  0x00007ffff39a62e5 in __cxa_pure_virtual () from /lib/x86_64-linux-gnu/libstdc++.so.6
    #6  0x00007ffff5ed13f0 in arrow::json::ChunkedStructArrayBuilder::InsertChildren (this=0xb89ae0, block_index=0, 
        unconverted=...) at src/arrow/json/chunked_builder.cc:396
    #7  0x00007ffff5ed0321 in arrow::json::ChunkedStructArrayBuilder::Insert (this=0xb89ae0, block_index=0, 
        unconverted=std::shared_ptr<arrow::Array> (use count 1, weak count 0) = {...})
        at src/arrow/json/chunked_builder.cc:320
    #8  0x00007ffff5f2ba61 in arrow::json::TableReaderImpl::ParseAndInsert (this=0xc489b0, 
        partial=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, 
        completion=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, 
        whole=std::shared_ptr<arrow::Buffer> (use count 1, weak count 0) = {...}, block_index=0)
        at src/arrow/json/reader.cc:158
    #9  0x00007ffff5f2a331 in arrow::json::TableReaderImpl::Read()::{lambda()#1}::operator()() const (__closure=0xca6cb8)
        at src/arrow/json/reader.cc:104
    ...
asfimport commented 2 years ago

Jacob Wujciak / @assignUser: The same exception still happens in pyarrow 7.0.0.