apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python][Parquet] Writing a parquet table fails with `segfault` #37747

Open slobodan-ilic opened 1 year ago

slobodan-ilic commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

When trying to export a portion of real survey data from our system, we encountered a segfault when invoking pq.write_table. Granted, there might be a problem with the data, but the function still shouldn't fail this way; it should raise an error with a message instead. Here's a code snippet that reproduces the issue, together with the 4 relevant npy files. I wasn't able to reproduce it with manually generated data (this is a slice of the real data that produced the error).

Here's the code:

"""Repro the discovered bug in pq writer."""

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

if __name__ == "__main__":
    # --- Load data ---
    inds = np.load("inds.npy")
    subrefs = np.load("subrefs.npy")
    values = np.load("values.npy")
    offsets = np.load("offsets.npy")

    # --- Create table with single MapArray ---
    keys = pa.DictionaryArray.from_arrays(inds, subrefs)
    var = pa.MapArray.from_arrays(offsets, keys, values)
    tbl = pa.Table.from_arrays([var], names=["mapvar"])

    # --- Try to write table to a parquet file (fails with segfault) ---
    pq.write_table(tbl, "test.parquet")

The relevant npy files are zipped in the attached archive:

data.zip

Lmk if you need any clarifications. I'll try generating a minimal example, but this is what I've got for now.

Component(s)

Parquet, Python

mapleFU commented 1 year ago

Which version of pyarrow are you using? I'm using 13.0 on macOS and the file was generated successfully... (Though it's a bit large and writing the file takes several seconds.)

slobodan-ilic commented 1 year ago

I'm using pyarrow 13.0.0 on macOS Monterey 12.6, MacBook Pro 2018. I'll try it on a couple of different machines and report back the results. Wanted to do that anyway, but you beat me to it @mapleFU.

mapleFU commented 1 year ago

I'd appreciate some more detailed information or a way to reproduce it. I can't help without enough info...

slobodan-ilic commented 1 year ago

> Which version of pyarrow are you using? I'm using 13.0 on macOS and the file was generated successfully... (Though it's a bit large and writing the file takes several seconds.)

I tried it on a different machine and it's not producing a segfault. Weird that I can only repro it on my personal machine. I'll try a couple more and get back to you. Can you let me know which OS you're using, and what's your python version (for which you tried the ☝️ example)?

It failed in my environment on Python 3.11.3, and also on 3.10.6...
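Since the crash seems to depend on the machine, it may help for everyone to post the same environment details. A minimal sketch using only the standard library follows; the helper name `environment_report` and the default package list are assumptions made for illustration, not part of pyarrow.

```python
"""Sketch: collect OS, Python, and package versions for a bug report."""

import platform
import sys
from importlib import metadata


def environment_report(packages=("pyarrow", "numpy")):
    """Return a dict describing the current environment.

    `packages` is a hypothetical default; pass whatever is relevant.
    """
    report = {
        "os": platform.platform(),          # OS name and release
        "machine": platform.machine(),      # CPU architecture as reported by the OS
        "python": sys.version.split()[0],   # interpreter version string
    }
    for name in packages:
        try:
            # Reads the installed distribution's version without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report
```

Printing the result with `for key, value in environment_report().items(): print(f"{key}: {value}")` gives a block that can be pasted straight into a comment.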

mapleFU commented 1 year ago

@slobodan-ilic I'm a little busy these days, sorry for the late reply. When the segfault happens, would you mind capturing a stack trace or a core dump file? It would help a lot.

If we still can't find it, I'll run a script that loops write_table multiple times this weekend. Currently I'm a little busy, so I don't have enough bandwidth to reproduce the problem...
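One possible way to combine the two requests above (capture a stack, and loop the write several times) is to run the repro in fresh child interpreters with CPython's built-in faulthandler enabled. This is only a sketch: the helper name `run_repeatedly` and the file name `repro.py` are assumptions, and faulthandler prints the Python-level traceback only, so a core dump or a debugger would still be needed to see the native C++ frames.

```python
"""Sketch: rerun a snippet in child processes and keep the stderr of failed runs."""

import subprocess
import sys


def run_repeatedly(code, attempts=5, timeout=600.0):
    """Run `code` in `attempts` fresh interpreters.

    Returns a list of (attempt, returncode, stderr) tuples for the runs that
    did not exit cleanly. A crash in the child cannot take down the parent.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            # -X faulthandler dumps the Python traceback to stderr on fatal
            # signals such as SIGSEGV, SIGABRT, or SIGBUS.
            [sys.executable, "-X", "faulthandler", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            # On POSIX, a negative return code means the child was killed by
            # that signal number.
            failures.append((attempt, proc.returncode, proc.stderr))
    return failures


# Hypothetical usage, assuming the snippet from this issue is saved as repro.py
# next to the npy files:
#   failures = run_repeatedly(
#       "import runpy; runpy.run_path('repro.py', run_name='__main__')",
#       attempts=20,
#   )
#   for attempt, code, stderr in failures:
#       print(attempt, code, stderr, sep="\n")
```

Running each attempt in its own process keeps the runs independent, so an intermittent crash shows up as one failed entry rather than ending the whole loop.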