apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] segfault for Table.join with sliced large_string array #39415

Closed lukemanley closed 9 months ago

lukemanley commented 10 months ago

Describe the bug, including details regarding any error messages, version, and platform.

The following code segfaults. Changing the type from large_string to string and/or removing the slice (when creating t2) seems to avoid the segfault.

import pyarrow as pa

arr = pa.array(list("ABC"), pa.large_string())
t1 = pa.table({"key": arr})
t2 = pa.table({"key": arr[1:]})
t1.join(t2, "key", join_type="inner")  # segfault

Component(s)

Python

kevinmingtarja commented 10 months ago

Hi @lukemanley, just curious, what version and platform are you running on?

I tried to reproduce this using pyarrow-14.0.2 on both my Apple M1 Mac and an amd64 Ubuntu 22.04 machine (EC2 VM), but couldn't. I also tried it on Google Colab, where the session kept crashing, which may have been caused by the segfault.

Python 3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> arr = pa.array(list("ABC"), pa.large_string())
>>> t1 = pa.table({"key": arr})
>>> t2 = pa.table({"key": arr[1:]})
>>> t1.join(t2, "key", join_type="inner")
pyarrow.Table
key: large_string
----
key: [["B","C"]]
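
A crash like the Colab one above takes the whole session down, which makes it hard to tell a segfault from any other failure. One possible way to check is to run the reproduction in a child interpreter and inspect its return code. This is only a sketch: `REPRO`, `run_isolated` and `describe` are hypothetical names introduced here, and the negative-return-code convention is an assumption that holds on POSIX systems only.

```python
import signal
import subprocess
import sys
import textwrap

# The reproduction from the original report, kept as a string so it can be
# executed in a separate interpreter.
REPRO = textwrap.dedent(
    """
    import pyarrow as pa

    arr = pa.array(list("ABC"), pa.large_string())
    t1 = pa.table({"key": arr})
    t2 = pa.table({"key": arr[1:]})
    print(t1.join(t2, "key", join_type="inner"))
    """
)


def run_isolated(code):
    """Run `code` in a fresh interpreter so a crash cannot take down the caller."""
    return subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )


def describe(result):
    """Summarise how the child process ended."""
    rc = result.returncode
    if rc == 0:
        return "ok"
    if rc < 0:
        # Assumption: POSIX only. A negative return code means the child
        # was killed by that signal number.
        try:
            return f"killed by {signal.Signals(-rc).name}"
        except ValueError:
            return f"killed by signal {-rc}"
    return f"exit code {rc}"


if __name__ == "__main__":
    result = run_isolated(REPRO)
    print(describe(result))
    print(result.stdout or result.stderr)
```

If the child is killed by `SIGSEGV`, the parent keeps running and can report it. A plain non-zero exit code points to an ordinary Python error instead, such as pyarrow not being installed.
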

lukemanley commented 9 months ago

Hmm, I can no longer reproduce this. It might have been an earlier version of pyarrow. I will close this.
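
Since the crash may have been specific to one pyarrow version, it could help to record the environment next to the reproduction. The following is a minimal sketch using only the standard library. `environment_report` is a hypothetical helper, not part of pyarrow, and it reads the installed package metadata rather than importing pyarrow itself.

```python
import platform
from importlib import metadata


def environment_report(package="pyarrow"):
    """Collect the version and platform details usually asked for in a bug report."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not installed"
    return {
        package: version,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

Pasting this output together with the failing snippet makes it easier to tell later whether a report that no longer reproduces was tied to an older release.
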