apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

pyarrow.parquet.read_table("parquet_file") causes bus error in ipython #31185

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I have a parquet file with two columns (int64 and double) and 9 million rows. The parquet tools (parquet, parquet-reader, parquet-schema...) read it perfectly. (I have many such files, and they all exhibit the same behavior.)

The following code fails with "zsh bus error  ipython":

import pyarrow.parquet as pq
pq.read_table("parquet_file")

These snippets work properly.

pq.read_table("parquet_file", use_legacy_dataset=True)

f = pq.ParquetFile("parquet_file")
f.read()
for batch in f.iter_batches():
    print(len(batch))

Environment: macOS 12.2.1, aarch64, Python 3.10.1, Arrow 7.0.0
Reporter: Jay Edwards

Note: This issue was originally created as ARROW-15737. Please see the migration documentation for further details.

asfimport commented 2 years ago

Jay Edwards: This also happens with a clean Python 3.9.9 environment:

pyenv uninstall 3.9.9
pyenv install 3.9.9
pyenv local 3.9.9
pip install pyarrow ipython pyfzf nord-pygments

asfimport commented 2 years ago

Jay Edwards: I rebuilt the arrow libraries according to the instructions.

I used the ninja-release-python preset from the release-7.0.0 branch.

asfimport commented 2 years ago

Jay Edwards: I've found files that don't exhibit the behavior.

asfimport commented 2 years ago

David Li / @lidavidm: Are you able to attach a sample file that fails?
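
Since some files fail and others do not, one way to pick a sample to attach is to read each file in its own Python process, so that a bus error kills only the child and not the session doing the triage. The following is a minimal sketch, not something from this thread: the helper names `run_isolated`, `classify`, and `triage` are hypothetical, and it assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import subprocess
import sys

# Program run in the child interpreter: read one Parquet file and print its
# row count. The file path arrives as sys.argv[1].
CHILD_CODE = (
    "import sys\n"
    "import pyarrow.parquet as pq\n"
    "print(pq.read_table(sys.argv[1]).num_rows)\n"
)


def run_isolated(code, *args):
    """Run `code` in a fresh interpreter and return its exit status."""
    proc = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True,
        text=True,
    )
    return proc.returncode


def classify(returncode):
    """Turn a subprocess return code into a short description."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # POSIX only: a negative value -N means the child was killed by signal N.
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def triage(paths):
    """Map each path to the outcome of reading it in a separate process."""
    return {str(p): classify(run_isolated(CHILD_CODE, str(p))) for p in paths}


if __name__ == "__main__":
    # Hypothetical usage: python triage.py file1.parquet file2.parquet ...
    for path, outcome in triage(sys.argv[1:]).items():
        print(f"{path}: {outcome}")
```

Files reported as killed by a signal are candidates for a sample attachment; a positive exit status instead points to an ordinary Python exception in the child, such as a missing file.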