apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

pyarrow.RecordBatch.from_pandas fails on concatenated pandas.DataFrames: TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array #41936

Open PatrikBernhard opened 3 months ago

PatrikBernhard commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

>>> import pandas as pd
>>> pd.__version__ 
'2.2.2'
>>> import pyarrow as pa
>>> pa.__version__
'16.1.0'
>>> df = pd.DataFrame({"1": [1]}).astype({"1": "int32[pyarrow]"})
>>> concat_df = pd.concat([df, df])
>>> pa.RecordBatch.from_pandas(df)
pyarrow.RecordBatch
1: int32
----
1: [1]

>>> pa.RecordBatch.from_pandas(concat_df)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3289, in pyarrow.lib.RecordBatch.from_pandas
  File "pyarrow/table.pxi", line 3379, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

This has been observed on macOS Sonoma 14.5 as well as on a Linux server.

Component(s)

Python

jorisvandenbossche commented 2 months ago

@PatrikBernhard the issue here is that your example pandas DataFrame consists of a chunked column (because of the concat step), and a RecordBatch is a data structure where each column consists of a single contiguous array.

In pyarrow, that's the difference between a RecordBatch and a Table (RecordBatch being a collection of Array objects, and a Table a collection of ChunkedArray objects).

So you will notice that pa.Table.from_pandas(concat_df) works fine. Historically, pandas DataFrames always had columns backed by a single non-chunked array under the hood, and that's the reason RecordBatch.from_pandas currently does not support chunked columns.

I am not entirely sure what the best solution is: keep raising the error (but perhaps make it more informative, or document this behaviour better), because people might not expect a copy in this conversion step, or automatically convert the chunked array to a contiguous array.

As a comparison, directly constructing a RecordBatch from a ChunkedArray gives the same error:

In [10]: arr = pa.chunked_array([pa.array([1], pa.int32()), pa.array([2], pa.int32())])

In [11]: pa.RecordBatch.from_arrays([arr], names=["col"])
...
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array