apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

pyarrow.RecordBatch.from_pandas fails on concatenated pandas.DataFrames: TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array #41936

Open PatrikBernhard opened 3 months ago

PatrikBernhard commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

>>> import pandas as pd
>>> pd.__version__ 
'2.2.2'
>>> import pyarrow as pa
>>> pa.__version__
'16.1.0'
>>> df = pd.DataFrame({"1": [1]}).astype({"1": "int32[pyarrow]"})
>>> concat_df = pd.concat([df, df])
>>> pa.RecordBatch.from_pandas(df)
pyarrow.RecordBatch
1: int32
----
1: [1]

>>> pa.RecordBatch.from_pandas(concat_df)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3289, in pyarrow.lib.RecordBatch.from_pandas
  File "pyarrow/table.pxi", line 3379, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

This has been observed on macOS Sonoma 14.5 as well as on a Linux server.

Component(s)

Python

jorisvandenbossche commented 2 months ago

@PatrikBernhard the issue here is that your example pandas DataFrame consists of a chunked column (because of the concat step), and a RecordBatch is a data structure where each column consists of a single contiguous array.

In pyarrow, that's the difference between a RecordBatch and a Table (RecordBatch being a collection of Array objects, and a Table a collection of ChunkedArray objects).

So you will notice that pa.Table.from_pandas(concat_df) works fine. Historically, pandas DataFrames always had columns backed by a single non-chunked array under the hood, and that's the reason RecordBatch.from_pandas currently does not support chunked columns.

I am not entirely sure what the best solution is: keep raising the error (but perhaps make it more informative, or document this behaviour better), because people might not expect a copy in this conversion step, or automatically convert the chunked array to a contiguous array.

As a comparison, directly constructing a RecordBatch from a ChunkedArray gives the same error:

In [10]: arr = pa.chunked_array([pa.array([1], pa.int32()), pa.array([2], pa.int32())])

In [11]: pa.RecordBatch.from_arrays([arr], names=["col"])
...
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array