Open PatrikBernhard opened 3 months ago
@PatrikBernhard the issue here is that your example pandas DataFrame consists of a chunked column (because of the concat step), and a RecordBatch is a data structure where each column consists of a single contiguous array.
In pyarrow, that's the difference between a RecordBatch and a Table (a RecordBatch being a collection of Array objects, and a Table a collection of ChunkedArray objects).
So you will notice that pa.Table.from_pandas(concat_df) works fine.
Historically, pandas DataFrames always had columns backed by a single non-chunked array under the hood, and that's the reason RecordBatch.from_pandas currently does not support chunked columns.
I am not entirely sure what the best solution is: either keep raising the error (but maybe make it more informative, or document this behaviour better), because people might not expect a copy in this conversion step, or automatically convert the chunked array to a contiguous array.
As a comparison, directly constructing a RecordBatch from a ChunkedArray gives the same error:
In [10]: arr = pa.chunked_array([pa.array([1], pa.int32()), pa.array([2], pa.int32())])
In [11]: pa.RecordBatch.from_arrays([arr], names=["col"])
...
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
Describe the bug, including details regarding any error messages, version, and platform.
This has been observed on macOS Sonoma 14.5 as well as on a server running Linux.
Component(s)
Python