Tom-Newton opened this issue 5 months ago.
Thanks for the report!
I was unable to reproduce with an IPC stream fully in python created from a pyarrow table.
Could you try creating an IPC file from the pyspark dataframe? (I don't know if pyspark provides the functionality for that) Or can you convert the pyspark dataframe to pyarrow first (not going through pandas), and then save it?
And something else you could try: does it still reproduce after a roundtrip to Parquet?
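For reference, this is roughly what I mean, using a small pyarrow table as a stand-in for the data coming out of Spark (the file paths below are just placeholders):
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in table; the real test would use the (nested list) data coming out of Spark
schema = pa.schema([("id", pa.int64()), ("value", pa.list_(pa.float64()))])
table = pa.table({"id": [0], "value": [None]}, schema=schema)

# Save the data as an IPC stream file so it can be shared and inspected later
with pa.OSFile("/tmp/repro_stream.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)

# Roundtrip through Parquet to see whether the crash survives it
pq.write_table(table, "/tmp/repro.parquet")
pq.read_table("/tmp/repro.parquet").to_pandas()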
A quick attempt at reproducing this in pyarrow, which, as you also observed, doesn't crash:
import pyarrow as pa

typ = pa.struct([
    pa.field("id", pa.int64()),
    pa.field("value", pa.list_(pa.list_(pa.list_(pa.list_(pa.list_(pa.float64()))))))
])
arr = pa.array([{"id": 0, "value": None}], typ)
arr.to_pandas()
Thanks for the suggestions. PySpark doesn't really support this, but I can hack it in.
does it still reproduce after a roundtrip to Parquet?
No, after a roundtrip to Parquet the problem no longer occurs. To test this, I modified this section of pyspark (roughly along the lines of the sketch below): https://github.com/apache/spark/blob/v3.5.1/python/pyspark/sql/pandas/serializers.py#L323-L324
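The change is conceptually just a per-batch roundtrip through Parquet before the data goes to pandas; something along these lines (an illustrative in-memory sketch, not the exact patch I applied):
import io

import pyarrow as pa
import pyarrow.parquet as pq

def parquet_roundtrip(batch):
    # Write the record batch to an in-memory Parquet file and read it back,
    # which rebuilds all the buffers from scratch.
    buf = io.BytesIO()
    pq.write_table(pa.Table.from_batches([batch]), buf)
    buf.seek(0)
    # Assumes the roundtripped table comes back as a single batch
    return pq.read_table(buf).to_batches()[0]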
Could you try creating an IPC file from the pyspark dataframe? (I don't know if pyspark provides the functionality for that) Or can you convert the pyspark dataframe to pyarrow first (not going through pandas), and then save it?
I managed to hack something in, and now it can be reproduced with just pyarrow.
with open("/tmp/arrow_stream", "rb") as read_file:
with pa.ipc.open_stream(read_file) as reader:
schema = reader.schema
for batch in reader:
[c.to_pandas(date_as_object=True) for c in record_batches_to_chunked_array(batch)]
def record_batches_to_chunked_array(batch):
for c in pa.Table.from_batches([batch]).itercolumns():
yield c
arrow_stream.txt (the .txt extension is just so GitHub lets me upload it; it's actually a binary dump of the Arrow stream created with the hack I mentioned above).
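(For completeness: the hack is conceptually just writing the raw bytes of the incoming Arrow IPC stream to a file before pyarrow parses them, along the lines of the hypothetical helper below, not the actual Spark patch.)
import pyarrow as pa

def dump_and_open(stream, path="/tmp/arrow_stream"):
    # 'stream' stands for whatever file-like object carries the IPC bytes from the JVM.
    data = stream.read()
    with open(path, "wb") as dump:
        dump.write(data)
    return pa.ipc.open_stream(pa.BufferReader(data))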
Wait... there is something else weird going on. This smaller reproducer doesn't always work, depending on the Python environment.
So it turns out the bug is also only reproducible when numpy is imported prior to pyarrow. My current smallest reproducer is:
import numpy as np
import pyarrow as pa

with open("/tmp/arrow_stream", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        schema = reader.schema
        for batch in reader:
            batch.to_pandas()

print("SUCCESS")
where the file is the arrow_stream.txt from my previous comment.
I managed to attach a debugger, so I can see a bit about why it's segfaulting. Ultimately the segfault is on arrow/array/array_nested.h:90. Suspiciously, the value of data_->offset is -189153672 in the call to value_offset that segfaults; all the previous calls had data_->offset equal to 0. -189153672 is a deterministic value when I run with the arrow_stream file I got out of Spark. Interestingly, we still get this -189153672 value without the numpy import, but without numpy it does not segfault. For an IPC file created from Python, data_->offset is always 0 and there is no segfault. So, potentially the problem is actually on the Java side.
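For what it's worth, part of this can be poked at from Python without a debugger, since arr.offset on the Python side corresponds to data_->offset in C++. It doesn't necessarily show the garbage value seen in the debugger, but it makes walking the nesting levels easy (rough sketch):
import pyarrow as pa

with open("/tmp/arrow_stream", "rb") as read_file:
    with pa.ipc.open_stream(read_file) as reader:
        batch = reader.read_next_batch()

arr = batch["value"]
level = 0
while pa.types.is_list(arr.type):
    # arr.offset is the logical offset into the buffers (data_->offset in C++)
    print(f"level {level}: offset={arr.offset}, length={len(arr)}")
    arr = arr.values
    level += 1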
Thanks, I can now reproduce it as well!
I think you are right in the observation that this seems to be a problem with the data generated on the Java/Spark side (although it is still strange that whether it segfaults depends on numpy being imported first or not).
When reading your IPC stream file without converting to pandas and inspecting the data, we can see that it is indeed invalid data:
import pyarrow as pa

with pa.ipc.open_stream("../Downloads/arrow_stream.txt") as reader:
    batch = reader.read_next_batch()
    arr = batch["value"]
>>> arr
<pyarrow.lib.ListArray object at 0x7fe352a8d1e0>
[
null,
null,
null
]
# Validating / inspecting the parent array
>>> arr.validate(full=True)
>>> arr.offsets
<pyarrow.lib.Int32Array object at 0x7fe29daa0460>
[
0,
0,
0,
0
]
>>> arr.values
<pyarrow.lib.ListArray object at 0x7fe29d84c0a0>
[]
# Validating / inspecting the first child array
>>> arr.values.validate(full=True)
>>> arr.values.offsets
<pyarrow.lib.Int32Array object at 0x7fe29f3ef880>
<Invalid array: Buffer #1 too small in array of type int32 and length 1: expected at least 4 byte(s), got 0>
>>> arr.values.values
<pyarrow.lib.ListArray object at 0x7fe29f238760>
[]
So the offsets of the child array are missing. This child array has a length of 0, but following the format, the offsets buffer still needs to have a length of 1. I seem to remember this is a case that has come up before.
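To illustrate that requirement: even a zero-length list array built by pyarrow itself still carries a single offset entry (small sketch):
import pyarrow as pa

empty = pa.array([], type=pa.list_(pa.float64()))
print(len(empty))           # 0 elements
print(empty.offsets)        # the offsets buffer still holds one value: [0]
print(len(empty.offsets))   # 1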
A different way to inspect the data is using nanoarrow (with the arr from the code example above), where we can also see that each of the list child arrays has a "data_offset" buffer of 0 bytes:
>>> import nanoarrow as na
>>> na.array(arr).inspect()
<ArrowArray list<element: list<element: list<element: list<element: l>
- length: 3
- offset: 0
- null_count: 3
- buffers[2]:
- validity <bool[1 b] 00000000>
- data_offset <int32[16 b] 0 0 0 0>
- dictionary: NULL
- children[1]:
'element': <ArrowArray list<element: list<element: list<element: list<elemen>
- length: 0
- offset: 0
- null_count: 0
- buffers[2]:
- validity <bool[0 b] >
- data_offset <int32[0 b] >
- dictionary: NULL
- children[1]:
'element': <ArrowArray list<element: list<element: list<element: double>>
- length: 0
- offset: 0
- null_count: 0
- buffers[2]:
- validity <bool[0 b] >
- data_offset <int32[0 b] >
- dictionary: NULL
- children[1]:
'element': <ArrowArray list<element: list<element: double>>>
- length: 0
- offset: 0
- null_count: 0
- buffers[2]:
- validity <bool[0 b] >
- data_offset <int32[0 b] >
- dictionary: NULL
- children[1]:
'element': <ArrowArray list<element: double>>
- length: 0
- offset: 0
- null_count: 0
- buffers[2]:
- validity <bool[0 b] >
- data_offset <int32[0 b] >
- dictionary: NULL
- children[1]:
'element': <ArrowArray double>
- length: 0
- offset: 0
- null_count: 0
- buffers[2]:
- validity <bool[0 b] >
- data <double[0 b] >
- dictionary: NULL
- children[0]:
Related issue: https://github.com/apache/arrow/issues/31396
And a more recent issue: https://github.com/apache/arrow/issues/40038, with a fix in Arrow Java 16.0 for this (at least for strings, not sure if the lists were automatically fixed as well). Although the discussion was only about the C Data Interface, so for IPC they might still do the same as before (didn't check the PR in detail)
And a more recent issue: #40038, with a fix in Arrow Java 16.0 for this (at least for strings, not sure if the lists were automatically fixed as well).
I did try building a custom pyspark based on Arrow 16.0.0 on the Java side. Unfortunately it still fails in the same way. In fact, the resulting stream was identical to the official pyspark release, which uses Arrow 12.0.1 on the Java side.
I will probably try to create a minimal reproducer for the bad IPC file with java arrow and without spark.
We are having some discussion about this on Zulip chat, and the conclusion might be that the C++ library is generally forgiving about this and accepts it as input. But if it does accept it, we should of course ensure we handle it properly in all cases internally and don't crash on such data (or at least raise a proper error complaining about invalid data instead of segfaulting). I can try to take a look at fixing it in the pyarrow-to-pandas conversion.
Describe the bug, including details regarding any error messages, version, and platform.

So far I've only been able to reproduce this case with pyspark, but I think the bug is probably on the arrow side. The problem was introduced with https://github.com/apache/arrow/pull/15210, and reverting this change still fixes the problem on the 16.0.0 release.

Reproduce

The smallest reproducer I've found is the following: reproduce_pyspark.py.txt (it has a .txt extension because GitHub doesn't let me upload .py files).

Versions:

Error is:
full_stdout.txt

A few things I've noticed:
… to_pandas … on it.

Component(s)

C++, Python, Java