Open zbs opened 1 year ago
Could you provide a runnable script to reproduce the problem?
Unfortunately I don't have a simple reproducer, since this is hooked up to live data streams and other large internal data sources. I can paste my writer and reader code.
In writer script:
```python
from contextlib import ExitStack

import pyarrow as pa
import pyarrow.compute as pc

writer_exit_stack = ExitStack()
id_to_writer = {}
with writer_exit_stack:
    # data_reader is effectively a `pa.ipc.open_stream(obj)`
    for data_batch in data_reader:
        if len(data_batch) == 0:
            continue
        for id in pc.unique(data_batch["id"]):
            batch = data_batch.filter(pc.equal(data_batch["id"], id))
            path = f"foo_{id}.arrow"
            if id not in id_to_writer:
                writer = pa.ipc.new_stream(path, data_reader.schema)
                id_to_writer[id] = writer
                writer_exit_stack.enter_context(writer)
            writer = id_to_writer[id]
            writer.write_batch(batch)
```
In reader script:
```python
with pa.ipc.open_stream(path) as reader:
    batch = reader.read_all()
df = batch.to_pandas()
```
Some notes from debugging:
Hey, we encountered a similar issue, and I suspect it has something to do with the dictionary builder backed by the adaptive int builder, which produces negative indices. Can anyone help with this?
Describe the bug, including details regarding any error messages, version, and platform.
OS: RHEL 8.3
Language: Python 3.8.15
PyArrow Version: 9.0.0
I have a script that streams data to a file using `pa.ipc.new_stream(path, reader.schema)`, and a separate process that reads from that file using `pa.ipc.open_stream(path)` (both shown above); the read fails.
Inexplicably, in some cases a specific dictionary-encoded column ends up with a row whose dictionary index is -241, even though the dictionary itself contains only a single value. Is this ordinary file corruption, a bug, or am I doing something wrong in the writing/reading process described above?
Component(s)
Python