apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Table contains dictionary encoding with negative index #37732

Open zbs opened 1 year ago

zbs commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

OS: RHEL 8.3
Language: Python 3.8.15
PyArrow Version: 9.0.0

I have a script that streams data to a file using pa.ipc.new_stream(path, reader.schema), and a separate process that reads from that file using:

with pa.ipc.open_stream(path) as reader:
    batches = list(reader)
    df = pa.Table.from_batches(batches).to_pandas()

which fails with

pyarrow.lib.ArrowIndexError: Index -241 out of bounds

Inexplicably, in some cases a specific dictionary-encoded column has a row with a dictionary index of -241, even though the dictionary itself consists of a single value. Is this ordinary file corruption, a bug, or am I doing something wrong in the write/read process described above?

Component(s)

Python

kou commented 1 year ago

Could you provide a runnable script to reproduce the problem?

zbs commented 1 year ago

Unfortunately I don't have a simple reproducer, since this is hooked up to live data streams and other large internal data sources. I can paste my writer and reader code.

In writer script:

writer_exit_stack = ExitStack()
id_to_writer = {}
with writer_exit_stack:
    # data_reader is effectively a `pa.ipc.open_stream(obj)`
    for data_batch in data_reader:
        if len(data_batch) == 0:
            continue
        for id in pc.unique(data_batch["id"]):
            batch = data_batch.filter(pc.equal(data_batch["id"], id))
            path = f"foo_{id}.arrow"

            if id not in id_to_writer:
                writer = pa.ipc.new_stream(path, data_reader.schema)
                id_to_writer[id] = writer
                writer_exit_stack.enter_context(writer)
            writer = id_to_writer[id]
            writer.write_batch(batch)

In reader script:

with pa.ipc.open_stream(path) as reader:
    batch = reader.read_all()
    df = batch.to_pandas()

Some notes from debugging:

liujiajun commented 8 months ago

Hey, we encountered a similar issue, and I suspect it has something to do with the dictionary builder's adaptive int builder producing negative indices. Can anyone help with this?

kou commented 8 months ago

Could you provide a runnable script to reproduce the problem?