apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Table contains dictionary encoding with negative index #37732

Open zbs opened 1 year ago

zbs commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

OS: RHEL 8.3
Language: Python 3.8.15
PyArrow Version: 9.0.0

I have a script that streams data to a file using pa.ipc.new_stream(path, reader.schema), and a separate process that reads from that file using:

with pa.ipc.open_stream(path) as reader:
    batches = list(reader)
    df = pa.Table.from_batches(batches).to_pandas()

which fails with

pyarrow.lib.ArrowIndexError: Index -241 out of bounds

Inexplicably, in some cases a specific dictionary-encoded column has a row with a dictionary index of -241, even though the dictionary itself consists of a single value. Is this ordinary file corruption, a bug, or am I doing something wrong in the write/read process described above?

Component(s)

Python

kou commented 1 year ago

Could you provide a runnable script to reproduce the problem?

zbs commented 1 year ago

Unfortunately I don't have a simple reproducer, since this is hooked up to live data streams and other large internal data sources. I can paste my writer and reader code.

In writer script:

writer_exit_stack = ExitStack()
id_to_writer = {}
with writer_exit_stack:
    # data_reader is effectively a `pa.ipc.open_stream(obj)`
    for data_batch in data_reader:
        if len(data_batch) == 0:
            continue
        for id in pc.unique(data_batch["id"]):
            batch = data_batch.filter(pc.equal(data_batch["id"], id))
            path = f"foo_{id}.arrow"

            if id not in id_to_writer:
                writer = pa.ipc.new_stream(path, data_reader.schema)
                id_to_writer[id] = writer
                writer_exit_stack.enter_context(writer)
            writer = id_to_writer[id]
            writer.write_batch(batch)

In reader script:

with pa.ipc.open_stream(path) as reader:
    batch = reader.read_all()
    df = batch.to_pandas()

Some notes from debugging:

liujiajun commented 8 months ago

Hey, we encountered a similar issue, and I suspect it has something to do with the dictionary builder's adaptive int builder producing negative indices. Can anyone help with this?

kou commented 8 months ago

Could you provide a runnable script to reproduce the problem?