if_else bug when using chunked string array with offsets

apache / arrow

Describe the bug, including details regarding any error messages, version, and platform.

I have a chunked array made of views/slices of the same array.

When I call if_else on that array, the results are wrong and can include strings that are not valid UTF-8.

import random

import pyarrow as pa
import pyarrow.compute as pc

sizes = [131072, 57066]
values = ['FOO', "BAR", "HELLO", "WOLRD", ""]

data = pa.array([random.choice(values) for _ in range(sum(sizes))])

inputs = pa.chunked_array(
    [
        data[:sizes[0]],
        data[sizes[1]:]
    ]
)

results = pc.if_else(
    pc.equal(inputs, ""),
    pa.scalar(None, pa.string()),
    inputs,
)

print(pc.unique(results).sort().to_pylist())
# this returns corrupted data, e.g.: ['\x00\x00\x00', '\x00\x00\x00\x00\x00', 'BAR', ...]

For context, I'm loading data from a Parquet file and replacing empty strings with nulls. This started happening when the Parquet file grew large enough that the data was split into chunks.
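To check the if_else output independently of the compute kernel, the intended operation (empty strings become nulls, everything else is unchanged) can be restated in plain Python and compared element by element. This is only a minimal sketch: the helper names `replace_empty_with_none` and `find_mismatches` are hypothetical and not part of pyarrow.

```python
from typing import List, Optional, Tuple

Value = Optional[str]


def replace_empty_with_none(values: List[Value]) -> List[Value]:
    """Pure-Python reference: empty strings become None, other values are kept."""
    return [None if v == "" else v for v in values]


def find_mismatches(
    actual: List[Value], expected: List[Value], limit: int = 5
) -> List[Tuple[int, Value, Value]]:
    """Return up to `limit` (index, actual, expected) triples where the lists differ."""
    if len(actual) != len(expected):
        raise ValueError(
            f"length mismatch: {len(actual)} actual vs {len(expected)} expected"
        )
    mismatches = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            mismatches.append((i, a, e))
            if len(mismatches) >= limit:
                break
    return mismatches
```

With the reproducer above, `find_mismatches(results.to_pylist(), replace_empty_with_none(inputs.to_pylist()))` would be expected to return an empty list if if_else behaved correctly; a non-empty result points at the first positions where the output diverges from the input.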

I've tested with pyarrow==16.0.0

Component(s)

C++, Python

if_else bug when using chunked string array with offsets #41479