apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

if_else bug when using chunked string array with offsets #41479

Open 0x26res opened 6 months ago

0x26res commented 6 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I have a chunked array made of view/slices of the same array.

When I call if_else on that array, the results are wrong and can contain strings that are not valid UTF-8.

import random

import pyarrow as pa
import pyarrow.compute as pc

sizes = [131072, 57066]
values = ['FOO', "BAR", "HELLO", "WOLRD", ""]

data = pa.array([random.choice(values) for _ in range(sum(sizes))])

inputs = pa.chunked_array(
    [
        data[:sizes[0]],
        data[sizes[1]:]
    ]
)

results = pc.if_else(
    pc.equal(inputs, ""),
    pa.scalar(None, pa.string()),
    inputs,
)

print(pc.unique(results).sort().to_pylist())
# This returns corrupted data, e.g. ['\x00\x00\x00', '\x00\x00\x00\x00\x00', 'BAR', ...]

For context, I'm loading data from a Parquet file and replacing empty strings with nulls. This started happening when the size of the Parquet file increased and the data became chunked.

I've tested with pyarrow==16.0.0

Component(s)

C++, Python

k-ishizaka commented 1 month ago

I have encountered the same problem.

I think this problem is caused by the shortcut in the if_else kernel that is taken when left is invalid: https://github.com/apache/arrow/blob/apache-arrow-17.0.0/cpp/src/arrow/compute/kernels/scalar_if_else.cc#L740-L750

This shortcut is not safe when right is a slice of another, larger array. In that case, the offsets of right may start in the middle of the larger array's data rather than at zero. Because the shortcut copies the values of right into a newly allocated output values buffer, the output offsets should start from zero, but the offsets of right are copied as they are.
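
To illustrate the failure mode described above, here is a minimal pure-Python sketch. It is a simplified model of a string array (one offsets list plus one bytes buffer), not Arrow's actual kernel code, and the names `decode`, `parent_data`, `parent_offsets` and `copied` are made up for this illustration. It shows how reusing un-rebased offsets against a freshly copied values buffer produces wrong strings, and how subtracting the first offset fixes it.

```python
def decode(offsets, data):
    """Return the strings described by consecutive offset pairs into data."""
    return [
        data[start:end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]


# Hypothetical parent array holding "FOO", "BAR", "HELLO".
parent_data = b"FOOBARHELLO"
parent_offsets = [0, 3, 6, 11]

# A slice of the last two elements shares the parent's buffers,
# so its offsets start at 3 rather than 0.
slice_offsets = parent_offsets[1:]  # [3, 6, 11]

# Copy only the bytes the slice refers to into a new, compact buffer.
copied = parent_data[slice_offsets[0]:slice_offsets[-1]]  # b"BARHELLO"

# Unsafe: the offsets are reused unchanged against the compact buffer.
# Python slicing silently truncates out-of-range reads; this is only a model.
broken = decode(slice_offsets, copied)

# Safe: rebase the offsets so that they start from zero.
rebased = [o - slice_offsets[0] for o in slice_offsets]
fixed = decode(rebased, copied)

print(broken)  # ['HEL', 'LO']
print(fixed)   # ['BAR', 'HELLO']
```

In this model the slice decodes correctly against the parent buffer, and only the combination of a compacted values buffer with the original offsets goes wrong, which matches the explanation above.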