apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] `pa.compute.is_null()` returns incorrect answer for dense union arrays and segfaults for dense union scalars #34315

Open · dannygoldstein opened this issue 1 year ago

dannygoldstein commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

In pyarrow versions 11.0.0 and 10.0.1, if I create a dense union array with some null elements, pa.compute.is_null() reports that they are not null. Repro:

import pyarrow as pa
# 12 union elements: the first 11 reference child 0 (int64), the last references child 1 (string)
types = pa.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], type=pa.int8())
value_offsets = pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0], type=pa.int32())
array1 = pa.array([1, 2, 3, 1, None, 3, None, None, None, None, 1])  # child 0, contains nulls
array2 = pa.array(["b"])  # child 1
dense = pa.UnionArray.from_dense(types, value_offsets, [array1, array2])
print(dense)

# <pyarrow.lib.UnionArray object at 0x285dfbc40>
# -- is_valid: all not null
# -- type_ids:   [
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     0,
#     1
#   ]
# -- value_offsets:   [
#     0,
#     1,
#     2,
#     3,
#     4,
#     5,
#     6,
#     7,
#     8,
#     9,
#     10,
#     0
#   ]
# -- child 0 type: int64
#   [
#     1,
#     2,
#     3,
#     1,
#     null,
#     3,
#     null,
#     null,
#     null,
#     null,
#     1
#   ]
# -- child 1 type: string
#   [
#     "b"
#   ]

Illustration of the first issue:

pa.compute.is_null(dense)
# expected: BooleanArray [false, false, false, false, true, false, true, true, true, true, false, false]
# actual: 
# <pyarrow.lib.BooleanArray object at 0x285dfbdc0>
# [
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false,
#   false
# ]

Illustration of the second issue: if I call pa.compute.is_null() on a null element of the array, I get a segfault:

null_element = dense[4]
print(null_element)
# <pyarrow.UnionScalar: None>
pa.compute.is_null(null_element)
# Fatal Python error: Segmentation fault
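
A possible workaround sketch for the scalar case (not part of the original report), assuming UnionScalar.as_py() unwraps to the underlying Python value like other pyarrow scalars do: check the unwrapped value instead of passing the scalar to the kernel.

null_element = dense[4]
# as_py() unwraps the union scalar to its Python value, so a null element comes back as None
print(null_element.as_py() is None)
# True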

Component(s)

Python

jorisvandenbossche commented 1 year ago

@dannygoldstein thanks for the report! That's indeed a bug. The problem is that a union array has no top-level validity bitmap; its validity is determined by the validity bitmaps of its child arrays. The is_null kernel should take this into account, but that doesn't seem to happen.

A related problem is that the null_count attribute is already wrong (and is_null might be taking a shortcut because of that):

>>> dense.null_count
0
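
A possible workaround sketch for the array case (not from the thread), assuming UnionArray exposes type_codes, offsets, and field() as recent pyarrow versions do: compute the mask manually by dispatching to the child arrays.

import pyarrow as pa

def dense_union_is_null(arr):
    # For every union element, look up the validity of the referenced slot in
    # the corresponding child array. Assumes type codes equal child indices,
    # which is the default when from_dense is given no explicit type codes.
    mask = []
    for code, offset in zip(arr.type_codes.to_pylist(), arr.offsets.to_pylist()):
        mask.append(not arr.field(code)[offset].is_valid)
    return pa.array(mask, type=pa.bool_())

print(dense_union_is_null(dense))
# true at positions 4, 6, 7, 8, 9 and false elsewhere -- the expected answer above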

dannygoldstein commented 1 year ago

Thanks for the quick response @jorisvandenbossche! And thanks also for all the great work on Arrow. It is an awesome package :)

westonpace commented 1 year ago

Looking at the kernel, it seems both problems are present. It does indeed shortcut based on null_count and, even if it didn't, there is no special logic for unions (it just grabs the validity bitmap).
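
A quick sanity check one could run against the repro above (a sketch, not from the thread), assuming UnionArray.field() is available: the children clearly carry nulls even though the parent's null_count, which the kernel shortcuts on, reports zero.

print(dense.null_count)           # 0 -- the value the kernel shortcuts on
print(dense.field(0).null_count)  # 5 -- the int64 child does contain nulls
print(dense.field(1).null_count)  # 0 -- the string child has none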