apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Fields within a null struct are not initialized with null values #41833

Open timsaucer opened 5 months ago

timsaucer commented 5 months ago

Describe the bug, including details regarding any error messages, version, and platform.

When creating an array from Python dicts, the field entries of a null struct are initialized with default values rather than nulls, even if those fields are nullable. In the minimal example below, you would expect the third row's inner_1 and inner_2 values to be null.

import pyarrow as pa

print(pa.array([
    {"outer": {"inner_1": 1, "inner_2": 2}},
    {"outer": {"inner_1": 3, "inner_2": None}},
    {"outer": None},
]))

This generates the following output:

-- is_valid: all not null
-- child 0 type: struct<inner_1: int64, inner_2: int64>
  -- is_valid:
      [
        true,
        true,
        false
      ]
  -- child 0 type: int64
    [
      1,
      3,
      0
    ]
  -- child 1 type: int64
    [
      2,
      null,
      0
    ]

Component(s)

Python

jorisvandenbossche commented 5 months ago

@timsaucer I see what you mean, but as far as I know, nothing in the Arrow columnar format specification requires that those values are null.

Note that even for a primitive array with a null, we actually put some "default" value in the null slot:

>>> arr = pa.array([1, None, 3])
>>> arr
<pyarrow.lib.Int64Array object at 0x7f23782d5360>
[
  1,
  null,
  3
]

# using nanoarrow to more easily view the actual buffers
>>> import nanoarrow as na
>>> na.array(arr).inspect()
<ArrowArray int64>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 10100000>
  - data <int64[24 b] 1 0 3>    # <-- looking at the actual data buffer, the null slot is also filled with 0
- dictionary: NULL
- children[0]:

Similarly, in the nested struct case, those default values in the child array are masked by the validity of the parent struct array. I know it is not exactly the same, given that I am comparing a buffer with a child array, but the principle is the same: the null is determined by the validity bitmap, and at that point the underlying value (whether a buffer slot or a child array's slot) can be any value.
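To make that concrete, here is a small sketch that reuses the example from the issue and inspects the inner child array with nanoarrow, the same way as in the snippet above; nothing here is new API, it just shows where the placeholder values live:

import pyarrow as pa
import nanoarrow as na

struct_arr = pa.array([
    {"outer": {"inner_1": 1, "inner_2": 2}},
    {"outer": {"inner_1": 3, "inner_2": None}},
    {"outer": None},
])

# The child int64 array has no nulls of its own: its data buffer simply holds
# a placeholder 0 in the slot that the parent struct's validity bitmap marks
# as null.
na.array(struct_arr.field("outer").field("inner_1")).inspect()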

While you could argue that specifically for this kind of conversion of Python objects to Arrow data we could put a null in the child array as well (although that would require allocating an additional validity bitmap in this small example), other code should never assume this is the case, since you can easily create a StructArray in a different way (e.g. directly from the child arrays and a validity bitmap) that would also not give this guarantee.
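As an illustration of that last point, a minimal sketch using StructArray.from_arrays with a mask: the parent-level nulls live only in the struct's validity bitmap, and the child arrays keep whatever values they happen to contain underneath.

import pyarrow as pa

# child arrays with arbitrary placeholder values in the slot we will mask out
inner_1 = pa.array([1, 3, 999])
inner_2 = pa.array([2, None, 999])

# mask=True marks the struct slot as null; the 999s stay in the child arrays
masked = pa.StructArray.from_arrays(
    [inner_1, inner_2],
    names=["inner_1", "inner_2"],
    mask=pa.array([False, False, True]),
)
print(masked.is_valid())        # [true, true, false]
print(masked.field("inner_1"))  # still [1, 3, 999] under the hood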

jorisvandenbossche commented 5 months ago

Note that if this is about accessing that subfield of a struct array: at that point you indeed typically (though depending on the exact use case) want to "propagate" the parent struct's null values to the child field as well.

For that reason, pyarrow provides two separate APIs to get the child array (using your original example as arr):

# getting the "raw" child array as stored under the hood
>>> arr.field("outer").field("inner_1")
<pyarrow.lib.Int64Array object at 0x7f23734339a0>
[
  1,
  3,
  0
]

# getting the "logical" child array
>>> import pyarrow.compute as pc
>>> pc.struct_field(arr, ["outer", "inner_1"])
<pyarrow.lib.Int64Array object at 0x7f237276b3a0>
[
  1,
  3,
  null
]

This API is far from ideal. On the C++ side, there is a StructArray::GetFlattenedField that gives you this logical, "flattened" version with the nulls propagated. I personally think that the call that is most natural for a user (`.field(..)`) should give you what most users expect (and what is safest), i.e. the flattened version with nulls propagated. See https://github.com/apache/arrow/issues/14970 for this.
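For completeness, a quick sketch of the two behaviours side by side from Python, again using your original example as arr; this assumes StructArray.flatten(), which goes through the same flattening logic, for the nulls-propagated version:

# "raw" child array as stored: the slot under the null struct still shows 0
raw_inner_1 = arr.field("outer").field("inner_1")

# "logical" view: flatten() merges the parent struct's validity into each
# child, so the same slot comes back as null
inner_1_flat, inner_2_flat = arr.field("outer").flatten()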

I assume that on the DataFusion side there should also be some distinction between those two ways to get a field.