apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
151 stars 34 forks source link

fix(python): Make shallow CArray copies less shallow to accomodate moving children #451

Closed paleolimbot closed 2 months ago

paleolimbot commented 2 months ago

This PR updates the logic that creates a "shallow copy" of an ArrowArray. Before, it simply made a shallow copy of the outer array, which works in most cases. However, the spec allows for "moving" child arrays, which means that the guarantee of a valid outer array does not guarantee anything about the validity of the ArrowArray* child pointers. It's difficult, but possible, to trigger this in Python (below); however, I think this sort of code is common for importers (because it allows the lifecycle of columns in an array to be independent).

import nanoarrow as na
import pyarrow as pa

# Given some array
array = na.c_array_from_buffers(
    na.struct({"col1": na.int32()}),
    3,
    [None],
    children=[na.c_array([1, 2, 3], na.int32())]
)

user_array = na.Array(array)
user_array
#> nanoarrow.Array<int32>[3]
#> {'col1': 1}
#> {'col1': 2}
#> {'col1': 3}

# totally valid shallow copy of this array
schema_capsule, array_capsule = array.__arrow_c_array__()
array2 = na.c_array(array_capsule, schema_capsule)

# A consumer is technically allowed to move a child array
x = pa.Array._import_from_c(array2.child(0)._addr(), pa.int32())
del array
del x

# With the previous shallow copy implementation, this could segfault or fail
list(user_array.iter_py())
# Instead of a segfault I tend to get:
#> NanoarrowException: ArrowBasicArrayStreamValidate() failed (22): Expected int32 array buffer 1 to have size >= 12 bytes but found buffer with 0 bytes

After this PR the original array remains valid:

list(user_array.iter_py())
#> [{'col1': 1}, {'col1': 2}, {'col1': 3}]

This does, however, have a non-trivial effect when making a shallow copy where a lot of children are involved:

import nanoarrow as na
import pyarrow as pa

n_col = int(1e3)
n_items = int(1e5)

array_wide = na.c_array_from_buffers(
    na.struct({f"col{i}": na.int32() for i in range(n_col)}),
    n_items,
    [None],
    children=[na.c_array(range(n_items), na.int32())] * n_col
)

%timeit pa.array(array_wide)
#> Before this PR
#> 595 µs ± 5.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
#> After this Pr
#> 828 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)