apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.89k stars 3.38k forks source link

[C++] [Python] Does a sliced StructArray roundtrip on c data interface? #29951

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I am struggling to roundtrip a sliced StructArray over the c data interface.

Consider the array:


fields = [
            ("f1", pyarrow.int32()),
            ("f2", pyarrow.string()),
        ]
        a = pyarrow.array(
            [
                {"f1": 1, "f2": "a"},
                None,
                {"f1": 3, "f2": None},
                {"f1": None, "f2": "d"},
                {"f1": None, "f2": None},
            ],
            pyarrow.struct(fields),
        ).slice(1, 2)

When reading this array from the c data interface, I get:


array: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00007f61796091c0,
    children: 0x00007f6179609280,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f617aef2ba0,
    ),
    private_data: 0x00007f617960b3c0,
}

child #0: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00007f0f49609200,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b480,
}

child #1: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00007f0f49609240,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b540,
}

This does not seem consistent with what the Python API offers:


print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 0 5? (or better, vice-versa)

Secondly and most importantly, the condition that each child's length must equal the array's own length is violated (children length is 5, array's length is 2 in the example above).

We could argue that a consumer MUST slice each child to achieve the desired behavior, but that won't roundtrip because, when writing the StructArray (after consuming it), we would now write


write child: Ffi_ArrowArray {
    length: 2,
    null_count: 0,
    offset: 1,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00000000021c8b20,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024f0db0,
}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00000000024998f0,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x0000000002499910,
}
Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00000000024f12d0,
    children: 0x00000000021c8ae0,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024999c0,
}

is consumed as


print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
print(b.offset, len(b))  # 1 2 <-- OK

which causes the check in this line to fail.

I was unable to find a test for a roundtrip of a sliced struct in pyarrow tests to compare my test with a reference test, but it seems to me that when we slice a StructArray, we should slice its children accordingly so that its C data interface yields a consistent result?

Reporter: Jorge Leitão / @jorgecarleitao

Note: This issue was originally created as ARROW-14383. Please see the migration documentation for further details.

asfimport commented 2 years ago

Jorge Leitão / @jorgecarleitao: This behavior (of slicing the child) is also present in FixedSizeListArray.

asfimport commented 2 years ago

Weston Pace / @westonpace: This is a good question as for what is expected via the C data interface.

I can maybe offer some insights as to why you're getting this behavior as I recently explored some of these cases as part of https://github.com/apache/arrow/pull/11542:

The C++ impl considers array offsets to be inherited by children (note, I don't see any description of this in the C data interface docs so we should clarify intent). This allows for (as you have noticed) slicing parent arrays without having to mutate child arrays. I also don't know of any requirement that the child arrays must have the same length as the parent arrays. This is similar to buffers not being required to have the same length as the containing array.

For most data types this is pretty straightforward to implement. It's a bit trickier for variable length data types (you need to use the parent array's offsets AND indices maps to determine the referenced portion of the child array).

It's a downright pain for dense unions. There is no way to easy way know what the appropriate offset of the children arrays if you are trying to determine which portion of the child arrays is referenced by the union without iterating the type ids array of the union.

The last notable exception is ArrowArray.dictionary. Slicing an arrow array in C++ does not modify the ArrowArray.dictionary field at all and ArrowArray.dictionary does not inherit the offsets of the containing array.

asfimport commented 2 years ago

Weston Pace / @westonpace: I don't think I wrote that first sentence clearly. The above comment is "This is what the C++ implementation currently does" and not "this is how the C data interface should behave". It seems we need more documentation in the C data interface docs around this behavior.