apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.55k stars 766 forks source link

StructArray from ArrayData Ignores Offset #6151

Open kylebarron opened 2 months ago

kylebarron commented 2 months ago

Describe the bug

In https://github.com/kylebarron/arro3 I'm exporting arrow-rs functionality for general Python use. I seem to have hit a bug importing sliced arrays.

In import_array_pycapsules (which is vendored from arrow-rs code here) I have:

pub(crate) fn import_array_pycapsules(
    schema_capsule: &Bound<PyCapsule>,
    array_capsule: &Bound<PyCapsule>,
) -> PyResult<(ArrayRef, Field)> {
    validate_pycapsule_name(schema_capsule, "arrow_schema")?;
    validate_pycapsule_name(array_capsule, "arrow_array")?;

    let schema_ptr = unsafe { schema_capsule.reference::<FFI_ArrowSchema>() };
    let array = unsafe { FFI_ArrowArray::from_raw(array_capsule.pointer() as _) };

    let array_data = unsafe { arrow::ffi::from_ffi(array, schema_ptr) }
        .map_err(|err| PyTypeError::new_err(err.to_string()))?;
    dbg!(array_data.offset());

    let field = Field::try_from(schema_ptr).map_err(|err| PyTypeError::new_err(err.to_string()))?;
    let array = make_array(array_data);

    dbg!(array.offset());
    Ok((array, field))
}

Note the two dbg! macros. When invoked from Python with a pyarrow StructArray, the array offset is lost.

import pyarrow as pa
import pytest
from arro3.compute import struct_field

a = pa.array([1, 2, 3])
b = pa.array([3, 4, 5])
struct_arr = pa.StructArray.from_arrays([a, b], names=["a", "b"])
sliced = struct_arr.slice(1, 2)
sliced.offset # 1
pa.array(struct_field(sliced, [0]))
# <pyarrow.lib.Int64Array object at 0x10fa94700>
# [
#   1,
#   2
# ]

Note that the first two elements of a are kept, with the offset not used. I've isolated this to the two lines with dbg!. Those print:

[pyo3-arrow/src/ffi/from_python/utils.rs:84:5] array_data.offset() = 1
[pyo3-arrow/src/ffi/from_python/utils.rs:87:5] array.offset() = 0

In particular make_array does not check the offset from the base array: https://github.com/apache/arrow-rs/blob/80ed7128510bac114c6feec08c34ef3beed3a44a/arrow-array/src/array/struct_array.rs#L296-L311

To Reproduce

Here's the way to reproduce the upstream bug

git clone https://github.com/kylebarron/arro3
cd arro3
git checkout 9673b62
poetry install
poetry run maturin develop -m arro3-core/Cargo.toml
poetry run maturin develop -m arro3-compute/Cargo.toml
poetry run pytest

I can try to reproduce this in pure rust if needed, but that may not be possible because the StructArray seems to always export an offset of 0, and so it may not be easy to reproduce this importing behavior.

Expected behavior

Expected the array offset to be maintained.

Additional context

kylebarron commented 2 months ago

Note that the length of the array is also lost when importing a StructArray. I.e. if the imported array is shorter than its underlying buffer, that information is lost. For now, my workaround is to manually call slice on a StructArray when importing it via FFI

tustvold commented 2 months ago

StructArray is expected to push its offset into its children, is this not occurring? Is the issue that the offset doesn't roundtrip, which is expected, or that the data doesn't, which would be a bug?

kylebarron commented 2 months ago

StructArray is expected to push its offset into its children, is this not occurring?

This is not occurring for an ArrayData with a positive offset or non-full length.

Without manually calling StructArray.slice, the array (not the value of Array::offset) does not successfully round trip to Python.

tustvold commented 2 months ago

Ok I have updated the title to reflect this. It should be possible to reproduce by manually constructing an ArrayData and importing it into StructArray.

kylebarron commented 2 months ago

Here's a small repro that I think is showing what I mean:

#[cfg(test)]
mod test {
    use std::sync::Arc;

    use arrow::array::ArrayDataBuilder;
    use arrow_array::{make_array, Array, Int8Array, StructArray, UInt64Array};
    use arrow_schema::Field;

    #[test]
    fn test() {
        let a = Arc::new(Int8Array::from(vec![1, 2, 3, 4]));
        let b = Arc::new(UInt64Array::from(vec![1, 2, 3, 4]));
        let fields = vec![
            Field::new("a", a.data_type().clone(), true),
            Field::new("b", b.data_type().clone(), true),
        ];
        let original_struct_array = StructArray::new(fields.into(), vec![a, b], None);
        let array_data = original_struct_array.to_data();
        let builder: ArrayDataBuilder = array_data.into();

        // Set `offset` to 2
        let offset_array_data = builder.offset(2).len(2).build().unwrap();
        let reconstructed_struct_array = make_array(offset_array_data);

        dbg!(&reconstructed_struct_array);
        dbg!(original_struct_array.slice(2, 2));
    }
}

This prints:

[arro3-core/src/constructors.rs:129:9] &reconstructed_struct_array = StructArray
[
-- child 0: "a" (Int8)
PrimitiveArray<Int8>
[
  1,
  2,
  3,
  4,
]
-- child 1: "b" (UInt64)
PrimitiveArray<UInt64>
[
  1,
  2,
  3,
  4,
]
]
[arro3-core/src/constructors.rs:130:9] original_struct_array.slice(2, 2) = StructArray
[
-- child 0: "a" (Int8)
PrimitiveArray<Int8>
[
  3,
  4,
]
-- child 1: "b" (UInt64)
PrimitiveArray<UInt64>
[
  3,
  4,
]
]

My understanding is that setting offset and len on the ArrayDataBuilder should cause the same behavior as .slice(2, 2)