Open kylebarron opened 4 months ago
Note that the length of the array is also lost when importing a `StructArray`. I.e. if the imported array is shorter than its underlying buffer, that information is lost. For now, my workaround is to manually call `slice` on a `StructArray` when importing it via FFI.
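The workaround can be sketched with a toy model. `ToyStruct`, `import_ignoring_offset`, and `import_with_workaround` below are hypothetical names, not arrow-rs APIs. They only mimic an importer that returns the children whole and drops the parent's `offset` and `length`, followed by the manual slice that restores the logical view.

```rust
/// Toy stand-in for a struct array: every child column has the same length.
/// This is NOT the arrow-rs `StructArray`; it only illustrates the idea.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyStruct {
    pub children: Vec<Vec<i64>>,
}

impl ToyStruct {
    /// Logical slice: keep `len` rows starting at `offset` in every child.
    pub fn slice(&self, offset: usize, len: usize) -> ToyStruct {
        let children = self
            .children
            .iter()
            .map(|child| child[offset..offset + len].to_vec())
            .collect();
        ToyStruct { children }
    }
}

/// Hypothetical importer that mirrors the reported behavior: the parent's
/// `offset` and `length` are ignored and the children come back whole.
pub fn import_ignoring_offset(
    children: Vec<Vec<i64>>,
    _offset: usize,
    _length: usize,
) -> ToyStruct {
    ToyStruct { children }
}

/// Workaround: import first, then slice manually with the parent's
/// offset and length.
pub fn import_with_workaround(
    children: Vec<Vec<i64>>,
    offset: usize,
    length: usize,
) -> ToyStruct {
    import_ignoring_offset(children, offset, length).slice(offset, length)
}
```

With two four-element children and a parent `offset` of 2 and `length` of 2, the lossy import keeps all four rows, while the workaround keeps only the last two.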
StructArray is expected to push its offset into its children, is this not occurring? Is the issue that the offset doesn't roundtrip, which is expected, or that the data doesn't, which would be a bug?
> StructArray is expected to push its offset into its children, is this not occurring?
This is not occurring for an `ArrayData` with a positive offset or non-full length.
Without manually calling `StructArray.slice`, the array (not the value of `Array::offset`) does not successfully round trip to Python.
Ok I have updated the title to reflect this. It should be possible to reproduce by manually constructing an ArrayData and importing it into StructArray.
Here's a small repro that I think is showing what I mean:
#[cfg(test)]
mod test {
    use std::sync::Arc;

    use arrow::array::ArrayDataBuilder;
    use arrow_array::{make_array, Array, Int8Array, StructArray, UInt64Array};
    use arrow_schema::Field;

    #[test]
    fn test() {
        let a = Arc::new(Int8Array::from(vec![1, 2, 3, 4]));
        let b = Arc::new(UInt64Array::from(vec![1, 2, 3, 4]));
        let fields = vec![
            Field::new("a", a.data_type().clone(), true),
            Field::new("b", b.data_type().clone(), true),
        ];
        let original_struct_array = StructArray::new(fields.into(), vec![a, b], None);
        let array_data = original_struct_array.to_data();
        let builder: ArrayDataBuilder = array_data.into();
        // Set `offset` to 2 and `len` to 2
        let offset_array_data = builder.offset(2).len(2).build().unwrap();
        let reconstructed_struct_array = make_array(offset_array_data);
        dbg!(&reconstructed_struct_array);
        dbg!(original_struct_array.slice(2, 2));
    }
}
This prints:
[arro3-core/src/constructors.rs:129:9] &reconstructed_struct_array = StructArray
[
-- child 0: "a" (Int8)
PrimitiveArray<Int8>
[
1,
2,
3,
4,
]
-- child 1: "b" (UInt64)
PrimitiveArray<UInt64>
[
1,
2,
3,
4,
]
]
[arro3-core/src/constructors.rs:130:9] original_struct_array.slice(2, 2) = StructArray
[
-- child 0: "a" (Int8)
PrimitiveArray<Int8>
[
3,
4,
]
-- child 1: "b" (UInt64)
PrimitiveArray<UInt64>
[
3,
4,
]
]
My understanding is that setting `offset` and `len` on the `ArrayDataBuilder` should cause the same behavior as `.slice(2, 2)`.
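That expectation can be written down as a small property on a toy model. `ToyData` and `ToyStructArray` are hypothetical types, not the arrow-rs `ArrayData` or `StructArray`, and `from_data` is only a sketch of a constructor that pushes the parent's `offset` and `len` into each child, under the assumption that this is what the import path is meant to do.

```rust
/// Toy stand-in for array data: a parent offset/len over full-length children.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyData {
    pub offset: usize,
    pub len: usize,
    pub children: Vec<Vec<u64>>,
}

/// Toy stand-in for a struct array whose columns already reflect any offset.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyStructArray {
    pub len: usize,
    pub columns: Vec<Vec<u64>>,
}

impl ToyStructArray {
    /// Build from data, pushing the parent's offset and length into
    /// every child so that no rows outside the window survive.
    pub fn from_data(data: &ToyData) -> Self {
        let end = data.offset + data.len;
        let columns = data
            .children
            .iter()
            .map(|child| child[data.offset..end].to_vec())
            .collect();
        ToyStructArray { len: data.len, columns }
    }

    /// Slice an already constructed array.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        let columns = self
            .columns
            .iter()
            .map(|col| col[offset..offset + len].to_vec())
            .collect();
        ToyStructArray { len, columns }
    }
}
```

In this model, building from data with `offset: 2, len: 2` and slicing the full array with `slice(2, 2)` give equal values, which is the equivalence the repro expects from `make_array`.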
Describe the bug
In https://github.com/kylebarron/arro3 I'm exporting arrow-rs functionality for general Python use. I seem to have hit a bug importing sliced arrays.
In `import_array_pycapsules` (which is vendored from arrow-rs code here) I have:
Note the two `dbg!` macros. When invoked from Python with a pyarrow `StructArray`, the array offset is lost.
Note that the first two elements of `a` are kept, with the `offset` not used. I've isolated this to the two lines with `dbg!`. Those print:
In particular, `make_array` does not check the `offset` from the base array: https://github.com/apache/arrow-rs/blob/80ed7128510bac114c6feec08c34ef3beed3a44a/arrow-array/src/array/struct_array.rs#L296-L311
To Reproduce
Here's the way to reproduce the upstream bug
I can try to reproduce this in pure Rust if needed, but that may not be possible because the `StructArray` seems to always export an `offset` of `0`, and so it may not be easy to reproduce this importing behavior.
Expected behavior
Expected the array offset to be maintained.
Additional context