apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Creating a table from a sliced struct array drops the slice #44731

Open joseph-isaacs opened 1 week ago

joseph-isaacs commented 1 week ago

Describe the bug, including details regarding any error messages, version, and platform.

On pyarrow 17.0.0, creating a table from a sliced struct array ignores the slice bounds:

>>> pa.table(pa.array([{'a': 0}, {'a': 1}, {'a': 2}]).slice(0, 1))
pyarrow.Table
a: int64
----
a: [[0,1,2]]

I expect

a: [[0]]

Component(s)

Python

raulcd commented 1 week ago

This is also the current behavior on main. slice() computes a zero-copy slice by updating the length and/or offset of the struct array itself; the child arrays are left untouched, as the nanoarrow output shows:

>>> import pyarrow as pa
>>> import nanoarrow as na
>>> original_array = pa.array([{'a': 0}, {'a': 1}, {'a': 2}])
>>> sliced_array = original_array.slice(0,1)
>>> sliced_array
<pyarrow.lib.StructArray object at 0x764b3cf5d2a0>
-- is_valid: all not null
-- child 0 type: int64
  [
    0
  ]
>>> na.array(sliced_array).inspect()
<ArrowArray struct<a: int64>>
- length: 1
- offset: 0
- null_count: 0
- buffers[1]:
  - validity <bool[0 b] >
- dictionary: NULL
- children[1]:
  'a': <ArrowArray int64>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int64[24 b] 0 1 2>
    - dictionary: NULL
    - children[0]:
>>> na.array(original_array).inspect()
<ArrowArray struct<a: int64>>
- length: 3
- offset: 0
- null_count: 0
- buffers[1]:
  - validity <bool[0 b] >
- dictionary: NULL
- children[1]:
  'a': <ArrowArray int64>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int64[24 b] 0 1 2>
    - dictionary: NULL
    - children[0]:
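
To make the layout above concrete, here is a minimal pure-Python sketch of what the inspect() output suggests: slicing changes only the parent's offset and length, while the child keeps its full length of 3. ToyColumn, ToyStruct, naive_columns and flattened_columns are hypothetical names invented for illustration; this is not how pyarrow is implemented, only a model of the metadata shown above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyColumn:
    """Hypothetical stand-in for a child array: a shared buffer plus a window."""
    values: List[int]  # shared buffer, never copied
    offset: int
    length: int

    def to_list(self) -> List[int]:
        return self.values[self.offset:self.offset + self.length]


@dataclass(frozen=True)
class ToyStruct:
    """Hypothetical stand-in for a struct array with named children."""
    children: Dict[str, ToyColumn]
    offset: int
    length: int

    def slice(self, offset: int, length: int) -> "ToyStruct":
        # Zero-copy: only the parent's offset/length change, children are reused.
        return ToyStruct(self.children, self.offset + offset, length)

    def naive_columns(self) -> Dict[str, List[int]]:
        # Reads the children directly and ignores the parent's window,
        # which mirrors the behaviour reported in this issue.
        return {name: col.to_list() for name, col in self.children.items()}

    def flattened_columns(self) -> Dict[str, List[int]]:
        # Pushes the parent's offset/length down into each child first.
        return {
            name: ToyColumn(col.values, col.offset + self.offset, self.length).to_list()
            for name, col in self.children.items()
        }


buffer = [0, 1, 2]
original = ToyStruct({"a": ToyColumn(buffer, 0, 3)}, offset=0, length=3)
sliced = original.slice(0, 1)

naive = sliced.naive_columns()
flattened = sliced.flattened_columns()
shifted = original.slice(1, 2).flattened_columns()
print(naive, flattened, shifted)
```

In this model, reading the children without applying the parent's window returns all three values, which matches the table in the original report.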

I understand the use case, but I am unsure what the behavior should be when generating the RecordBatch if the slice has updated the offset. For example:

>>> pa.table(pa.array([{'a': 0}, {'a': 1}, {'a': 2}]).slice(1,2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 6172, in pyarrow.lib.table
    batch = record_batch(data, schema)
  File "pyarrow/table.pxi", line 5991, in pyarrow.lib.record_batch
    batch = RecordBatch._import_from_c_device_capsule(schema_capsule, array_capsule)
  File "pyarrow/table.pxi", line 4002, in pyarrow.lib.RecordBatch._import_from_c_device_capsule
    batch = GetResultValue(ImportDeviceRecordBatch(c_array, c_schema))
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowInvalid: ArrowArray struct has non-zero offset, cannot be imported as RecordBatch
>>> pa.array([{'a': 0}, {'a': 1}, {'a': 2}]).slice(1,2)
<pyarrow.lib.StructArray object at 0x764b3cf5d8a0>
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    2
  ]
>>>

@jorisvandenbossche @pitrou