Closed WillAyd closed 8 months ago
I quickly tested your MRE vs pyarrow using nanoarrow-python to inspect the data:
PyArrow:
import nanoarrow as na
import pyarrow as pa
schema = pa.schema([("interval", pa.month_day_nano_interval())])
tbl = pa.Table.from_arrays(
    [
        pa.array(
            [
                None,
                pa.scalar((1, 1, 1), type=pa.month_day_nano_interval()),
                pa.scalar((42, 42, 42), type=pa.month_day_nano_interval()),
                None,
            ]
        )
    ],
    schema=schema,
)
In [5]: stream = na.c_array_stream(tbl)
In [6]: arr = stream.get_next().child(0)
In [7]: arr
Out[7]:
<nanoarrow.c_lib.CArray interval_month_day_nano>
- length: 4
- offset: 0
- null_count: 2
- buffers: (140484108394496, 140484108394560)
- dictionary: NULL
- children[0]:
In [8]: na.c_array_view(arr)
Out[8]:
<nanoarrow.c_lib.CArrayView>
- storage_type: 'interval_month_day_nano'
- length: 4
- offset: 0
- null_count: 2
- buffers[2]:
- <bool validity[1 b] 01100000>
- <interval_month_day_nano data[64 b] (0, 0, 0) (1, 1, 1) (42, 42, 42) (0, ...>
- dictionary: NULL
- children[0]:
Your MRE:
In [1]: import nanoarrow_mre
In [2]: capsule = nanoarrow_mre.get_interval_capsule()
In [3]: import nanoarrow as na
In [4]: stream = na.c_lib.CArrayStream._import_from_c_capsule(capsule)
In [5]: stream
Out[5]:
<nanoarrow.c_lib.CArrayStream>
- get_schema(): struct<interval_column: interval_month_day_nano>
In [6]: arr = stream.get_next().child(0)
In [7]: arr
Out[7]:
<nanoarrow.c_lib.CArray interval_month_day_nano>
- length: 4
- offset: 0
- null_count: 2
- buffers: (94736573435584, 94736573573504)
- dictionary: NULL
- children[0]:
In [8]: na.c_array_view(arr)
Out[8]:
<nanoarrow.c_lib.CArrayView>
- storage_type: 'interval_month_day_nano'
- length: 4
- offset: 0
- null_count: 2
- buffers[2]:
- <bool validity[1 b] 00111111>
- <interval_month_day_nano data[64 b] (0, 0, 0) (1, 1, 1) (42, 42, 42) (0, ...>
- dictionary: NULL
- children[0]:
So the data itself looks good (the (1, 1, 1) and (42, 42, 42) values are still there), but the validity bitmap is wrong. It masks the (1, 1, 1) value and does not mask the 4th value, which makes that slot's (0, 0, 0) visible.
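To make the comparison concrete, here is a minimal sketch in plain Python. It assumes the bit strings in the two outputs above are printed with element 0 first, which is what the PyArrow result suggests (elements 1 and 2 are the valid ones). The helper name `apply_validity` is made up for illustration and is not part of nanoarrow or pyarrow.

```python
def apply_validity(bits, values):
    """Pair each value with its validity bit; character 0 belongs to element 0."""
    # zip() stops at the shorter input, so the padding bits past the last value are ignored.
    return [value if bit == "1" else None for bit, value in zip(bits, values)]


# The data buffer is the same in both outputs above.
values = [(0, 0, 0), (1, 1, 1), (42, 42, 42), (0, 0, 0)]

expected = apply_validity("01100000", values)  # bitmap from the PyArrow table
observed = apply_validity("00111111", values)  # bitmap from the MRE capsule

print(expected)
print(observed)
```

Under that assumption, the first list keeps both intervals and has nulls at the ends, while the second loses (1, 1, 1) and exposes the (0, 0, 0) in the last slot.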
So given that a different implementation (nanoarrow) also sees the wrong data, I assume it's actually an issue with the created data, and not with pyarrow (Arrow C++).
I don't directly see anything wrong with your code (but I am not very familiar with the appenders, ArrowArrayFinishElement, etc.), so it might also be a bug in the nanoarrow C code here.
Looking at the implementation of ArrowArrayAppendInterval, I suppose it is missing an ArrowBitmapAppend?
And for example the ArrowArrayAppendDecimal just afterwards has this part:
if (private_data->bitmap.buffer.data != NULL) {
  NANOARROW_RETURN_NOT_OK(ArrowBitmapAppend(ArrowArrayValidityBitmap(array), 1, 1));
}
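As a rough illustration of why a missing bitmap append would matter, here is a toy model in plain Python. It is not nanoarrow's implementation: the `ToyBuilder` class, its methods, and the lazy allocation of the validity list on the first null are assumptions made for this sketch, loosely based on the guarded append shown above.

```python
class ToyBuilder:
    """Toy builder with a lazily allocated validity bitmap (hypothetical model)."""

    def __init__(self, append_bit_for_values):
        self.length = 0
        self.validity = None  # assumed to be allocated only once a null is appended
        self.append_bit_for_values = append_bit_for_values

    def append_null(self):
        if self.validity is None:
            # Backfill a 1 for every value appended before the first null.
            self.validity = [1] * self.length
        self.validity.append(0)
        self.length += 1

    def append_value(self):
        # Mirrors the guarded append above; skipping it models the suspected bug.
        if self.validity is not None and self.append_bit_for_values:
            self.validity.append(1)
        self.length += 1


def build(append_bit_for_values):
    builder = ToyBuilder(append_bit_for_values)
    builder.append_null()
    builder.append_value()  # stands in for (1, 1, 1)
    builder.append_value()  # stands in for (42, 42, 42)
    builder.append_null()
    return builder.length, builder.validity


print(build(True))   # value appends also extend the bitmap
print(build(False))  # value appends skip the bitmap
```

In this toy model the buggy path ends with only two validity bits for four values, both 0, so slots 0 and 1 read as null. That would be consistent with the (1, 1, 1) slot being masked in the MRE output. What slots 2 and 3 read as depends on the padding bits, which this sketch does not model.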
(moved the issue to https://github.com/apache/arrow-nanoarrow)
Describe the bug, including details regarding any error messages, version, and platform.
I am trying to work with interval data passed along the new PyCapsule interface. I noticed that this seems to work fine in the python-space:
However, when trying to read a capsule created in an extension via nanoarrow that provides an equivalent array I am getting unexpected results. Assuming the following extension built via nanoarrow:
Coupled with this CMake file to build the extension:
I get rather strange results:
Here is what tbl ends up looking like:
As you can see from the result, the nulls are misplaced and we have likely lost the 1D1M1ns interval.
I don't think this is an issue with nanoarrow: I haven't seen it in ADBC, and when inspecting the raw bytes I see the expected data, so I think it is specific to how the capsules are read back into pyarrow.
@jorisvandenbossche @paleolimbot
Component(s)
Python