apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
169 stars 35 forks source link

[Python] PyArrow Capsule from Nanoarrow-built Interval Arrow Yields Unexpected Values #375

Closed WillAyd closed 8 months ago

WillAyd commented 8 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I am trying to work with interval data passed along the new pycapsule interface. I noticed that this seems to work fine in the python-space:

import pyarrow as pa
schema = pa.schema([("interval", pa.month_day_nano_interval())])
tbl = pa.Table.from_arrays([pa.array(
    [
        None,
        pa.scalar((1, 1, 1), type=pa.month_day_nano_interval()),
        pa.scalar((42, 42, 42), type=pa.month_day_nano_interval()),
        None,
    ]
)], schema=schema)
capsule = tbl.__arrow_c_stream__()
stream = pa.RecordBatchReader._import_from_c_capsule(capsule)
new = stream.read_all()
tbl == new  # True

However, when trying to read a capsule created in an extension via nanoarrow that provides an equivalent array I am getting unexpected results. Assuming the following extension built via nanoarrow:

```cpp #include #include namespace nb = nanobind; static auto releaseArrowStream(void *ptr) noexcept -> void { auto stream = static_cast(ptr); if (stream->release != nullptr) { ArrowArrayStreamRelease(stream); } } auto get_interval_capsule() -> nb::capsule { nanoarrow::UniqueSchema schema; ArrowSchemaInit(schema.get()); if (ArrowSchemaSetTypeStruct(schema.get(), 1)) { throw std::runtime_error("ArrowSchemaSetTypeStruct failed"); } if (ArrowSchemaSetType(schema->children[0], NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO)) { throw std::runtime_error("ArrowSchemaSetType failed"); } nanoarrow::UniqueArray array; if (ArrowArrayInitFromSchema(array.get(), schema.get(), nullptr)) { throw std::runtime_error("ArrowSchemaInitFromSchema failed"); } if (ArrowSchemaSetName(schema->children[0], "interval_column")) { throw std::runtime_error("ArrowSchemaSetName failed"); } if (ArrowArrayStartAppending(array.get())) { throw std::runtime_error("ArrowArrayStartAppending failed"); } struct ArrowInterval interval; ArrowIntervalInit(&interval, NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO); // row 0 ArrowArrayAppendNull(array->children[0], 1); if (ArrowArrayFinishElement(array.get())) { throw std::runtime_error("ArrowArrayFinishElement failed"); } // row 1 interval.months = 1; interval.days = 1; interval.ns = 1; if (ArrowArrayAppendInterval(array->children[0], &interval)) { throw std::runtime_error("Failed to append interval value"); } if (ArrowArrayFinishElement(array.get())) { throw std::runtime_error("ArrowArrayFinishElement failed"); } // row 2 interval.months = 42; interval.days = 42; interval.ns = 42; if (ArrowArrayAppendInterval(array->children[0], &interval)) { throw std::runtime_error("Failed to append interval value"); } if (ArrowArrayFinishElement(array.get())) { throw std::runtime_error("ArrowArrayFinishElement failed"); } // row 3 ArrowArrayAppendNull(array->children[0], 1); if (ArrowArrayFinishElement(array.get())) { throw std::runtime_error("ArrowArrayFinishElement failed"); } if (ArrowArrayFinishBuildingDefault(array.get(), nullptr)) { throw std::runtime_error("ArrowArrayFinishBuildingDefault failed"); } auto stream = (struct ArrowArrayStream *)malloc(sizeof(struct ArrowArrayStream)); if (ArrowBasicArrayStreamInit(stream, schema.get(), 1)) { free(stream); throw std::runtime_error("ArrowBasicArrayStreamInit failed"); } ArrowBasicArrayStreamSetArray(stream, 0, array.get()); return nb::capsule{stream, "arrow_array_stream", &releaseArrowStream}; } NB_MODULE(nanoarrow_mre, m) { m.def("get_interval_capsule", &get_interval_capsule); } ```

Coupled with this CMake file to build the extension:

```cmake cmake_minimum_required(VERSION 3.18) project(nanoarrow_mre LANGUAGES CXX) set(CMAKE_CXX_STANDARD 17) set(CMAKE_CXX_STANDARD_REQUIRED ON) if (MSVC) else() add_compile_options(-Wall -Wextra) endif() find_package(Python COMPONENTS Interpreter Development.Module NumPy REQUIRED) # Detect the installed nanobind package and import it into CMake execute_process( COMMAND "${Python_EXECUTABLE}" -m nanobind --cmake_dir OUTPUT_STRIP_TRAILING_WHITESPACE OUTPUT_VARIABLE NB_DIR) list(APPEND CMAKE_PREFIX_PATH "${NB_DIR}") find_package(nanobind CONFIG REQUIRED) include(FetchContent) FetchContent_Declare(nanoarrow-project GIT_REPOSITORY https://github.com/apache/arrow-nanoarrow.git GIT_TAG b3c952a3e21c2b47df85dbede3444f852614a3e2 ) FetchContent_MakeAvailable(nanoarrow-project) nanobind_add_module(nanoarrow_mre NOMINSIZE nanoarrow_mre.cpp) target_link_libraries(nanoarrow_mre PRIVATE nanoarrow) set_target_properties(nanoarrow PROPERTIES POSITION_INDEPENDENT_CODE ON) ```

I get rather strange results:

import pyarrow as pa
import nanoarrow_mre
capsule = nanoarrow_mre.get_interval_capsule()
stream = pa.RecordBatchReader._import_from_c_capsule(capsule)
tbl = stream.read_all()

Here is what tbl ends up looking like:

>>> tbl
pyarrow.Table
interval_column: month_day_nano_interval
----
interval_column: [[null,null,42M42d42ns,0M0d0ns]]

As you can see from the result, the nulls are misplaced and we have likely lost the 1D1M1ns interval.

I don't think this is an issue with nanoarrow - I haven't seen it in ADBC and when inspecting the raw bytes I am seeing the expected data, so I think it is specific to how the capsules are being read back into pyarrow

@jorisvandenbossche @paleolimbot

Component(s)

Python

jorisvandenbossche commented 8 months ago

I quickly tested your MRE vs pyarrow using nanoarrow-python to inspect the data:

PyArrow:

import pyarrow as pa
schema = pa.schema([("interval", pa.month_day_nano_interval())])
tbl = pa.Table.from_arrays([pa.array(
    [
        None,
        pa.scalar((1, 1, 1), type=pa.month_day_nano_interval()),
        pa.scalar((42, 42, 42), type=pa.month_day_nano_interval()),
        None,
    ]
)], schema=schema)

In [5]: stream = na.c_array_stream(tbl)

In [6]: arr = s.get_next().child(0)

In [7]: arr
Out[7]: 
<nanoarrow.c_lib.CArray interval_month_day_nano>
- length: 4
- offset: 0
- null_count: 2
- buffers: (140484108394496, 140484108394560)
- dictionary: NULL
- children[0]:

In [8]: na.c_array_view(ar)
Out[8]: 
<nanoarrow.c_lib.CArrayView>
- storage_type: 'interval_month_day_nano'
- length: 4
- offset: 0
- null_count: 2
- buffers[2]:
  - <bool validity[1 b] 01100000>
  - <interval_month_day_nano data[64 b] (0, 0, 0) (1, 1, 1) (42, 42, 42) (0, ...>
- dictionary: NULL
- children[0]:

Your MRE:

In [1]: import nanoarrow_mre

In [2]: capsule = nanoarrow_mre.get_interval_capsule()

In [3]: import nanoarrow as na

In [4]: stream = na.c_lib.CArrayStream._import_from_c_capsule(capsule)

In [5]: stream
Out[5]: 
<nanoarrow.c_lib.CArrayStream>
- get_schema(): struct<interval_column: interval_month_day_nano>

In [6]: arr = stream.get_next().child(0)

In [7]: arr
Out[7]: 
<nanoarrow.c_lib.CArray interval_month_day_nano>
- length: 4
- offset: 0
- null_count: 2
- buffers: (94736573435584, 94736573573504)
- dictionary: NULL
- children[0]:

In [8]: na.c_array_view(arr)
Out[8]: 
<nanoarrow.c_lib.CArrayView>
- storage_type: 'interval_month_day_nano'
- length: 4
- offset: 0
- null_count: 2
- buffers[2]:
  - <bool validity[1 b] 00111111>
  - <interval_month_day_nano data[64 b] (0, 0, 0) (1, 1, 1) (42, 42, 42) (0, ...>
- dictionary: NULL
- children[0]:

So the data itself looks good (the (1, 1, 1) and (42, 42, 42) are still there), but it's the validity bitmap that is wrong. It masks the (1,1,1) value, and does not mask the 4th value, making this (0, 0, 0) visible.

jorisvandenbossche commented 8 months ago

So given that a different implementation (nanoarrow) also sees the wrong data, I assume it's actually an issue with the created data, and not with pyarrow (Arrow C++). I don't directly see something wrong with your code (but not very familiar with the appenders and ArrowArrayFinishElement etc), so it might also be a bug in the nanoarrow c code here.

jorisvandenbossche commented 8 months ago

Looking at the implementation of ArrowArrayAppendInterval, I suppose it is missing an ArrowBitmapAppend?

See https://github.com/apache/arrow-nanoarrow/blob/d5febf1c2a693338d76d691cdf486e4fe83ca128/src/nanoarrow/array_inline.h#L534-L575

And for example the ArrowArrayAppendDecimal just afterwards has this part:

  if (private_data->bitmap.buffer.data != NULL) {
    NANOARROW_RETURN_NOT_OK(ArrowBitmapAppend(ArrowArrayValidityBitmap(array), 1, 1));
  }
jorisvandenbossche commented 8 months ago

(moved the issue to https://github.com/apache/arrow-nanoarrow)