apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

Using nanoarrow C++ with `pybind11::Capsule` #377

Open genemerewether opened 5 months ago

genemerewether commented 5 months ago

Hi - it is nice to have a way to get pyarrow tables into c++ via nanoarrow and pybind - is this an officially supported use case? I have lingering questions about the ownership semantics here. Am I ok as long as the pyarrow table doesn't get deleted / garbage collected and I don't hold the nanoarrow table past the duration of the call?

This might be worth calling out as an advantage of nanoarrow - I've found this code very portable across compilers / versions, especially compared to pybind-ing arrow C++ package.

https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrow-pycapsule-interface

Python:

import pyarrow as pa

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
names = ["n_legs", "animals"]
t = pa.Table.from_arrays([n_legs, animals], names=names)

parse_pyarrow_table(*t.to_batches()[0].__arrow_c_array__())

pybind method:

m.def("parse_pyarrow_table",
        [](const pybind11::capsule& schema_capsule, const pybind11::capsule& array_capsule) {
          LOG(INFO) << array_capsule.name() << "; " << schema_capsule.name();

          nanoarrow::UniqueArray array(static_cast<ArrowArray*>(array_capsule.get_pointer()));
          nanoarrow::UniqueSchema schema(static_cast<ArrowSchema*>(schema_capsule.get_pointer()));
...
paleolimbot commented 5 months ago

To the immediate question: I believe the code you have above will result in nanoarrow::UniqueArray array and nanoarrow::UniqueSchema schema as "unique" owners: as long as those C++ objects are not deleted, you can safely access the fields of the pointed-to ArrowArray and ArrowSchema. This is true even if the Python objects are deleted (the Table and/or the Capsule): the constructor you invoked for the UniqueArray and UniqueSchema will move ownership from the capsules to the UniqueXXX. The release callback will be called exactly once for each of the array and schema when the UniqueXXX is deleted.
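The move semantics described above can be sketched without nanoarrow or pybind11. This is only an illustration under stated assumptions: the ArrowArray layout is copied from the Arrow C Data Interface specification, while OwnedArray, CountingRelease, CapsuleDestructor and ReleaseCountAfterMove are hypothetical names that stand in for nanoarrow::UniqueArray, the producer's release callback and the capsule destructor. It is not nanoarrow's actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// Layout copied from the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical producer callback: counts how often it runs, then marks
// the struct as released, as the specification requires.
static void CountingRelease(ArrowArray* array) {
  ++*static_cast<int*>(array->private_data);
  array->release = nullptr;
}

// Hypothetical stand-in for a unique owner: copies the struct and marks
// the source as released, so only this object will call release.
class OwnedArray {
 public:
  explicit OwnedArray(ArrowArray* src) : array_(*src) { src->release = nullptr; }
  OwnedArray(const OwnedArray&) = delete;
  OwnedArray& operator=(const OwnedArray&) = delete;
  ~OwnedArray() {
    if (array_.release != nullptr) array_.release(&array_);
  }
  ArrowArray* get() { return &array_; }

 private:
  ArrowArray array_;
};

// What a capsule destructor does with its struct: release it only if it
// has not been released or moved already.
static void CapsuleDestructor(ArrowArray* array) {
  if (array->release != nullptr) array->release(array);
}

// Returns how many times release ran, or -1 if it ran too early.
int ReleaseCountAfterMove() {
  int release_calls = 0;
  ArrowArray in_capsule{};
  in_capsule.length = 4;
  in_capsule.private_data = &release_calls;
  in_capsule.release = &CountingRelease;
  {
    OwnedArray owner(&in_capsule);
    CapsuleDestructor(&in_capsule);  // capsule collected first: a no-op
    if (release_calls != 0 || owner.get()->length != 4) return -1;
  }  // owner destroyed here: release runs
  return release_calls;
}
```

Calling ReleaseCountAfterMove() exercises the case where the capsule is collected while the C++ owner is still alive: the data stays readable and release runs once, when the owner goes out of scope.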

To the slightly larger question of how you get a Table into another Python package: I would recommend invoking __arrow_c_stream__(), because then all of the looping happens in C++. If you have a table that for some reason has thousands of chunks, you won't pay the performance cost of a tight Python loop. (We did a tiny bit of work when writing the pyarrow/arrow-r bridges to verify that this is the case.) In nanoarrow 0.4.0 (about to be released), I added some helpers to do that looping.
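For reference, the convention that any C++ loop over a stream relies on can be sketched in isolation: get_next returns 0 and hands back an already-released array (release == nullptr) once the stream is exhausted. The struct layouts below are copied from the Arrow C Data and C Stream interface specifications. The producer (NextChunk, ReleaseChunk, ReleaseStream) and the helpers CountRows and CountRowsInMockStream are hypothetical and exist only to make the sketch self-contained.

```cpp
#include <cassert>
#include <cstdint>

struct ArrowSchema;  // only referenced through a pointer in this sketch

// Layouts copied from the Arrow C Data / C Stream interface specifications.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

static void ReleaseChunk(ArrowArray* array) { array->release = nullptr; }
static void ReleaseStream(ArrowArrayStream* stream) { stream->release = nullptr; }

// Hypothetical producer: yields 10-row chunks until the counter stored in
// private_data reaches zero, then yields a released array.
static int NextChunk(ArrowArrayStream* stream, ArrowArray* out) {
  int* remaining = static_cast<int*>(stream->private_data);
  *out = ArrowArray{};  // release == nullptr signals end of stream
  if (*remaining > 0) {
    --*remaining;
    out->length = 10;
    out->release = &ReleaseChunk;
  }
  return 0;
}

// Consumer loop: check for end of stream before touching the chunk, and
// release every chunk that was actually produced.
int64_t CountRows(ArrowArrayStream* stream) {
  int64_t total = 0;
  while (true) {
    ArrowArray chunk{};
    if (stream->get_next(stream, &chunk) != 0) return -1;
    if (chunk.release == nullptr) break;  // end of stream
    total += chunk.length;
    chunk.release(&chunk);
  }
  stream->release(stream);
  return total;
}

// Builds the mock stream and consumes it.
int64_t CountRowsInMockStream(int n_chunks) {
  ArrowArrayStream stream{};
  stream.get_next = &NextChunk;
  stream.release = &ReleaseStream;
  stream.private_data = &n_chunks;
  return CountRows(&stream);
}
```

The end-of-stream check sits before the per-chunk work on purpose: a loop that processes the array first and tests release afterwards would also process the final, empty array.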

Your Python might look like:

parse_pyarrow_table(t.__arrow_c_stream__())

And your C++ might look like:

m.def("parse_pyarrow_table",
        [](const pybind11::capsule& array_stream_capsule) {
          // The capsule holds an ArrowArrayStream, not an ArrowArray
          nanoarrow::UniqueArrayStream stream(
              static_cast<ArrowArrayStream*>(array_stream_capsule.get_pointer()));
          nanoarrow::UniqueSchema schema;
          nanoarrow::UniqueArray array;

          NANOARROW_THROW_NOT_OK(ArrowArrayStreamGetSchema(stream.get(), schema.get(), nullptr));

          while (true) {
            array.reset();
            NANOARROW_THROW_NOT_OK(ArrowArrayStreamGetNext(stream.get(), array.get(), nullptr));
            // A released array (release == nullptr) marks the end of the stream
            if (array->release == nullptr) break;
            // Do something with array
          }