Open genemerewether opened 5 months ago
To the immediate question: I believe the code you have above will result in `nanoarrow::UniqueArray array` and `nanoarrow::UniqueSchema schema` as "unique" owners: as long as those C++ objects are not deleted, you can safely access the fields of the pointed-to `ArrowArray` and `ArrowSchema`. This is true even if the Python objects are deleted (the Table and/or the Capsule): the constructor you invoked for the `UniqueArray` and `UniqueSchema` will move ownership from the capsules to the `UniqueXXX`. The release callback will be called exactly once for each of the array and schema when the `UniqueXXX` is deleted.
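To make the "moved, then released exactly once" behaviour concrete, here is a minimal, self-contained sketch. It does not use nanoarrow: `FakeArray`, `UniqueFake`, `CapsuleDestructor`, and `DemoReleaseCount` are hypothetical stand-ins that only mimic the release-callback convention described above, so treat it as an illustration under those assumptions rather than nanoarrow's actual implementation.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for ArrowArray: only the release callback matters here.
struct FakeArray {
  void (*release)(FakeArray*);
  void* private_data;
};

// Hypothetical owner, loosely modelled on the move-on-construct behaviour
// described above (not the real nanoarrow class).
class UniqueFake {
 public:
  UniqueFake() {
    data_.release = nullptr;
    data_.private_data = nullptr;
  }
  // Take ownership: copy the struct, then mark the source as released.
  explicit UniqueFake(FakeArray* src) : UniqueFake() {
    std::memcpy(&data_, src, sizeof(FakeArray));
    src->release = nullptr;
  }
  UniqueFake(const UniqueFake&) = delete;
  UniqueFake& operator=(const UniqueFake&) = delete;
  ~UniqueFake() {
    if (data_.release != nullptr) data_.release(&data_);
  }
  FakeArray* get() { return &data_; }

 private:
  FakeArray data_;
};

// Counts release calls through private_data and marks the struct released.
inline void CountingRelease(FakeArray* array) {
  ++*static_cast<int*>(array->private_data);
  array->release = nullptr;
}

// What a capsule destructor is expected to do: release only if still owned.
inline void CapsuleDestructor(FakeArray* array) {
  if (array->release != nullptr) array->release(array);
}

// Returns how many times release ran once both "owners" are gone.
inline int DemoReleaseCount() {
  int count = 0;
  FakeArray in_capsule{CountingRelease, &count};
  {
    UniqueFake owner(&in_capsule);            // ownership moves out of the capsule
    CapsuleDestructor(&in_capsule);           // capsule deleted first: nothing to release
    assert(owner.get()->release != nullptr);  // the C++ owner is still valid
  }  // owner destroyed: release runs here
  return count;
}
```

In this sketch the capsule can be destroyed before or after the C++ owner. Either way the callback runs once, because the moved-from struct has `release == nullptr`.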
To the slightly larger question of how you get a `Table` into another Python package: I would recommend invoking `__arrow_c_stream__()`. I recommend this because then all of the looping happens in C++: if you have a table that for some reason has thousands of chunks, you won't pay any performance cost for a tight Python loop. (We did a tiny bit of work when writing the pyarrow/arrow-r bridges to verify that this is the case.) In nanoarrow 0.4.0 (about to be released), I added some helpers to do that looping.
Your Python might look like:
    parse_pyarrow_table(t.__arrow_c_stream__())
And your C++ might look like:
    m.def("parse_pyarrow_table",
          [](const pybind11::capsule& array_stream_capsule) {
            // Move ownership of the stream out of the capsule.
            nanoarrow::UniqueArrayStream stream(
                static_cast<ArrowArrayStream*>(array_stream_capsule.get_pointer()));
            nanoarrow::UniqueSchema schema;
            nanoarrow::UniqueArray array;
            NANOARROW_THROW_NOT_OK(
                ArrowArrayStreamGetSchema(stream.get(), schema.get(), nullptr));
            while (true) {
              array.reset();
              NANOARROW_THROW_NOT_OK(
                  ArrowArrayStreamGetNext(stream.get(), array.get(), nullptr));
              // A released array marks the end of the stream.
              if (array->release == nullptr) break;
              // Do something with array
            }
          });
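As one possible way to fill in the "do something with array" step, the sketch below totals the row count across all chunks of a stream. It is self-contained and calls the raw C stream callbacks directly instead of the nanoarrow helpers. The struct layouts are copied from the Arrow C data and C stream interface specifications, while `CountRows`, `ToyGetNext`, `ToyArrayRelease`, and `DemoCountRows` are hypothetical names made up for this illustration.

```cpp
#include <cstdint>
#include <stdexcept>

// Struct layouts as given in the Arrow C data / C stream interface specs.
struct ArrowSchema;  // only used through a pointer in this sketch

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

// Hypothetical helper: total number of rows across every chunk in the stream.
inline int64_t CountRows(ArrowArrayStream* stream) {
  int64_t total = 0;
  while (true) {
    ArrowArray chunk{};  // zero-initialised, so release == nullptr
    if (stream->get_next(stream, &chunk) != 0) {
      throw std::runtime_error("get_next failed");
    }
    if (chunk.release == nullptr) break;  // released array: end of stream
    total += chunk.length;
    chunk.release(&chunk);  // done with this chunk
  }
  return total;
}

// Toy producer for the demo: each chunk has 3 rows and owns no buffers.
inline void ToyArrayRelease(ArrowArray* array) { array->release = nullptr; }

inline int ToyGetNext(ArrowArrayStream* stream, ArrowArray* out) {
  int* remaining = static_cast<int*>(stream->private_data);
  *out = ArrowArray{};
  if (*remaining > 0) {
    --*remaining;
    out->length = 3;
    out->release = ToyArrayRelease;
  }
  return 0;  // 0 means success; a released `out` signals end of stream
}

// Builds a toy stream with n_chunks chunks and counts its rows.
inline int64_t DemoCountRows(int n_chunks) {
  ArrowArrayStream stream{};
  stream.get_next = ToyGetNext;
  stream.private_data = &n_chunks;
  return CountRows(&stream);
}
```

The end-of-stream check comes before any use of the chunk, so the per-chunk work never runs on the released array that marks the end of the stream.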
Hi - it is nice to have a way to get pyarrow tables into C++ via nanoarrow and pybind. Is this an officially supported use case? I have lingering questions about the ownership semantics here. Am I OK as long as the pyarrow table doesn't get deleted / garbage collected and I don't hold the nanoarrow table past the duration of the call?
This might be worth calling out as an advantage of nanoarrow - I've found this code very portable across compilers / versions, especially compared to `pybind`-ing the `arrow` C++ package.
https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrow-pycapsule-interface
Python:
`pybind` method: