Closed jorisvandenbossche closed 1 year ago
It would also be nice to do this in the ADBC libraries (which currently use a handwritten Python wrapper for this purpose)
Yes, certainly, if we decide to go this way, we should support it in the places we control that interact with the C Data Interface, like ADBC (we should also check with other producers/consumers whether this is possible to use)
- The consumer renames the capsule (which gives some protection to the C Data pointer being consumed more than once)
This doesn't seem necessary in our case, as the consumer would typically move the C struct contents.
But can a consumer move the C struct content multiple times (and potentially call the release callback multiple times)?
The other reason for renaming in the case of DLPack is that it checks for this in the capsule deleter, to only call the release callback of the struct in case it was not consumed/renamed (for example cupy's implementation). But that's in the end the same question I assume, because it also depends on whether it is OK to call the release callback multiple times.
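To make the renaming scheme concrete, here is a minimal sketch of a consume-once capsule, driven from Python through `ctypes.pythonapi` (CPython only). The capsule names `dltensor` and `used_dltensor` are the ones used by the DLPack Python specification; the `produce`/`consume` helpers and the `c_int64` payload standing in for the real managed struct are made up for illustration.

```python
import ctypes

api = ctypes.pythonapi
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_IsValid.restype = ctypes.c_int
api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]
api.PyCapsule_SetName.restype = ctypes.c_int
api.PyCapsule_SetName.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Module-level bytes objects: a capsule stores the name pointer without
# copying it, so the name must outlive the capsule.
FRESH_NAME = b"dltensor"
USED_NAME = b"used_dltensor"

# Stand-in for the real struct (assumption: a plain int64 is enough here).
payload = ctypes.c_int64(42)


def produce():
    """Hypothetical producer: wrap the payload address, no destructor."""
    return api.PyCapsule_New(ctypes.addressof(payload), FRESH_NAME, None)


def consume(capsule):
    """Hypothetical consumer: read the payload once, then rename the capsule."""
    if not api.PyCapsule_IsValid(capsule, FRESH_NAME):
        raise ValueError("capsule was already consumed or has an unexpected name")
    address = api.PyCapsule_GetPointer(capsule, FRESH_NAME)
    api.PyCapsule_SetName(capsule, USED_NAME)  # marks the capsule as consumed
    return ctypes.c_int64.from_address(address).value


capsule = produce()
first = consume(capsule)      # succeeds
try:
    consume(capsule)          # the name no longer matches
    second_attempt_failed = False
except ValueError:
    second_attempt_failed = True
```

A capsule deleter in this scheme would make the same name check and only call the struct's deleter when the name is still the fresh one.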
Ah, I suppose this is handled by setting the release callback to NULL when moving the C struct content. And then the capsule deleter should check for that (instead of checking for it being renamed)? Then the guideline for consuming an Arrow C struct through a PyCapsule would be to always move the struct?
Indeed, you don't call the release callback when moving the structure: https://arrow.apache.org/docs/format/CDataInterface.html#moving-an-array
The consumer can move the ArrowArray structure by bitwise copying or shallow member-wise copying. Then it MUST mark the source structure released (see “released structure” above for how to do it) but without calling the release callback. This ensures that only one live copy of the struct is active at any given time and that lifetime is correctly communicated to the producer.
As usual, the release callback will be called on the destination structure when it is not needed anymore.
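The move semantics quoted above can be sketched in Python with `ctypes`. The field layout follows `struct ArrowArray` from the C Data Interface, with all pointers kept opaque as `c_void_p`; the helper names (`move_array`, `release_array`, `toy_release`) are hypothetical and exist only for this illustration.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """Field layout of `struct ArrowArray`; pointers are kept opaque."""
    _fields_ = [
        ("length", ctypes.c_int64),
        ("null_count", ctypes.c_int64),
        ("offset", ctypes.c_int64),
        ("n_buffers", ctypes.c_int64),
        ("n_children", ctypes.c_int64),
        ("buffers", ctypes.c_void_p),
        ("children", ctypes.c_void_p),
        ("dictionary", ctypes.c_void_p),
        ("release", ctypes.c_void_p),  # NULL (None) means "released"
        ("private_data", ctypes.c_void_p),
    ]


ReleaseFunc = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))


def is_released(array):
    return array.release is None


def move_array(src, dst):
    """Bitwise-copy src into dst, then mark src released WITHOUT calling release."""
    if is_released(src):
        raise ValueError("source structure is already released")
    ctypes.memmove(ctypes.addressof(dst), ctypes.addressof(src),
                   ctypes.sizeof(ArrowArray))
    src.release = None


def release_array(array):
    """Call the release callback if (and only if) the struct is still live."""
    if not is_released(array):
        ReleaseFunc(array.release)(ctypes.pointer(array))


# Toy producer (assumption): records each call and marks the struct released.
release_calls = []


@ReleaseFunc
def toy_release(array_ptr):
    release_calls.append(array_ptr.contents.length)
    array_ptr.contents.release = None


source = ArrowArray(length=3)
source.release = ctypes.cast(toy_release, ctypes.c_void_p).value

dest = ArrowArray()
move_array(source, dest)   # source is now marked released
release_array(source)      # no-op: nothing left to release here
release_array(dest)        # the callback runs exactly once, on the destination
```

Because the source is marked released as part of the move, a second move from the same struct is rejected, and the release callback cannot run twice.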
The PyCapsule has a destructor defined that would call the release callback (in case the object never got consumed)
I think this would still be useful.
Yes, we would still keep it, but just using a different check to see when it should be called (checking if the release callback is not null, instead of the changed capsule name, as dlpack does and as I mentioned in the top post)
I created a PR on the adbc (https://github.com/apache/arrow-adbc/pull/702) and pyarrow (https://github.com/apache/arrow/pull/35739) side as a small experiment with this.
Hi, I am creating the RC0 right now. I can add this in future RCs if they are created.
Describe the enhancement requested
Currently we have the various `_export_to_c`/`_import_from_c` methods for working with the Arrow C Interface that expect integers as arguments for the struct pointers. We could also use `PyCapsule` objects for this instead of integers (or, certainly initially, in addition to them), inspired by a similar interface from DLPack.
DLPack provides a stable in-memory data structure that allows exchanging array data between frameworks. It essentially plays the same role for ndarrays (tensors) as the Arrow C interface does for Arrow-compatible data (columnar data). It also defines a stable C ABI with a similar C struct definition (header file).
In the DLPack project, apart from the stable C ABI struct, they also defined a Python specification (including a method name to access the protocol, i.e. `__dlpack__`); see https://dmlc.github.io/dlpack/latest/python_spec.html#implementation for the details. For that specification, they return not a raw pointer but a `PyCapsule` object (a Python object that represents an "opaque value", such as a pointer, and that can be used by C extensions to pass such values through Python code to other C code; see https://docs.python.org/3/c-api/capsule.html).
Some details based on their implementation:
The proposal would be to mimic what DLPack does in places where we now expect or return an integer pointer (the interface needs to be different from the current `_export_to_c`, as we would now return a capsule instead of having the return pointer as a parameter of the method).
Component(s)
Python