apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Use PyCapsule for communicating C Data Interface pointers at the Python level #34031

Closed jorisvandenbossche closed 1 year ago

jorisvandenbossche commented 1 year ago

Describe the enhancement requested

Currently, the various _export_to_c / _import_from_c methods for working with the Arrow C Data Interface expect integers as arguments for the struct pointers. We could use PyCapsule objects for this instead of integers (or, certainly initially, in addition to them), inspired by a similar interface from DLPack.

DLPack provides a stable in-memory data structure that allows exchanging array data between frameworks. It essentially plays the same role for ndarrays (tensors) as the Arrow C Data Interface does for Arrow-compatible (columnar) data. It also defines a stable C ABI with similar C struct definitions (header file).

In the DLPack project, apart from the stable C ABI struct, they also defined a Python specification (including a method name to access the protocol, i.e. __dlpack__), see https://dmlc.github.io/dlpack/latest/python_spec.html#implementation for the details. For that specification, they return not a raw pointer but a PyCapsule object (a Python object that represents an "opaque value", such as a pointer, and that can be used by C extensions to pass such values through Python code to other C code, https://docs.python.org/3/c-api/capsule.html).

Some details based on their implementation:

The proposal would be to mimic what DLPack does in places where we now expect or return an integer pointer (the interface needs to differ from the current _export_to_c, as we would now return a capsule instead of taking the pointer to fill as a parameter of the method).
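To make the difference between the two styles concrete, here is a minimal sketch that uses only the standard library. It is not pyarrow's API: `ExampleStruct`, `Producer`, `export_to_address`, `export_to_capsule`, `import_from_capsule` and the capsule name `example_struct` are all hypothetical, and the CPython capsule functions are reached through `ctypes.pythonapi` purely for illustration (a real implementation would live in C or Cython).

```python
import ctypes

# CPython capsule C API, bound through ctypes for illustration only.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
PyCapsule_GetPointer.restype = ctypes.c_void_p
PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Hypothetical capsule name. The capsule stores this pointer without copying,
# so the bytes object must outlive every capsule that uses it.
CAPSULE_NAME = b"example_struct"


class ExampleStruct(ctypes.Structure):
    """Stand-in for a C Data Interface struct (not the real ArrowArray)."""
    _fields_ = [("length", ctypes.c_int64)]


class Producer:
    def __init__(self, length):
        self._length = length
        self._exported = []  # keeps exported structs alive

    def export_to_address(self, address):
        """Integer style: fill a struct the caller allocated, given as an int."""
        ExampleStruct.from_address(address).length = self._length

    def export_to_capsule(self):
        """Capsule style: return a capsule that wraps the struct pointer."""
        struct = ExampleStruct(self._length)
        self._exported.append(struct)
        # No destructor (None) in this sketch.
        return PyCapsule_New(ctypes.addressof(struct), CAPSULE_NAME, None)


def import_from_capsule(capsule):
    """Reject anything that is not a capsule carrying the expected name."""
    if not PyCapsule_IsValid(capsule, CAPSULE_NAME):
        raise TypeError("expected a capsule named 'example_struct'")
    address = PyCapsule_GetPointer(capsule, CAPSULE_NAME)
    return ExampleStruct.from_address(address).length


producer = Producer(length=3)

# Integer-pointer style: the caller owns the struct and passes its address.
target = ExampleStruct()
producer.export_to_address(ctypes.addressof(target))

# Capsule style: the pointer travels inside an opaque, named Python object.
capsule = producer.export_to_capsule()
```

One thing the capsule buys over a bare integer is a type check: `import_from_capsule(12345)` raises a `TypeError` in this sketch, whereas an integer-based import has no way to tell a valid address from an arbitrary number.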

Component(s)

Python

lidavidm commented 1 year ago

It would also be nice to do this in the ADBC libraries (which currently use a handwritten Python wrapper for this purpose)

jorisvandenbossche commented 1 year ago

Yes, certainly. If we decide to go this way, we should support it in the places we control that interact with the C Data Interface, like ADBC (we should also check with other producers/consumers whether this is possible to use).

pitrou commented 1 year ago

  • The consumer renames the capsule (which gives some protection to the C Data pointer being consumed more than once)

This doesn't seem necessary in our case, as the consumer would typically move the C struct contents.

jorisvandenbossche commented 1 year ago

  • The consumer renames the capsule (which gives some protection to the C Data pointer being consumed more than once)

This doesn't seem necessary in our case, as the consumer would typically move the C struct contents.

But can a consumer move the C struct contents multiple times (and potentially call the release callback multiple times)?

The other reason for renaming in the case of DLPack is that it checks for this in the capsule deleter, to only call the release callback of the struct in case it was not consumed/renamed (for example cupy's implementation). But that's in the end the same question I assume, because it also depends on whether it is OK to call the release callback multiple times.
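For reference, the DLPack renaming convention can be sketched as follows. The capsule names `dltensor` and `used_dltensor` come from the DLPack Python specification; everything else (the `produce` and `consume` helpers, and the dummy `c_int64` payload standing in for a `DLManagedTensor`) is a hypothetical illustration built on `ctypes.pythonapi`, not code from any real library.

```python
import ctypes

capi = ctypes.pythonapi
capi.PyCapsule_New.restype = ctypes.py_object
capi.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
capi.PyCapsule_IsValid.restype = ctypes.c_int
capi.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
capi.PyCapsule_GetPointer.restype = ctypes.c_void_p
capi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]
capi.PyCapsule_SetName.restype = ctypes.c_int
capi.PyCapsule_SetName.argtypes = [ctypes.py_object, ctypes.c_char_p]
capi.PyCapsule_GetName.restype = ctypes.c_char_p
capi.PyCapsule_GetName.argtypes = [ctypes.py_object]

# Capsule names from the DLPack Python specification. Kept at module level
# because a capsule stores the name pointer without copying the string.
DLTENSOR = b"dltensor"
USED_DLTENSOR = b"used_dltensor"

# Dummy payload standing in for a DLManagedTensor (assumption for the sketch).
payload = ctypes.c_int64(7)


def produce():
    """Producer side: wrap the pointer in a capsule named 'dltensor'."""
    return capi.PyCapsule_New(ctypes.addressof(payload), DLTENSOR, None)


def consume(capsule):
    """Consumer side: take the pointer once, then rename the capsule."""
    if not capi.PyCapsule_IsValid(capsule, DLTENSOR):
        raise ValueError("not a live 'dltensor' capsule (already consumed?)")
    address = capi.PyCapsule_GetPointer(capsule, DLTENSOR)
    # Renaming marks the capsule as consumed, so a second consume() fails.
    capi.PyCapsule_SetName(capsule, USED_DLTENSOR)
    return ctypes.c_int64.from_address(address).value


capsule = produce()
first = consume(capsule)
```

A second `consume(capsule)` raises `ValueError` here, because the name no longer matches. A capsule deleter following the same convention would check for the original name before calling the struct's release callback.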

jorisvandenbossche commented 1 year ago

Ah, I suppose this is handled by setting the release callback to NULL when moving the C struct contents. And then the capsule deleter should check for that (instead of checking whether the capsule was renamed)? Then the guideline for consuming an Arrow C struct through a PyCapsule would be to always move the struct?

pitrou commented 1 year ago

Indeed, you don't call the release callback when moving the structure: https://arrow.apache.org/docs/format/CDataInterface.html#moving-an-array

The consumer can move the ArrowArray structure by bitwise copying or shallow member-wise copying. Then it MUST mark the source structure released (see “released structure” above for how to do it) but without calling the release callback. This ensures that only one live copy of the struct is active at any given time and that lifetime is correctly communicated to the producer.

As usual, the release callback will be called on the destination structure when it is not needed anymore.
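The move semantics quoted above, together with the "only release if the callback is not NULL" check discussed for the capsule deleter, can be sketched in pure Python with `ctypes`. The field layout follows the `ArrowArray` struct of the C Data Interface; `producer_release`, `move_array` and `release_if_live` are hypothetical helper names, and the toy callback frees nothing. The `release` field is declared as a raw pointer here only so that NULL is easy to read and write from Python.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """Field layout of the ArrowArray struct from the C Data Interface."""


# Signature of the release callback: void (*release)(struct ArrowArray*)
RELEASE_FUNC = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    # Declared as a raw pointer (instead of RELEASE_FUNC) so that NULL can be
    # read and written as None; the memory layout is the same.
    ("release", ctypes.c_void_p),
    ("private_data", ctypes.c_void_p),
]

release_calls = []  # records each invocation of the producer's callback


@RELEASE_FUNC
def producer_release(array_ptr):
    """Toy producer callback: frees nothing, just marks the struct released."""
    release_calls.append(array_ptr.contents.length)
    array_ptr.contents.release = None


def move_array(src, dst):
    """Move src into dst: bitwise copy, then mark src released (no callback)."""
    ctypes.memmove(ctypes.addressof(dst), ctypes.addressof(src),
                   ctypes.sizeof(ArrowArray))
    src.release = None


def release_if_live(array):
    """Check a capsule deleter could use: call release only if it is non-NULL."""
    if array.release:
        RELEASE_FUNC(array.release)(ctypes.byref(array))


exported = ArrowArray(length=3)
exported.release = ctypes.cast(producer_release, ctypes.c_void_p).value

imported = ArrowArray()
move_array(exported, imported)

release_if_live(exported)  # no-op: the move marked the source as released
release_if_live(imported)  # calls the producer's callback
release_if_live(imported)  # no-op: the callback cleared the release pointer
```

Because the move clears `release` on the source, a deleter attached to the original struct has nothing left to do once a consumer has taken the data, which is why a NULL check can replace the capsule renaming in this scheme.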

westonpace commented 1 year ago

The PyCapsule has a destructor defined that would call the release callback (in case the object never got consumed)

I think this would still be useful.

jorisvandenbossche commented 1 year ago

The PyCapsule has a destructor defined that would call the release callback (in case the object never got consumed)

I think this would still be useful.

Yes, we would still keep it, but use a different check to decide when it should be called (checking that the release callback is not NULL, instead of checking for a changed capsule name as DLPack does and as I mentioned in the top post).

jorisvandenbossche commented 1 year ago

I created a PR on the adbc (https://github.com/apache/arrow-adbc/pull/702) and pyarrow (https://github.com/apache/arrow/pull/35739) side as a small experiment with this.

raulcd commented 1 year ago

Hi, I am creating the RC0 right now. I can add this in future RCs if they are created.