jorisvandenbossche commented 1 year ago

We added a new protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and new dunder methods __arrow_c_schema/array/stream__ (https://github.com/apache/arrow/issues/35531 / https://github.com/apache/arrow/pull/37797).

We recently also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (https://github.com/apache/arrow/pull/34972).

The currently merged PyCapsule protocol uses the stable non-device interface, but so the question is how to integrate the device version in the protocol in order to expose the C Device Data Interface in Python as well. Some options:

1) Only support the device versions going forward (like currently only the cpu version is supported, i.e. the returned capsules always contain a device array/stream). _{(this is a backwards incompatible change, but given we labeled the protocol as experimental, we can still make such changes if we think this is the best long-term option. The capsule names would reflect this change, thus this will generate a proper python error if a consumer or producer would not yet have been updated, and we can actually first deprecate the non-device support in pyarrow before removing it. All to say that AFAIU this is perfectly possible if we want it.)}

2) Add separate dunder methods __arrow_c_device_array__ and __arrow_c_device_stream__, and then it is up to the producer to implement those dunders if they can (and we can strongly recommend doing that, also for CPU-only libraries), and to consumers to check which ones are present.

3) Allow the consumer to request a device array with some keyword (eg __array_c_array__(device=True)), which gives the consumer the option to request it while also still giving the producer the possibility to raise an error if they don't (yet) support the device version.

4) Support both options in the current methods without keyword, i.e. allow __arrow_c_array__ to return both a "arrow_array" or "arrow_device_array" capsule (and their capsule name distinguishes both). With the recommendation to always return a device version if you can, but allowing producers to still return a cpu version if they don't support the device one. This only gives some flexibility to the producer, and no control to the consumer to request the CPU version (so this essentially expects that all consumers will handle the device version)

Options 2/3/4 are probably just variants of how to expose both interfaces, and thus the main initial question is whether we want to, long term, move towards an ecosystem where everyone uses the C Device Data Interface, or to keep using both interfaces side by side (as the main interchange mechanism, I mean, the device interface of course still embeds the standard struct).

jorisvandenbossche commented 1 year ago

Looking at the DLPack python page (https://dmlc.github.io/dlpack/latest/python_spec.html), they also define a __dlpack_device__ dunder method, and have a stream keyword in the __dlpack__ method. Do we need something similar here for the sync event? My understanding is that we don't need anything here in addition to the actual C struct, since the sync_event is already part of that? (cc @kkraus14)

pitrou commented 1 year ago

I would in favor of the second approach (separate dunder methods). Python methods are cheap, and these are really two different protocols.

paleolimbot commented 1 year ago

I would in favor of the second approach (separate dunder methods). Python methods are cheap, and these are really two different protocols.

I also would favour a separate dunder method. A consumer that is expecting an ArrowDeviceArray or stream almost certainly has the capability to construct one (i.e., the producer not implementing the method is not a barrier).

kkraus14 commented 1 year ago

The downside of using a separate dunder method is that you run into ecosystem fragmentation and inefficiencies. I.E. there was a similar effect that happened with __array_interface__ (https://numpy.org/doc/stable/reference/arrays.interface.html) and __cuda_array_interface__ (https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html). Libraries and applications primarily developed against numpy despite the __array_function__ protocol which made them think and develop in a CPU centric way despite just using array APIs. GPU libraries then needed to either implicitly copy to CPU to support __array_interface__ or raise an Exception to try to guide the developer in the right direction, neither of which is a good developer UX.

Compare that to dlpack (https://github.com/dmlc/dlpack) and the associated Python protocol surrounding it (https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html) which has had device support as a primary goal from its inception. Deep Learning libraries that have multiple device backends like PyTorch, Tensorflow, JAX, etc. all support dlpack but they don't support __array_interface__ or __cuda_array_interface__ because of those issues. Similarly, the __dataframe__ protocol (https://data-apis.org/dataframe-protocol/latest/API.html) has device support as a primary goal as well.

pitrou commented 1 year ago

I don't get the fragmentation and inefficiency argument. Implementing the C Data Interface is necessary to also implement the C Device Data Interface (since the latter is based on the former). Almost no implementation effort is saved by eschewing the former to go with the latter.

In any case, the C Data Interface already exists in C++, C#, Go, Java, Rust, nanoarrow and several third-party projects... It's simply logical to expose it in Python as well.

jorisvandenbossche commented 8 months ago

I would like to revive this, now there is some movement in exposing the Device Interface in pyarrow (low level bindings have been added, https://github.com/apache/arrow/issues/39979) and in supporting it in cudf (https://github.com/rapidsai/cudf/pull/15047).

Concretely, I think adding a separate set of dunder methods, __arrow_c_device_array__ and __arrow_c_device_stream__ (option 2 from the top post), would be the best option moving forward. It's not that it are entirely separate methods. For a producer/consumer that can handle the device interface, most of the code will be shared with the code required to handle the standard C interface, given that the C Device Interface is only a small extension on top of the C Interface struct. But keeping the dunder methods separate on the Python side allows libraries that only support the C Data Interface to still implement that part of the PyCapsule protocol. The absence of the device versions of the dunder methods is then also an indication that this producer only supports CPU data.

The actual methods can mimic the current ones (see https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export), just with adapted names:

__arrow_c_device_array__ returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
__arrow_c_device_stream__ returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream"

And both methods can then similarly also accept a requested_schema keyword.

Some questions:

My understanding is that in the Python API for the dunder methods, we don't need to expose anything to deal with the sync_event, as that is entirely handled by the consumer through the struct itself (so in contrast with the DLPack __dlpack__ method which does have a stream keyword. But they way how the even works, this is not needed). This is correct?
Do we want to add an __arrow_c_device__ method that returns the device of the array or stream? This would be similar as the __dlpack_device__ from DLPack, and I suppose essentially just an easier way to quickly check the device of the object. For DLPack, one reason to have this is to check the device before you can pass the correct stream to __dlpack__ (according to https://dmlc.github.io/dlpack/latest/python_spec.html#syntax-for-data-interchange-with-dlpack), and that's of course not a relevant reason for us (based on the previous bullet point)
Similarly as with the PyCapsule protocol we already added, I would for now not specify anything about how the consumer side should look like (in contrast to DLPack which has a from_dlpack specified through the Array API). That also means it is up to the consumer library how to deal with device copies etc (although I assume typically a copy will be avoided unless explicitly asked?) EDIT: I see that the Array API standard recently added copy and device keywords to from_dlpack to be more explicit or ask to copy to a different device (https://github.com/data-apis/array-api/pull/741)

kkraus14 commented 8 months ago

My understanding is that in the Python API for the dunder methods, we don't need to expose anything to deal with the sync_event, as that is entirely handled by the consumer through the struct itself (so in contrast with the DLPack __dlpack__ method which does have a stream keyword. But they way how the even works, this is not needed). This is correct?

Yes, this is correct.

Do we want to add an __arrow_c_device__ method that returns the device of the array or stream? This would be similar as the __dlpack_device__ from DLPack, and I suppose essentially just an easier way to quickly check the device of the object. For DLPack, one reason to have this is to check the device before you can pass the correct stream to __dlpack__ (according to https://dmlc.github.io/dlpack/latest/python_spec.html#syntax-for-data-interchange-with-dlpack), and that's of course not a relevant reason for us (based on the previous bullet point)

This can always be added later if there proves to be a need for it. I would push to keep things minimal for now to avoid needing to make breaking changes.

Similarly as with the PyCapsule protocol we already added, I would for now not specify anything about how the consumer side should look like (in contrast to DLPack which has a from_dlpack specified through the Array API). That also means it is up to the consumer library how to deal with device copies etc (although I assume typically a copy will be avoided unless explicitly asked?) EDIT: I see that the Array API standard recently added copy and device keywords to from_dlpack to be more explicit or ask to copy to a different device (Support copy and device keywords in from_dlpack data-apis/array-api#741)

+1 to not defining from_* methods yet. I suspect we'll eventually want something similar with copy and device keywords to pass to the producer, but I think we can wait for feedback before building things out on that front.

jorisvandenbossche commented 8 months ago

Thanks for the feedback @kkraus14. I started a PR updating the documentation page about the PyCapsule protocol to expand with those two new dunder methods -> https://github.com/apache/arrow/pull/40708

Apart from just defining __arrow_c_device_array__ and __arrow_c_device_stream__ and their return values, it might be useful to agree on and document some guidelines about when to implement which methods?

For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)
The presence of only the standard version (e.g. only __arrow_c_array__ and not __arrow_c_device_array__) means that this is a CPU-only data object.
For a device-aware library, and for data structures that can only reside in non-CPU memory, you should only implement the device version of the protocol (e.g. only add __arrow_c_device_array__, and never add a __arrow_c_array__)
- Libraries can of course have data structures that can live on both CPU or non-CPU, and for those it is fine that they implement both versions (and error in the non-device version if the data is not on the CPU)? EDIT: this has to be fine of course, given that pyarrow is in this situation, and we want to define both methods. But should we error in __arrow_c_array__ for non-CPU data? (right now we don't actually check the device here, but silently return an ArrowArray struct with null buffer pointers)
Do we want to say something about expectations that no cross-device copies happen? Although we don't specify anything about a public consumer method (like from_dlpack), we could still give the guideline to not let such method do device copies? Although that is maybe fully up to the consumer library to decide, but at least in the pyarrow consumer methods, we will not do any copies.

pitrou commented 8 months ago

For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)

+0. I'm not sure it makes sense to ask producers for more effort in this regard.

The presence of only the standard version (e.g. only __arrow_c_array__ and not __arrow_c_device_array__) means that this is a CPU-only data object.

+1

For a device-aware library, and for data structures that can only reside in non-CPU memory, you should only implement the device version of the protocol (e.g. only add __arrow_c_device_array__, and never add a __arrow_c_array__)

+1

Libraries can of course have data structures that can live on both CPU or non-CPU, and for those it is fine that they implement both versions (and error in the non-device version if the data is not on the CPU)?

+1

EDIT: this has to be fine of course, given that pyarrow is in this situation, and we want to define both methods. But should we error in __arrow_c_array__ for non-CPU data?

Yes, we should. The expectation of the (regular) C Data Interface is that data lives on the CPU.

Do we want to say something about expectations that no cross-device copies happen?

In the producer or in the consumer? IMHO the consumer is free to do whatever suits them. On the producer side the question is a bit more delicate. Perhaps we need to pass some options to __arrow_c_device_array__.

paleolimbot commented 8 months ago

+1 to all! Perhaps just +0 to:

For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)

I am not sure I would recommend that (although it is easy to provide producers boilerplate cython or helper functions to help them do it if they would like).

Do we want to say something about expectations that no cross-device copies happen?

As a consumer I suppose I would be very surprised if a producer performed a cross-device copy to happen in a call to __arrow_c_device_xxx__.

vyasr commented 7 months ago

Do we want to say something about expectations that no cross-device copies happen?

My preference would be that at the outset the interface recommends that no implicit copies occur, but that there is no normative language prohibiting it. I agree with @paleolimbot that in a vacuum a copy would be surprising, but for objects that support data resident on either host or device (e.g. a pytorch array) it could be reasonable for this to happen. If a library already supports implicit back-and-forth copying via its public APIs then it should be acceptable for the same to happen via the capsule interface (in either direction, i.e. you could access the device interface and trigger H2D or vice versa), whereas if a library requires some explicit transfer mechanism (e.g. APIs like pytorch.tensor.to(torch.device(...))) then the library would probably require the same before accessing the associated capsule interface.

Perhaps we need to pass some options to __arrow_c_device_array__.

I would recommend that we wait on this and see how the implementations look in practice. We could always add a allow_copy=False parameter with a default later without breaking implementers of the interface. To properly support libraries that implement both the device and host interfaces we would presumably need to add the same parameter to both since implicit H2D and D2H copies are equally undesirable.

vyasr commented 7 months ago

Since @jorisvandenbossche mentioned the ongoing work in cudf's C++ to support the C device data interface in rapidsai/cudf#15047, I put together a (very, very preliminary) concept of what exposing this in Python could look like in rapidsai/cudf#15370. There is a lot of work to be done before that PR can come out of draft, but I'm happy to help people on this thread use it as a testing ground for ideas on the Python capsule interface if that's useful. Feedback is of course welcome, but given how preliminary the implementation is I wouldn't waste any time on details. For now I haven't implemented the dunder methods on any classes, but the free functions are doing the same capsule construction and should be trivial to toss into a class's dunder method once cudf decides exactly what to expose. In the meantime the free functions offer the same interface for testing.

kkraus14 commented 7 months ago

For a CPU-only library, it is encouraged to implement both the standard and device version of the protocol methods (i.e. both __arrow_c_array__ and __arrow_c_device_array__, and/or both __arrow_c_stream__ and __arrow_c_device_stream__)

I would be +1 on encouraging being able to consume both protocols, but +0.5 on encouraging to produce both protocols. Agreed with @pitrou's comment about it being more effort with the only return being future proofing for potential non-cpu usage.

I do think we should start discussing what does it look like for a CPU-only library to request data from a non-CPU library. I.E. if say Pandas wants to consume data produced by cuDF. Presuming we figure out a convenient api to pass that information to cuDF, would it return an __arrow_c_device_array__ with the device as CPU or return an __arrow_c_array__? The former puts an onus on all consumers to support the device variants of the protocol.

pitrou commented 7 months ago

Presuming we figure out a convenient api to pass that information to cuDF, would it return an __arrow_c_device_array__ with the device as CPU or return an __arrow_c_array__?

Do we want those protocols to support cross-device copies implicitly at all?

jorisvandenbossche commented 7 months ago

I do think we should start discussing what does it look like for a CPU-only library to request data from a non-CPU library.

Initially I would expect this to raise an error (i.e. by default indeed not allowing cross-device copies). In your example, pandas would check the device, see that it is not CPU, and therefore error that creating a pandas.DataFrame from non-CPU data is not possible.

But it's a good point that we at least should consider this case and decide whether we want to support more. If we want to make a cross-device copy possible, the idea is that we let the consumer specify a "requested device type" (like we have a requested schema), so that the producer can do the copy?

There might be use cases of enabling it as opt-in. For the example of cudf -> pandas (or -> polars, or duckdb, or any other CPU-only library), if a user actually wants the data to be copied, pandas cannot do this themselves, and it would be cudf that need to perform the copy. So if we want to allow that through this interface, there needs to be a way to signal that.

Of course we can (initially) say that this interface doesn't support that. But that does mean that if pandas wants to support ingesting (copying) non-CPU data generically, not tied to a specific library, that's not really possible. Because it would first need do the device-to-host copy using the passed object's APIs (eg for a cudf DataFrame call some cudf-specific method to copy that to CPU memory), losing the benefits of a generic protocol.

paleolimbot commented 7 months ago

if a user actually wants the data to be copied, pandas cannot do this themselves, and it would be cudf that need to perform the copy.

That is a great point! The producer almost certainly knows how to perform a copy in the most efficient way, but the consumer, particularly a CPU consumer, probably does not want to handle those details.

More broadly, for all of the __arrow_c_xxx__ protocol methods, the caller does not have any way to know how expensive calling the method will be (e.g., it may perform a copy or allocate quite a lot if transforming from item/rowwise to Arrow). Making all copies for all methods explicit seems safer but, as Joris mentioned, severely limits the utility of the protocol.

pitrou commented 7 months ago

Does this mean that we'd like to expand the protocol in https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrowarray-export to take additional arguments?

If so, this begs the question: should there be a more robust mechanism to add optional arguments after some producer implementations have already been published?

pitrou commented 7 months ago

That said, the notion of schema requests pretty much implies that we allow implicit data copies when doing a producer-side conversion. So perhaps we're overthinking this, and producers of non-CPU data should simply implement the C Data Interface protocol with implicit cross-device copies.

jorisvandenbossche commented 7 months ago

For the points mentioned above, except for the current "allow triggering device copy" discussion, I did an attempt summarizing this in an additional section in the spec PR https://github.com/apache/arrow/pull/40708. Please take a look!

(although some parts will have to be updated if we would allow implicit cross-device copies)

kkraus14 commented 7 months ago

Are there cases where we need zero copy regardless of going across devices?

Do we ever want to support mutating memory via these interfaces? If so, we don't have a clear way to guarantee we don't mutate a copy.
If a consumer has the ability to generically handle both array-like things and dataframe-like things, it makes it a bit harder to implement efficiently. I.E. I might want to write something that has a flow like "try to get arrow zero copy, fall back to dlpack zero copy, fall back to arrow copy, fall back to dlpack copy".

Assuming we make copies implicit via __arrow_c_array__, this makes the only copy we'll support handling is device --> CPU for now. What would addressing along more generic copy requests later look like?

pitrou commented 7 months ago

Are there cases where we need zero copy regardless of going across devices?

I don't think so, but this is all very new so it's difficult to guess.

Do we ever want to support mutating memory via these interfaces?

Not officially IMHO. Specific consumers and providers may want to ensure this, but that would rely on external conventions.

jorisvandenbossche commented 7 months ago

Assuming we make copies implicit via __arrow_c_array__, this makes the only copy we'll support handling is device --> CPU for now. What would addressing along more generic copy requests later look like?

I assume we could have something like obj.__arrow_device_array__(requested_device=kCPU), and then it is up to the producer to see if they can provide the data on that device, and if not error or return on native device (depending on whether the requested device should be followed strictly. For requested_schema we decided this was only best effort).

So perhaps we're overthinking this, and producers of non-CPU data should simply implement the C Data Interface protocol with implicit cross-device copies.

That means that passing such a non-CPU object, like a cudf DataFrame, to an interface that can consume data through this protocol (eg pandas or polars constructors, duckdb query with implicit variable, ...) would automatically do a potentially costly device copy of the full data structure. I am a bit hesitant to do enable that implicitly, that might be unexpected in some cases? (although maybe also convenient ..).

If so, this begs the question: should there be a more robust mechanism to add optional arguments after some producer implementations have already been published?

I assume the simple but verbose way to do this is to put the onus on the consumer: if they want to use a newer keyword, they need to do that in a try/except, falling back on the version without the keyword, such that it works for producers that support or do not yet support the new keyword.

(we could in theory already add a catch-all **kwargs to the protocol methods, but that will then silently ignore new keywords if not yet supported by a certain producer, so not sure that is better than raising an error)

jorisvandenbossche commented 7 months ago

I assume the simple but verbose way to do this is to put the onus on the consumer: if they want to use a newer keyword, they need to do that in a try/except

That also seems what DLPack recommends for calling __dlpack__ after they added some additional keywords, see last example at https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html

zeroshade commented 7 months ago

My personal opinion and sensibilities likes the suggested obj.__arrow_device_array__(requested_device=kCPU) and putting the onus on the producer to attempt to provide the data on that device. When that param is left as None then the producer should provide zero-copy to the device that the data is currently on.

That means that passing such a non-CPU object, like a cudf DataFrame, to an interface that can consume data through this protocol (eg pandas or polars constructors, duckdb query with implicit variable, ...) would automatically do a potentially costly device copy of the full data structure. I am a bit hesitant to do enable that implicitly, that might be unexpected in some cases? (although maybe also convenient ..).

For clarity I always prefer to avoid the implicit copy unless it's entirely necessary for adoption.

paleolimbot commented 7 months ago

It seems like perhaps __arrow_c_device__ should not perform a cross-device copy, but perhaps the producer could be free to decide if implementing __arrow_c_array__ and allowing an implicit copy is going to be problematic. I imagine most producers would prefer not to allow this implicit copy but in the grand scheme of transformations that might occur if __arrow_c_array__ is called (e.g., bitpacking uint8 vectors to boolean, converting decimal32s to decimal128s), it might be that the device copy is just not a problem. In any case, the producer has all the right information to decide about this.

That would mean that a consumer of __arrow_c_device__ has to have a plan to handle data that isn't on the correct device yet, and I think that is reasonable. If they are a GPU library, they almost certainly have a way to get the data to the device they want. If they are a CPU-only library, they should have called __arrow_c_array__. I wonder if adding a requested_device parameter is a feature that should be added when its utility becomes clear (which is a good reason for perhaps documenting the plan for evolution of the protocol).

vyasr commented 7 months ago

I like the obj.__arrow_device_array__(requested_device=device) API. Producers should be responsible for handling the conversion and providing suitable errors if necessary. I'm skeptical of making this conversion be "best effort" though; the potential error modes are quite bad (e.g. segfaults) and won't be very friendly even for implementers of a given consumer.

Putting the onus on consumers to specify a valid set of arguments depending on the protocol version does seem appropriate.

Deciding whether or not making a particular kind of copy is too expensive seems out of scope for the protocol. Ultimately I'm not sure even producers should be completely in the business of deciding what is too expensive. Producers know best how to perform the copies and how expensive those copies are in absolute terms, but ultimately only end users know the cost relative to their overall workflow and whether the convenience is worthwhile. Producers could be configurable in some way to determine whether to allow implicit copies, but such configuration is out of scope of this spec. Consumers should be able to make the desired request of a producer, allowing producers to choose to honor that request or fail gracefully (e.g. cudf could decide not to allow implicit D2H copies unless the user first does cudf.set_option("allow_implicit_arrow_copies", True); without that option set, any implicit copy could raise an Exception).

That said, the notion of schema requests pretty much implies that we allow implicit data copies when doing a producer-side conversion.

@pitrou could you clarify what you mean by that?

pitrou commented 7 months ago

That said, the notion of schema requests pretty much implies that we allow implicit data copies when doing a producer-side conversion.

@pitrou could you clarify what you mean by that?

Well, quoting the spec:

In some cases, there might be multiple possible Arrow representations of the same data. For example, a library might have a single integer type, but Arrow has multiple integer types with different sizes and sign.

If different representations of the same data can be returned, then it means that at least some of these representations incur a copy. For example, if a library has a string type encoded as (length, ptr) string views, then regardless of whether the producer exposes them as a Arrow String, LargeString or StringView, it needs to convert the lengths into Arrow offsets, and to linearize the string data as a contiguous buffer.

jorisvandenbossche commented 7 months ago

Practical experimentation will help in informing the decisions we have to make here regarding the control of cross-device copies (e.g. would a requested_device keyword be useful?). Therefore I would like to suggest that we start with a minimal addition (just the new methods as currently described in https://github.com/apache/arrow/pull/40708, without further keywords), and get the implementation for pyarrow merged for 16.0. The guidelines / recommendations section can later be updated while we get experience with the first implementations.

Based on the above discussion, I would add the following to the PR?

The device protocol methods should return data as-is on the device it is currently on (i.e. the expectation is that there is no cross-device copy happening in this method) (sidenote: of course in case someone would implement a tabular object that could use different devices for different columns, this guarantee of "no device copy" cannot be made, given that the resulting structure's data should live on a single device. But that seems a corner case not worth mentioning in (complicating) the spec?)
A device-aware producer can implement __arrow_c_array/stream__ that does an implicit device to CPU copy when called (but this is currently not explicitly standardized as the recommended way to provide CPU data). (this means that a consumer supporting multiple devices (like pyarrow) should always first check the device protocol methods before the CPU-only versions. And checking my PR implementing this for pyarrow (https://github.com/apache/arrow/pull/40717), I see I need to update for that)

Or would people prefer to directly go with a more explicit API like requested_device keyword?

paleolimbot commented 7 months ago

I agree with all of these points, although I am admittedly coming from the CPU world. I think with the initial restrictions that cross-device copies should not be made when calling a device protocol method we get the benefit of interoperability for most use cases while maintaining backward compatibility should there be a need for the consumer to negotiate the producer to perform a copy. I also think that initial implementations will be critical to ensure that we did not miss something else entirely.

zeroshade commented 7 months ago

my vote is for the explicit requested_device keyword...

pitrou commented 7 months ago

Well, for me the question is: how do we later add options to the API without breaking compatibility with producers that don't implement those options?

paleolimbot commented 7 months ago

That is a good question!

I think Joris mentioned earlier that one approach would be to try: and do something else if there was an error passing the extra keywork argument. That seems like it would scale poorly and perhaps lead to confusing errors but there may be precedent/a better way to do this.

It seems like inspect.signature() makes it possible to introspect argument names, if the evolution of this protocol can be fully captured by the existence of formal argument names. I checked and it seems to work for methods implemented in Cython (worth checking that this is also true for methods implemented in pybind11 or nanobind).

One could also provide __arrow_c_capabilities__ (or something) such that hasattr(x, "__arrow_c_capabilities__") and "something" in x.__arrow_c_capabilities__ captures the ability to use one or more of the protocol methods in a particular way. Perhaps versioning the protocol and querying that would be more straightfoward.

Or perhaps other ways?

jorisvandenbossche commented 7 months ago

Well, for me the question is: how do we later add options to the API without breaking compatibility with producers that don't implement those options?

I am not sure if other options are better than how this works in general for any python API: it's the responsibility of the user to ensure the used keyword is supported. Of course, typically you do that by means of a library version check (for a new keyword you want to use added in a certain version of that library), which is not possible here. But you can still do that as user (i.e. consumer) of the protocol methods, for example by using try/except or by inspecting the signature. It's not very nice, but certainly possible.

We could add a "version" number to the capsule protocol, or "capabilities" as Dewey mentioned above (some __arrow_c_version__ or __arrow_c_capabilities__?), but in the end this still keeps the responsibility on the consumer side, but "just" tries to make this check a bit easier (eg a version check to pass different keywords, instead of a try/except with different keywords). But in terms of code on the consumer side, I don't think this will make things that much easier, so I am not sure this is worth the complexity.

(also, when we would add something like a "version" concept, we need to be very clear that this is a Python side capsule protocol thing, and not a version of the C Data Interface itself)

pitrou commented 7 months ago

Ok, I think you're right @jorisvandenbossche .

paleolimbot commented 7 months ago

I am not sure if other options are better than how this works in general for any python API: it's the responsibility of the user to ensure the used keyword is supported.

I also agree...there are some options if this becomes a problem, but I don't think anything should be added from the producer side.

I think the ability to experiment with this in a released version of pyarrow (16.0.0) for the next few months is much more valuable than getting the perfect set of initial keyword arguments, which can be finalized when the protocol methods come out of an EXPERIMENTAL state.

kkraus14 commented 7 months ago

I'm +1 to have a versioning that can be queried and/or provided by consumers. Some prior art is how the array API handled this:

https://data-apis.org/array-api/latest/API_specification/generated/array_api.__array_api_version__.html
- There's a queryable version of the specification
https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__array_namespace__.html
- Consumer optionally provides a version to the producer, which can raise a NotImplementedError if the version is not supported

The other thing that comes to mind is that a lot of the capabilities being discussed here are definitely not limited to Python where they'd ideally be part of a C level spec / protocol that Python binds instead of implementing a separate protocol in Python.

jorisvandenbossche commented 7 months ago

The other thing that comes to mind is that a lot of the capabilities being discussed here are definitely not limited to Python where they'd ideally be part of a C level spec / protocol

Versioning of or tracking of capabilities in the C Data Interface itself is out of scope for this discussion, I would say (it's an interesting topic and has come up before, I think, but a much bigger discussion, eg because it requires a new struct).

And so if we are only speaking about versioning the python-level capsule interface, I still doubt that this is worth it (for the reasons I mentioned above, https://github.com/apache/arrow/issues/38325#issuecomment-2043536682, checking a version will almost be as much work as checking if a keyword is available in a python method).

The way that the Array API implemented this also isn't directly applicable in our case, AFAIU, given we don't have a namespace but just a method (so letting the consumer pass a version keyword to the dunder method would not help to determine which other keywords are available to be specified). As mentioned above, we could add something like __arrow_c_capsule_version__ returning 1.0 (added the "capsule" to make it clear it is about the capsule interface, not the c interface itself), and that version could indicate which methods are available. But even then as a consumer you might need to ensure to handle failures for a certain keyword anyway (for example, we currently have the requested_schema keyword, but even in pyarrow we initially didn't implement that for all cases, so potentially raising a NotImplementedError when that is passed. Do we then fully implement version 1.0 and can we return that as version?)

vyasr commented 7 months ago

While I agree that supporting a version attribute is more work, it also feels more natural than introspecting an API to determine what is supported. If I read code relying on signature inspection I assume it's doing something fairly hacky by default (which isn't to say that I haven't written such code on multiple occasions before...).

If we don't want versioning, another option (that I'm not necessarily a huge fan of but is worth considering) is defining the protocol as a function that accepts arbitrary keyword arguments obj.__arrow_device_array__(**kwargs). Then producers can choose what arguments to support. We could allow consumers to opt out (or opt in) to error-handling:

class Foo:
    def __arrow_device_array__(self, *, strict=True, **kwargs):
        try:
            parse_kwargs(kwargs)
        except BaseException as e:
            if strict:
                raise
            else:
                warn_about_unsupported_kwargs(kwargs, e)

This way all options are only honored if they are supported. Otherwise, it's up to the consumer if they want this to fail loudly or if they want to handle things themselves.

One downside of this approach is that it assumes that consumers are fine stating the superset of all options that they want and then accepting whether or not producers will honor those options. If consumers want to dynamically adjust the calls based on the options available from a particular producer, they will have to resort to doing so more manually (e.g. via isinstance checks that sort of defeat the purpose of a protocol). It also precludes the signature-based introspection.

kkraus14 commented 7 months ago

Versioning of or tracking of capabilities in the C Data Interface itself is out of scope for this discussion, I would say (it's an interesting topic and has come up before, I think, but a much bigger discussion, eg because it requires a new struct).

It's moreso that we're talking about this topic because of the desire to introduce something like a requested_device consumer provided parameter, so it's not just a new struct, but a way for a consumer to provide information to a producer in a standardized way at a C-API level as opposed to Python-API level.

@vyasr's proposal above is interesting, and we could possibly restrict it even further in that we could document that all evolutions to the protocol that introduce new parameters should have a default value of None that correspond to the previous behavior. I.E. we could do something like:

v1 producer

class Foo:
    def __arrow_device_array__(self, *, strict=True, **kwargs):
        for key, value in kwargs.items():
            if value is not None:
                raise NotImplementedError(f"{key}" parameter is not supported)

        return capsule

v2 producer

class Foo:
    def __arrow_device_array__(self, *, strict=True, requested_device=None, **kwargs):
        for key, value in kwargs.items():
            if value is not None:
                raise NotImplementedError(f"{key}" parameter is not supported)

        # handle `requested_device` here in a meaningful way where None implies same behavior as v1
        return capsule

From a consumer perspective, passing any new parameter becomes purely optional and they can handle unsupported parameters via try / except handling. I don't think signature inspection works here because a producer implementation may explicitly handle a parameter by explicitly throwing on non-default values for example.

jorisvandenbossche commented 7 months ago

It's moreso that we're talking about this topic because of the desire to introduce something like a requested_device consumer provided parameter, so it's not just a new struct, but a way for a consumer to provide information to a producer in a standardized way at a C-API level as opposed to Python-API level.

Sorry, I don't understand this paragraph. How is adding the requested_device keyword or not (or ways to know if that keyword is supported) something that plays that the C-API level? The C API is the struct and there is nothing in there that allows passing options, that's all at the python level.

@vyasr's proposal above is interesting, and we could possibly restrict it even further in that we could document that all evolutions to the protocol that introduce new parameters should have a default value of None that correspond to the previous behavior. I.E. we could do something like:

I don't have a strong objection to add that, but to be clear, this all still means that the consumer has to do the work to check if the keyword is supported or not or has been honored or not.

To use a concrete example, with the current proposal of no special mechanism, and if we add a requested_device keyword in the future, the consumer code could look like (for an example consumer that can only handle CPU data):

try:
    capsules = object.__arrow_c_device_array__(requested_device=kCPU)
except TypeError:
    capsules = object.__arrow_c_device_array__()
    # manually check the returned array capsule has a CPU device and otherwise error

With the proposal above to add the **kwargs that raise if not None:

try:
    capsules = object.__arrow_c_device_array__(requested_device=kCPU)
except NotImplementedError:
    capsules = object.__arrow_c_device_array__()
    # manually check the returned array capsule has a CPU device and otherwise error

or if the strict keyword is used (and honored by the producer), this is a bit simpler:

capsules  = object.__arrow_c_device_array__(requested_device=kCPU, strict=False)
# manually check the returned array capsule has a CPU device and otherwise error

From a consumer perspective, passing any new parameter becomes purely optional and they can handle unsupported parameters via try / except handling.

Exactly the same is true for no explicit specification of adding **kwargs, the only difference is the error type to catch.

we could document that all evolutions to the protocol that introduce new parameters should have a default value of None that correspond to the previous behavior

I think that is something that we will do anyway in practice: if we add a new keyword in the future, it should always have a default that matches the previous behaviour, such that not specifying it preserves behaviour (otherwise it would be a breaking change). This is not exactly a "default value of None", but essentially comes down to the same?

To summarize, if there is a strong desire to add the **kwargs and checking of it, and potentially the strict=False option, to the specification, I can certainly live with that. But I do think that 1) it does not add that much compared to the base situation of not specifying it (also in that case you can handle it with a simple try/except), and 2) it adds some details to the specification that the consumer wants to rely on and that producers can get wrong (i.e. the fact that the producer will check the kwargs, the exact error type that the producer will raise if they are present, that the producer will honor a strict=False keyword). This is of course not hard to get right, but I am personally not sure it is worth the small gain.

pitrou commented 7 months ago

I think we should avoid vague and confusing terms such as strict, because it will bear a different meaning and expectations for each person.

As for requested_device, if we think it is needed, then let's just add it immediately?

As for always taking **kwargs, this looks reasonable to me. Also, we can mandate that None be the default value for all optional arguments we add in the future. This makes checking the kwargs for non-default values very easy, for example:

    non_default_kwargs = [name for name, value in kwargs.items() if value is not None]
    if non_default_kwargs:
        raise NotImplementedError(f"some nice error message with {non_default_kwargs}")

jorisvandenbossche commented 7 months ago

As for always taking **kwargs, this looks reasonable to me

Do we then require that the producer raise an error if unknown kwargs / kwargs with non-default values are passed? (like your example code snippet to do that) Because if not requiring that, this might make it actually harder for the consumer? (eg if passing requested_device and getting no error, that could either mean it was honored or ignored)

pitrou commented 7 months ago

Do we then require that the producer raise an error if unknown kwargs / kwargs with non-default values are passed? (like your example code snippet to do that)

Yes, we probably want to. If a producer ignores a hypothetical requested_device and returns data on another device, then the consumer might just crash.

jorisvandenbossche commented 7 months ago

Python does that automatically for you (raising an error for an unknown keyword) if not adding **kwargs, so the main benefit of adding the explicit **kwargs is that it would allow a consumer to actually pass a keyword with a default value without getting an error, even if the producer didn't yet support that keyword?
I.e. so someone can do obj.__arrow_c_device_array__(requested_device=None) regardless of obj supporting that keyword or not.

jorisvandenbossche commented 7 months ago

As for requested_device, if we think it is needed, then let's just add it immediately?

The specifics for requested_device could then be:

For the default of requested_device=None, the producer should provide the data with the device that the data is currently on.
When specified, it accepts integer values corresponding to ArrowDeviceType from the ABI
The producer should attempt to copy the data to the provided device, and if it cannot do that (eg doesn't support that device), raise an exception (ValueError?) (thus, not "best effort" like requested_schema, but actually raise an error if not possible to fulfill the request)

Does that sound OK?

If we add this keyword, do we keep the notion that a non-CPU library is allowed to also implement the non-device-aware methods like __arrow_c_array__ doing an implicit device-to-CPU copy? Because that can then be covered with the more explicit __arrow_c_device_array__(requested_device=kCPU), although on the other hand it can still be useful for consumers that don't yet support the Device variants of the structs

pitrou commented 7 months ago

Python does that automatically for you (raising an error for an unknown keyword) if not adding **kwargs, so the main benefit of adding the explicit **kwargs is that it would allow a consumer to actually pass a keyword with a default value without getting an error, even if the producer didn't yet support that keyword?

Exactly.

pitrou commented 7 months ago

Does that sound OK?

Hmm, I would rather let @vyasr and @kkraus14 decide what is best here.

paleolimbot commented 7 months ago

Only mentioning this because I just ran across it, but dlpack apparently uses a consumer-provided version:

Starting Python array API standard v2023, a new max_version argument is added to dlpack for the consumer to signal the producer the maximal supported DLPack version.

For what it's worth, I think a producer-defined __arrow_c_api_version__ is the cleanest possible solution: it's very easy to implement for a producer, and very easy for a consumer to generate an informative error (or adjust calling code accordingly). The other ways are fine too, but it seems like they are possibly inventing a brand new way to deal with Python API evolution and I am not sure there is sufficient evidence based on the experience of the array API or dlpack that there is a need to invent a brand new way to do this.

When specified, it accepts integer values corresponding to ArrowDeviceType from the ABI

Would the request also have to specify the device_id?

Because that can then be covered with the more explicit __arrow_c_device_array__(requested_device=kCPU)

From a CPU calling library perspective, I would prefer to only have to deal with __arrow_c_array__(), or possibly __arrow_c_array__(copy=True/False/None) in some future version of that method (since a cross-device copy is not the only type of copy a caller might want to prevent).

Adding requested_device is fine too and I'm in favour of all outcomes that let this interface start being implemented, I am just worried that adding an argument before any implementations exist is going to mean that we risk having to change its interpretation or remove it (defeating the purpose of adding it early).

jorisvandenbossche commented 7 months ago

Note that for DLPack this is about the C ABI version, not the Python interface. So it allows the user to request a certain version of the struct to be put in the returned capsule (DLPack has different versions of the C struct). That's something different as what we are discussing here.

jorisvandenbossche commented 7 months ago

When specified, it accepts integer values corresponding to ArrowDeviceType from the ABI

Would the request also have to specify the device_id?

That's a good question, hopefully @kkraus14 or @vyasr could answer that. If needed, then the argument would be a tuple of two ints for (device_type, device_id) (I see that's also what DLPack does with their dl_device keyword in __dlpack__)

I'm in favour of all outcomes that let this interface start being implemented, I am just worried that adding an argument before any implementations exist is going to mean that we risk having to change its interpretation or remove it (defeating the purpose of adding it early).

Likewise here, happy to spec and implement any agreed proposal, but my personal preference would still be to just start with the simplest option (no keywords, no special kwarg handling for future evolution) with clearly labeling it as experimental, and re-evaluate this in some time after we have initial implementations and usage.

apache / arrow

[Python] Expose the device interface through the Arrow PyCapsule protocol #38325

v1 producer

v2 producer