[Closed] leofang closed this 4 months ago
cc: @rgommers @seberg @tqchen @oleksandr-pavlyk
Thanks Leo! Should I take over the first two commits in gh-602?
Let's just review/merge #602 first, this branch was based on the one there.
When two libraries exchange data via copying, potentially across devices, there might not be a common `stream` object that can be recognized and used by both libraries. The discussed solutions involve different ways of requesting synchronous copies (see HackMD).
In that discussion I was perhaps missing a more precise problem statement. Also note that that HackMD link isn't public, so here is what it says for potential solutions:
@seberg suggested: the important thing is that `stream=None` (or not passing it) synchronizes if you specify `device="cpu"`. What happens if a stream is passed seems not too important to me: the consumer passed it, so they must expect the right thing.

@leofang suggested two solutions. Solution 1: piggyback on `stream=None` to request a sync copy when the device is not the same. Solution 2: add `stream=-2` to signal that the two devices do not have a common language (stream/queue).
I had a look at what PyTorch does, and it's basically the same as @seberg's suggestion I believe. From https://pytorch.org/docs/master/notes/cuda.html#asynchronous-execution:

> In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously.

It goes on to explain a `non_blocking` keyword, but that's not relevant to the `kDLCPU` device (only pinned memory).
I am starting to think that defining synchronization based on the target device (i.e. the one passed in) may have some advantages:
Since we only want to enable copying to host (where there is no stream concept) at this moment, do we really need to go into this for the purposes of getting this PR in? It looks to me like this can be deferred.
> do we really need to go into this for the purposes of getting this PR in? It looks to me like this can be deferred.
Well, there are two things we need, I guess:

1. `device="cpu"` without passing a stream must synchronize fully.
2. It is unclear whether `device="cpu", stream=123123` actually means a CUDA stream, and CuPy must not hope it does.

Streams really should always have been fully type safe, but OK (type safety would resolve this: we could just say "one of them, figure it out based on type").
Maybe I should change my proposal to this: streams belong to the target device unless the producer knows the stream cannot possibly belong to it (e.g. the stream is typed to be a `CudaStream`, say an integer subclass). The important part here is that the default must be in terms of the target!
So arguably we should say that a producer must ignore the stream object for the time being unless they are 100% sure that UB is impossible for some reason. I.e. as of now it is unclear whether `device="cpu", stream=123123` actually means a CUDA stream, and CuPy must not hope it does.
I'd say that:

- The `stream` parameter in `array_api.array.__dlpack__` seems pretty clear that the stream number is that of the consumer. `None` is accepted.
- Since only `None` is accepted, changing that to now require something else (like -2) looks like an unnecessary backwards-compat break.
> We need to specify that `device="cpu"` without passing a stream must synchronize fully.
That sounds good to add. It needs to avoid the string `"cpu"` and use text like "when the `device=` keyword receives an argument representing the CPU/host device".
Well, the more I think about it the more I think it makes sense to say: if `dl_device=` is passed, the stream must refer to the specified target device's definition for streams. For example, if `dl_device=CPU`, only `stream=None` is valid and full synchronization must occur, even if the data is currently on a CUDA device.
A producer may ignore the stream passed and opt for the safe default synchronization of the target device.
That covers what we need, and in principle allows a cpu -> gpu exporter to use a stream. A gpu -> cpu exporter can also use a stream if the CPU is specified as `CUDAManaged` or `CUDAHost`.
The main thing is: It doesn't leave an odd gap in the spec.
Now there might be a point for more complex synchronization schemes in the future. But maybe that just needs a future extension. (The annoyance is that it would be nice to not need a try/except for such a future extension, but beggars can't be choosers)
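The rule above can be sketched from the producer's side. This is a toy illustration only, not any library's actual implementation: the class, `_full_sync`, and the "capsule" return values are made up for the example (the device type codes 1 and 2 do match DLPack's `kDLCPU`/`kDLCUDA`).

```python
KDLCPU = 1   # DLPack device type code for kDLCPU
KDLCUDA = 2  # DLPack device type code for kDLCUDA

class ToyArray:
    """Toy producer illustrating target-device stream semantics."""

    def __init__(self, data, device_type=KDLCUDA):
        self.data = data
        self.device_type = device_type
        self.synchronized = False

    def _full_sync(self):
        # Stand-in for a device-wide synchronization (e.g. cudaDeviceSynchronize).
        self.synchronized = True

    def __dlpack__(self, *, stream=None, dl_device=None, copy=None):
        # The stream argument is interpreted in terms of the *target* device.
        target = dl_device[0] if dl_device is not None else self.device_type
        if target == KDLCPU:
            # Only stream=None is valid for a CPU target; fully synchronize
            # even if the data currently lives on a CUDA device.
            if stream is not None:
                raise BufferError("stream must be None when the target is CPU")
            self._full_sync()
            return ("capsule", list(self.data))  # stand-in for a D2H copy
        # Non-CPU target: the stream refers to the target device's stream
        # type (e.g. a CUDA stream handle as an integer) and may be used
        # to order the export asynchronously.
        return ("capsule", self.data)
```

The key design point is that `dl_device`, not the data's current location, decides how `stream` is interpreted, which is exactly what closes the spec gap discussed above.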
I see that you are continuing the discussion we had during the meeting on the subject:

> When two libraries exchange data via copying, potentially across devices, there might not be a common `stream` object that can be recognized and used by both libraries.
I think Sebastian is right to conclude that

> If `dl_device=` is passed the stream must refer to the specified target device's definition for streams.
The `stream` object and the `dl_device` object must go in as a valid pair. One subtlety is that, while it is in theory possible for a D2H copy to be performed asynchronously on a stream, with the current "one-way" DLPack protocol there is no safe way to allow this. DLPack is one-way in the sense that during handshaking only the consumer offers `stream` (and now `dl_device` too, with this PR) to the producer, not the other way around.[^1]
[^1]: It would be two-way handshaking if both parties offered their `dl_device` and `stream` for negotiation -- then even D2H copies could be done asynchronously, via a bunch of complex logic checking device compatibility and establishing ordering. (By the way, when I developed the draft and exchanged with Sebastian offline, I somehow mistakenly assumed we had two-way handshaking, or at least part of it -- that a consumer can query which devices the producer supports.) But the ship has sailed, and I guess it would make the implementation so complicated that it's probably not worth it, even if we could start DLPack over from scratch.
Given no major concern with this PR based on the discussions in the original issue, in the call, and above, let me proceed to make necessary changes to refine this PR.
> Also note that that HackMD link isn't public, so here is what it says ...
Interesting, I didn't know that. I thought we just need to log in HackMD to view it?
> or at least a part of it -- that a consumer can query what devices the producer supports.
The new introspection API should allow that, but I agree that it's probably not worth it to go there.
> I thought we just need to log in HackMD to view it?
No, HackMD has permissions. It's also transient; for any issue/PR with a relevant discussion we should be posting a summary on GitHub.
> If `dl_device=` is passed the stream must refer to the specified target device's definition for streams.
Sounds right. One thing that stood out to me related to this in this PR is that the "only CPU/host is supported" part is in the `from_dlpack` changes, but the `__dlpack__` changes are more general. Is that on purpose (if so, I don't mind)?
pre-commit and Sphinx have some very minor complaints, other than that this looks pretty good to me now. I guess it'll stay in draft until there is movement on https://github.com/dmlc/dlpack/pull/136 (since this PR explicitly mentions setting a to-be-added C level flag)?
May I suggest a change from:

```
The v2023.12 standard only mandates that a compliant library must offer a \
way for ``from_dlpack`` to create an array on CPU (using ...
```

to:

```
The v2023.12 standard only mandates that a compliant library must offer a \
way for ``from_dlpack`` to create an array whose underlying memory is \
accessible to the Python interpreter (using ...
```
I want to reiterate that while this means transferring data to the colloquial "CPU", the notion of a CPU device is overloaded. In oneAPI, a CPU device refers to a SYCL device which adheres to the SYCL execution model, memory model, and programming model, and is different from the host device, which refers to the physical CPU device and memory system free of these restrictions.

I want the verbiage used in the spec to reflect the essential requirement mandated by the standard: to make data available to the Python interpreter process. Opinions are welcome.
That looks like a good suggestion to me, thanks @oleksandr-pavlyk
Is there a timeline issue here as well? `from_dlpack` should do nothing with the new `copy` and `device` keywords beyond forwarding them to the producer. So if the producer doesn't support that functionality yet, what happens?

In a later revision, this was clarified: a `try`-`except` should be used to check whether the `copy` and `device` keywords are supported.
> One thing that stood out to me related to this in this PR is that the "only CPU/host is supported" is part of the `from_dlpack` changes, but the `__dlpack__` changes are more general. Is that on purpose (if so, I don't mind)?
Indeed, yes. I was thinking we want to limit user-visible effects. If array library maintainers think they can get creative in `__dlpack__` in a way that wouldn't interfere with future evolutions of the array API standard, they can do so.
> I guess it'll stay in draft until there is movement on dmlc/dlpack#136 (since this PR explicitly mentions setting a to-be-added C level flag)?

That's the part I am not sure about... The contents of the dlpack PR and this PR are cross-referenced and intertwined. I am fine going either way once @tqchen has time to take a look and offer us feedback. I can also break the dlpack PR into two (one only adding the new flag, the other updating the C/Python docs) if it helps, but I wouldn't do so unless Tianqi agrees that it would help.
> I want the verbiage used in the spec to reflect the essential requirement mandated by the standard to make data available to the Python interpreter process.
Applied in commit 338742a.
(Requested Aaron's review from the array-api-compat perspective.)
Thanks for looping me in. Personally, I find there are two goals in here which are slightly intermingled:

- G0: The ability to do zero copy exchange
- G1: The ability to cross device copy

My main thought is that we should keep API support for G0 simple and ensure `from_dlpack` comes with clear intent, as in zero copy. G1 might indeed open up significant implementation overhead, as well as different possible ways of synchronization, that I think would merit a different API (`to_device`?).
My main question is how to phase out the API, so most libraries can start with G0. But happy to hear others' thoughts as well.
> My main question is how to phase out the API, so most libraries can start with G0

Nobody is required to implement the cross-device API; we could bump the minor API version (so consumers can know whether or not the flag should be set). Flagging `IS_COPY` (and a code change) is thus only needed if the library is atypical and doesn't use zero-copy exchange.

For the array API, this is a bit different, because I think there the intention is to additionally require `dl_device="cpu"` to work. That is a maybe slightly overpowered way to allow comparing two libraries in tests (e.g. by converting both to the same CPU library, like NumPy).
> Personally, I find there are two goals in here which are slightly intermingled
>
> - G0: The ability to do zero copy exchange
> - G1: The ability to cross device copy

That is the right framing I think indeed.
> I think would merit a different API (`to_device`?).

I think a separate API would be more problematic to do, because:

- `to_device` will not work. Asking a library like CuPy to return a DLPack with host memory for under-the-hood usage is one thing; asking it to do so in a public/user-facing API looks like a nonstarter.
- Using a function other than `from_dlpack` while still doing everything through DLPack would lead to a strange asymmetry.

> so most libraries can start with G0
I think one should always start there if that's an option. `from_dlpack(y, device=x.device)` will be zero-copy if the data is already on the same device. If we really wanted to ensure always-zero-copy by default, then the `copy` keyword could default to `False` rather than `None`; that way `from_dlpack(y, device=x.device)` would never be useful and authors would have to write `from_dlpack(y, copy=None, device=x.device)` (or `copy=True`).
Does it really matter though if this is done using `from_dlpack` with the `device` keyword or some other function like `xp.from_dlpack_with_devicetransfer`? The former seems preferable, and I don't think it detracts from the "prefer zero-copy" nature of DLPack.
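The same-device zero-copy behavior is easy to observe with NumPy acting as both producer and consumer (everything already lives on the CPU, so the `device=`/`copy=` keywords from this PR are not exercised here):

```python
# With producer and consumer on the same device, from_dlpack's default
# ("may copy", copy=None) results in no copy at all.
import numpy as np

y = np.arange(3, dtype=np.float32)
z = np.from_dlpack(y)  # same device: zero-copy exchange

assert np.shares_memory(y, z)  # both arrays view the same buffer
```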
Thanks @rgommers for the comments. I think as long as packages can phase their implementation status this would be fine.
Thanks @tqchen @rgommers @seberg for the comments. Indeed, G0 (zero copy) has been the goal and we've achieved that since the first version of the standard. What's new with this PR is the `copy` keyword to enforce G0 if needed (setting `copy=False` will ensure we never copy, a guarantee we couldn't have before).

Q: Does anyone know why the doc build has a warning?
```
/home/circleci/repo/src/array_api_stubs/_draft/array_object.py:docstring of array_api_stubs._draft.array_object._array.__dlpack__:1: WARNING: py:class reference target not found: Enum
```
I thought `Enum` has been used in the return value type before and it worked just fine?
> I thought `Enum` has been used in the return value type before and it worked just fine?

Those warnings are a real pain. I pushed a fix. For why it was needed here and not in `__dlpack_device__` right below it, I have no explanation beyond "autodoc is buggy".
Reminder to all: Remember to review https://github.com/dmlc/dlpack/pull/136 too, which goes hand-in-hand with this PR.
The accompanying DLPack PR was merged, and DLPack 1.0rc was released. Let's merge this. Thanks everyone for the discussion!
Blocked by #602. Closes #626. As discussed there, we expand the DLPack data exchange protocol to formalize copies. In particular, we allow two common needs:

1. Controlling copy behavior explicitly (through the new `copy` argument). The default is still `None` ("may copy") to preserve backward compatibility and align with other constructors (ex: `asarray`).
2. Copying across devices, currently limited to device-to-host (through the new `device` argument, plus setting `copy=True`).

In principle, arbitrary cross-device copies could be allowed too, but the consensus in #626 was that limiting to device-to-host copies is enough for now. This also avoids the need for exhaustive library support and for clarifying unforeseen semantics issues.

This PR requires an accompanying PR to the DLPack repo (https://github.com/dmlc/dlpack/pull/136), in order for the producer to signal if a copy has been made (through a new bit mask), so please review both together.
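The `copy=False` guarantee described above can be sketched as follows. This is a toy producer, not real library code: the class name, return values, and the "needs a copy" decision are all illustrative assumptions.

```python
# Sketch of the copy=False ("never copy") guarantee: a producer that
# would need to copy must raise instead of copying silently.

KDLCPU, KDLCUDA = 1, 2  # DLPack device type codes for kDLCPU / kDLCUDA

class ToyProducer:
    def __init__(self, device_type=KDLCUDA):
        self.device_type = device_type

    def __dlpack__(self, *, stream=None, dl_device=None, copy=None):
        # A copy is needed when the requested target differs from where
        # the data currently lives (illustrative condition).
        needs_copy = dl_device is not None and dl_device[0] != self.device_type
        if needs_copy and copy is False:
            # copy=False means the export must be zero-copy or fail.
            raise BufferError("cannot export to the requested device without a copy")
        if copy is True or needs_copy:
            return "capsule-of-a-copy"   # stand-in for an exported copy
        return "capsule-zero-copy"       # stand-in for a zero-copy export
```

This is the guarantee that did not exist before this PR: a consumer passing `copy=False` can now rely on either getting a zero-copy view or a `BufferError`, never a silent copy.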