[Closed] leofang closed this 4 months ago
cc: @rgommers @seberg @tqchen @oleksandr-pavlyk
Thanks Leo! Should I take over the first two commits in gh-602?
Let's just review/merge #602 first, this branch was based on the one there.
When two libraries exchange data via copying, potentially across devices, there might not be a common `stream` object that can be recognized and used by both libraries. The discussed solutions involve different ways of requesting synchronous copies (see HackMD).
In that discussion I was perhaps missing a more precise problem statement. Also note that that HackMD link isn't public, so here is what it says for potential solutions:
@seberg suggested: the important thing is that `stream=None` (or not passing it) synchronizes if you specify `device="cpu"`. What happens if a stream is passed seems not too important to me: the consumer passed it, so they must expect the right thing.

@leofang suggested two solutions. Solution 1: piggyback on `stream=None` to request a sync copy when the device is not the same. Solution 2: add `stream=-2` to signal that the two devices do not have a common language (stream/queue).
I had a look at what PyTorch does, and it's basically the same as @seberg's suggestion I believe. From https://pytorch.org/docs/master/notes/cuda.html#asynchronous-execution:

> In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously.

It goes on to explain a `non_blocking` keyword, but that's not relevant to the `kDLCPU` device (only pinned memory).
I am starting to think that defining synchronization based on the target device (i.e. the one passed in) may have some advantages:
Since we only want to enable copying to host (where there is no stream concept) at this moment, do we really need to go into this for the purposes of getting this PR in? It looks to me like this can be deferred.
> do we really need to go into this for the purposes of getting this PR in? It looks to me like this can be deferred.
Well, there are two things we need, I guess:

1. `device="cpu"` without passing a stream must synchronize fully.
2. It is unclear whether `device="cpu", stream=123123` actually means a CUDA stream, and CuPy must not hope it does.

Streams really should always have been fully type safe, but OK (type safety would resolve this: we could just say "one of them, figure it out based on type").
Maybe I should change my proposal to this: streams belong to the target device unless the producer knows the stream cannot possibly belong to it (e.g. the stream is typed to be a `CudaStream`, say an integer subclass). The important part here is that the default must be in terms of the target!
So arguably we should say that a producer must ignore the stream object for the time being unless they are 100% sure that UB is impossible for some reason. I.e. as of now it is unclear whether `device="cpu", stream=123123` actually means a CUDA stream, and CuPy must not hope it does.
I'd say that:

- The `stream` parameter in `array_api.array.__dlpack__` seems pretty clear that the stream number is that of the consumer. `None` is accepted.
- Since only `None` is accepted, changing that to now require something else (like -2) looks like an unnecessary backwards-compat break.
> We need to specify that `device="cpu"` without passing a stream must synchronize fully.
That sounds good to add. It needs to avoid the string `"cpu"` and use text like "when the `device=` keyword receives an argument representing the CPU/host device".
Well, the more I think about it the more I think it makes sense to say: if `dl_device=` is passed, the stream must refer to the specified target device's definition for streams. For example, if `dl_device=CPU`, only `stream=None` is valid and full synchronization must occur, even if the data is currently on a CUDA device.
A producer may ignore the stream passed and opt for the safe default synchronization of the target device.
That covers what we need, and in principle allows a cpu -> gpu exporter to use a stream. A gpu -> cpu exporter can also use a stream if the CPU is specified as `CUDAManaged` or `CUDAHost`.
The main thing is: It doesn't leave an odd gap in the spec.
Now there might be a point for more complex synchronization schemes in the future. But maybe that just needs a future extension. (The annoyance is that it would be nice to not need a try/except for such a future extension, but beggars can't be choosers)
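The rule above can be sketched from the producer's side. This is a toy illustration only, not any library's actual implementation: the class, `_full_sync`, and the "capsule" return values are made up for the example (the device type codes 1 and 2 do match DLPack's `kDLCPU`/`kDLCUDA`).

```python
KDLCPU = 1   # DLPack device type code for kDLCPU
KDLCUDA = 2  # DLPack device type code for kDLCUDA

class ToyArray:
    """Toy producer illustrating target-device stream semantics."""

    def __init__(self, data, device_type=KDLCUDA):
        self.data = data
        self.device_type = device_type
        self.synchronized = False

    def _full_sync(self):
        # Stand-in for a device-wide synchronization (e.g. cudaDeviceSynchronize).
        self.synchronized = True

    def __dlpack__(self, *, stream=None, dl_device=None, copy=None):
        # The stream argument is interpreted in terms of the *target* device.
        target = dl_device[0] if dl_device is not None else self.device_type
        if target == KDLCPU:
            # Only stream=None is valid for a CPU target; fully synchronize
            # even if the data currently lives on a CUDA device.
            if stream is not None:
                raise BufferError("stream must be None when the target is CPU")
            self._full_sync()
            return ("capsule", list(self.data))  # stand-in for a D2H copy
        # Non-CPU target: the stream refers to the target device's stream
        # type (e.g. a CUDA stream handle as an integer) and may be used
        # to order the export asynchronously.
        return ("capsule", self.data)
```

The key design point is that `dl_device`, not the data's current location, decides how `stream` is interpreted, which is exactly what closes the spec gap discussed above.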
I see that you are continuing the discussion we had during the meeting on the subject:

> When two libraries exchange data via copying, potentially across devices, there might not be a common `stream` object that can be recognized and used by both libraries.
I think Sebastian is right to conclude that

> If `dl_device=` is passed the stream must refer to the specified target device's definition for streams.
The `stream` object and the `dl_device` object must go in as a valid pair. One subtlety is that, while it is in theory possible for a D2H copy to be performed asynchronously on a stream, with the current "one-way" DLPack protocol there is no safe way to allow this. DLPack is one-way in the sense that during handshaking only the consumer offers `stream` (and now `dl_device` too, with this PR) to the producer, not the other way around.[^1]
[^1]: It would be two-way handshaking if both parties offered their `dl_device` and `stream` for negotiation -- then even D2H copies could be done asynchronously, via a bunch of complex logic checking device compatibility and establishing ordering. (By the way, when I developed the draft and exchanged with Sebastian offline, I somehow mistakenly assumed we had two-way handshaking, or at least part of it -- that a consumer can query which devices the producer supports.) But the ship has sailed, and I guess it would make the implementation so complicated that it's probably not worth it, even if we could start DLPack over from scratch.
Given no major concern with this PR based on the discussions in the original issue, in the call, and above, let me proceed to make necessary changes to refine this PR.
> Also note that that HackMD link isn't public, so here is what it says ...
Interesting, I didn't know that. I thought we just need to log in HackMD to view it?
> or at least a part of it -- that a consumer can query what devices the producer supports.
The new introspection API should allow that, but I agree that it's probably not worth it to go there.
> I thought we just need to log in HackMD to view it?
No, HackMD has permissions. It's also transient; for any issue/PR with a relevant discussion we should be posting a summary on GitHub.
> If `dl_device=` is passed the stream must refer to the specified target device's definition for streams.
Sounds right. One thing that stood out to me related to this in this PR is that the "only CPU/host is supported" part is in the `from_dlpack` changes, but the `__dlpack__` changes are more general. Is that on purpose (if so, I don't mind)?
pre-commit and Sphinx have some very minor complaints, other than that this looks pretty good to me now. I guess it'll stay in draft until there is movement on https://github.com/dmlc/dlpack/pull/136 (since this PR explicitly mentions setting a to-be-added C level flag)?
May I suggest a change from:

```
The v2023.12 standard only mandates that a compliant library must offer a \
way for ``from_dlpack`` to create an array on CPU (using ...
```

to:

```
The v2023.12 standard only mandates that a compliant library must offer a \
way for ``from_dlpack`` to create an array whose underlying memory is \
accessible to the Python interpreter (using ...
```
I want to reiterate that while this means transferring data to the colloquial "CPU", the notion of a CPU device is overloaded. In oneAPI, a CPU device refers to a SYCL device which adheres to the SYCL execution model, memory model, and programming model, and is different from the host device, which refers to the physical CPU device and memory system free of these restrictions.

I want the verbiage used in the spec to reflect the essential requirement mandated by the standard: to make data available to the Python interpreter process. Opinions are welcome.
That looks like a good suggestion to me, thanks @oleksandr-pavlyk
Is there a timeline issue here as well? `from_dlpack` should do nothing with the new `copy` and `device` keywords beyond forwarding them to the producer. So if the producer doesn't support that functionality yet, what happens?

In a later revision, this was clarified: a `try`-`except` should be used to check whether the `copy` and `device` keywords are supported.
> One thing that stood out to me related to this in this PR is that the "only CPU/host is supported" is part of the `from_dlpack` changes, but the `__dlpack__` changes are more general. Is that on purpose (if so, I don't mind)?
Indeed, yes. I was thinking we want to limit user-visible effects. If array library maintainers think they can get creative in `__dlpack__` in a way that wouldn't interfere with future evolutions of the array API standard, they can do so.
> I guess it'll stay in draft until there is movement on dmlc/dlpack#136 (since this PR explicitly mentions setting a to-be-added C level flag)?

That's the part I am not sure about... The contents of the dlpack PR and this PR are cross-referenced and intertwined. I am fine going either way once @tqchen has time to take a look and offer us feedback. I can also break the dlpack PR into two (one only adding the new flag, the other updating the C/Python docs) if it helps, but I wouldn't do so unless Tianqi agrees that it would help.
> I want the verbiage used in the spec to reflect the essential requirement mandated by the standard to make data available to the Python interpreter process.
Applied in commit 338742a.
(Requested Aaron's review from the array-api-compat perspective.)
Thanks for looping me in. Personally, I find there are two goals in here which are slightly intermingled:

- G0: The ability to do zero copy exchange
- G1: The ability to cross device copy

My main thought is that we should keep API support for G0 simple and ensure `from_dlpack` comes with clear intent, as in zero copy. G1 might indeed open up significant implementation overhead, as well as different possible ways of synchronization, that I think would merit a different API (`to_device`?).
My main question is how to phase out the API, so most libraries can start with G0. But happy to hear others' thoughts as well.
> My main question is how to phase out the API, so most libraries can start with G0

Nobody is required to implement the cross-device API; we could bump the minor API version (so consumers can know whether or not the flag should be set). Flagging `IS_COPY` (and a code change) is thus only needed if the library is atypical and doesn't use zero-copy exchange.

For the array API, this is a bit different, because I think there the intention is to additionally require `dl_device="cpu"` to work. That is a maybe slightly overpowered way to allow comparing two libraries in tests (e.g. by converting both to the same CPU library, like NumPy).
> Personally, I find there are two goals in here which are slightly intermingled
>
> - G0: The ability to do zero copy exchange
> - G1: The ability to cross device copy

That is the right framing I think indeed.
> I think would merit a different API (`to_device`?).

I think a separate API would be more problematic to do, because:

- `to_device` will not work. Asking a library like CuPy to return a DLPack with host memory for under-the-hood usage is one thing; asking it to do so in a public/user-facing API looks like a nonstarter.
- Using a function other than `from_dlpack` while still doing everything through DLPack would lead to a strange asymmetry.

> so most libraries can start with G0
I think one should always start there if that's an option. `from_dlpack(y, device=x.device)` will be zero-copy if the data is already on the same device. If we really wanted to ensure always-zero-copy by default, then the `copy` keyword could default to `False` rather than `None`; that way `from_dlpack(y, device=x.device)` would never be useful and authors would have to write `from_dlpack(y, copy=None, device=x.device)` (or `copy=True`).
Does it really matter though if this is done using `from_dlpack` with the `device` keyword or some other function like `xp.from_dlpack_with_devicetransfer`? The former seems preferable, and I don't think it detracts from the "prefer zero-copy" nature of DLPack.
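The same-device zero-copy behavior is easy to observe with NumPy acting as both producer and consumer (everything already lives on the CPU, so the `device=`/`copy=` keywords from this PR are not exercised here):

```python
# With producer and consumer on the same device, from_dlpack's default
# ("may copy", copy=None) results in no copy at all.
import numpy as np

y = np.arange(3, dtype=np.float32)
z = np.from_dlpack(y)  # same device: zero-copy exchange

assert np.shares_memory(y, z)  # both arrays view the same buffer
```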
Thanks @rgommers for the comments. I think as long as packages can phase their implementation status this would be fine.
Thanks @tqchen @rgommers @seberg for the comments. Indeed, G0 (zero copy) has been the goal and we've achieved that since the first version of the standard. What's new with this PR is the `copy` keyword to enforce G0 if needed (setting `copy=False` will ensure we never copy, a guarantee we couldn't have before).

Q: Does anyone know why the doc build has a warning?
```
/home/circleci/repo/src/array_api_stubs/_draft/array_object.py:docstring of array_api_stubs._draft.array_object._array.__dlpack__:1: WARNING: py:class reference target not found: Enum
```
I thought `Enum` has been used in the return value type before and it worked just fine?
> I thought `Enum` has been used in the return value type before and it worked just fine?

Those warnings are a real pain. I pushed a fix. For why it was needed here and not in `__dlpack_device__` right below it, I have no explanation beyond "autodoc is buggy".
Reminder to all: Remember to review https://github.com/dmlc/dlpack/pull/136 too, which goes hand-in-hand with this PR.
The accompanying DLPack PR was merged, and DLPack 1.0rc was released. Let's merge this. Thanks everyone for the discussion!
Blocked by #602. Closes #626. As discussed there, we expand the DLPack data exchange protocol to formalize copies. In particular, we allow two common needs:

1. Controlling copy behavior explicitly (through the new `copy` argument). The default is still `None` ("may copy") to preserve backward compatibility and align with other constructors (ex: `asarray`).
2. Copying across devices, currently limited to device-to-host (through the new `device` argument, plus setting `copy=True`).

In principle, arbitrary cross-device copies could be allowed too, but the consensus in #626 was that limiting to device-to-host copies is enough for now. This also avoids the need for exhaustive library support and for clarifying unforeseen semantics issues.

This PR requires an accompanying PR to the DLPack repo (https://github.com/dmlc/dlpack/pull/136), in order for the producer to signal if a copy has been made (through a new bit mask), so please review both together.
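The `copy=False` guarantee described above can be sketched as follows. This is a toy producer, not real library code: the class name, return values, and the "needs a copy" decision are all illustrative assumptions.

```python
# Sketch of the copy=False ("never copy") guarantee: a producer that
# would need to copy must raise instead of copying silently.

KDLCPU, KDLCUDA = 1, 2  # DLPack device type codes for kDLCPU / kDLCUDA

class ToyProducer:
    def __init__(self, device_type=KDLCUDA):
        self.device_type = device_type

    def __dlpack__(self, *, stream=None, dl_device=None, copy=None):
        # A copy is needed when the requested target differs from where
        # the data currently lives (illustrative condition).
        needs_copy = dl_device is not None and dl_device[0] != self.device_type
        if needs_copy and copy is False:
            # copy=False means the export must be zero-copy or fail.
            raise BufferError("cannot export to the requested device without a copy")
        if copy is True or needs_copy:
            return "capsule-of-a-copy"   # stand-in for an exported copy
        return "capsule-zero-copy"       # stand-in for a zero-copy export
```

This is the guarantee that did not exist before this PR: a consumer passing `copy=False` can now rely on either getting a zero-copy view or a `BufferError`, never a silent copy.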