I would like to propose to refrain from using "cpu" as an alias for the host (in SYCL terminology), i.e. where the application is started (usually a CPU). At the same time SYCL allows for, and oneAPI DPC++ implements, exposing a multi-core CPU as a SYCL device, allowing the user to offload data-parallel computation to the CPU, as well as to GPUs or other accelerators known to the SYCL runtime.
Bringing data to the host is left out of scope by the standard. If memory serves, most array library implementations provide a standalone asnumpy function, but the standard wants to consider NumPy on the same footing as other Python array libraries.
Would it be agreeable to propose a free function asbuffer(array) which would convert an array object to a Python object that implements the Python buffer protocol?
Kokkos also has parallel backends for CPU. I think I'm just asking for any portable way to do it across the libs eventually.
As discussed in the cross-linked PR, another complication is that, e.g., cupy.get() actually gives you a NumPy array rather than staying in the same array namespace the way torch does when switching from GPU to CPU. Both conversions ultimately give me something that I can use directly with NumPy assert_allclose(), but that distinction may of course be relevant for design.
The two .. note::'s under https://data-apis.org/array-api/latest/design_topics/device_support.html#intended-usage have quite a bit to say about this exact question.
I think that a utility can indeed be added to array_api_compat in order to facilitate testing, but it should be clear there that it's separate from the things in the namespaces of implementing array libraries, and the utility will have to be library-specific.
Choosing cupy->numpy seems reasonable for a pragmatic testing utility, but it can't be part of the standard I'd say.
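For illustration, a minimal sketch of what such a library-specific helper could look like (the function name and the dispatch on module name are hypothetical, not part of array_api_compat or the standard):

import numpy as np

def to_numpy_for_testing(x):
    # Return a host NumPy array so results can be compared with numpy.testing.assert_allclose.
    mod = type(x).__module__.partition(".")[0]
    if mod == "cupy":
        return x.get()                   # CuPy: explicit device-to-host copy
    if mod == "torch":
        return x.detach().cpu().numpy()  # PyTorch: move to CPU, then view as NumPy
    return np.asarray(x)                 # NumPy-like arrays already on host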
This issue came up during the most recent dataframe workgroup meeting (2023-05-23). cc @kkraus14
By the way, it occurred to me that we should probably remove 'cpu' from numpy.array_api, since it might give people the impression that it is portable.
Comment from @fcharras, copied from https://github.com/data-apis/array-api/issues/710#issuecomment-1845425319:
Adding to @betatim's remark:

This issue made me wonder about converting from one namespace to another. Say from PyTorch to NumPy. This works:

x = array_api_compat.torch.asarray([1, 2, 3])
array_api_compat.numpy.asarray(x)  # -> array([1, 2, 3])

The reason I was thinking about this was that it would be nice to have a consistent way of converting things. Of course, there is no asarray for normal Python, so this is more of a thought experiment.

and to @oleksandr-pavlyk's remark:

We did discuss a possibility to standardize bringing data from any array object to Python. It would make sense to have a function that would transfer the content of an array into another type that exposes the Python buffer protocol. From here the content could be converted to NumPy, or passed to xp.asarray in another library.

should we use a separate issue that covers inter-namespace conversion specifically rather than tolist?
I want to emphasize the use case I have when trying to adapt code for Array API compliance. There is some code that can't compromise on numerical accuracy and absolutely requires at least float64 precision, but I could use an integrated GPU that supports at most float32 (e.g. using the mps or xpu backends with pytorch) for everything else. For this I would have to transfer data from device to CPU, run the float64 compute, and transfer back to device. But .to_device("cpu") is not part of the standard and some array libraries might not support it (like cupy arrays), so I can't rely on it. And from_dlpack does not support inter-device conversion, so it's not appropriate either.
For this use case, an intermediate object that enables inter-device and inter-namespace conversion would surely be practical.
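For concreteness, here is roughly what that round trip costs in library-specific code today (a PyTorch-only sketch; the mps device and the cumsum step are just illustrative):

import torch

x = torch.linspace(0, 1, 1000, dtype=torch.float32, device="mps")  # device without float64 support

x_cpu = x.cpu().to(torch.float64)        # transfer to host and upcast
y_cpu = torch.cumsum(x_cpu, dim=0)       # accuracy-critical step, must run in float64
y = y_cpu.to(torch.float32).to("mps")    # downcast and transfer back to the device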
tolist has been mentioned, but conversion to and from numpy is also commonly supported:

- torch: Tensor.numpy and torch.from_numpy
- cupy: cupy.asnumpy, and cupy.asarray works with numpy arrays
- jax: jax.numpy.array, and np.asarray works with jax arrays
- dask: from_array supports numpy inputs, and np.asarray works with dask arrays
- tensorflow: Tensor.numpy and tf.convert_to_tensor()
- mxnet: NDArray.asnumpy, and array supports numpy inputs
- dpctl: asnumpy, and dpctl.tensor.asarray supports numpy inputs

Wouldn't it be practical to add to the Array API a conversion to numpy, e.g. to_numpy or asnumpy? (from_numpy doesn't seem as necessary since asarray or from_dlpack commonly already works with numpy inputs)
Wouldn't it be practical to add to the Array API a conversion to numpy, e.g. to_numpy or asnumpy?
There was some discussion about this (I think on the SciPy repo?) and the conclusion was that the standard is supposed to be completely agnostic - requiring a library to know about and interact with NumPy places NumPy on a pedestal which doesn't make sense for an agnostic standard. Then it would just be _"why not to_torch etc."_, no?
Then it would just be _"why not to_torch etc."_, no?
to_numpy makes much more sense than to_torch by far though (commonly installed in a lot of environments, ease of installation (default channels), cpu only, staple of the ecosystem, reasonably lightweight...) so I don't think this question would come up, in the same way that (I think) the other libraries (cupy, jax, mxnet, etc) do not have a to_torch-like feature although they do have a to_numpy-like feature. So IMO it makes sense to consider that numpy is an already commonly adopted tool for a portable data container on cpu, and overlook for a moment that it is also an array library.
What we need is a standard way to create an Array instance of lib1 (e.g. numpy) from an Array instance of lib2 (e.g. cupy), possibly with cross-device transfers, even if lib1 only works on host and lib2 only works on device.
Special casing numpy would be fine for us (scikit-learn) but I understand that others would like a more library neutral solution to this problem.
Let me reiterate my proposal of to_memoryview, which would copy the content of an array into an object that implements the Python buffer protocol and create a memoryview from it. A NumPy array could be created from the memoryview with zero copy.
This way not a single implementation is singled out. The only drawback: every library must implement one. It could be syntactic sugar around the to_numpy present in every library, as pointed out by @fcharras
to_memoryview is a first start, but ideally we would also want at least strides and dtype information, and even a standard public API to convert from one namespace to another.
@ogrisel strides / dtype are all accessible via buffer protocol / memoryview.
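A quick demonstration of that point, using a NumPy-produced buffer purely for illustration (under the proposal each library would produce its own buffer-protocol object):

import numpy as np

src = np.arange(12, dtype=np.float32).reshape(3, 4)
mv = memoryview(src)          # carries format ('f'), shape (3, 4) and strides
dst = np.asarray(mv)          # zero-copy: dtype and shape are reconstructed from the buffer
assert dst.shape == (3, 4) and dst.dtype == np.float32
assert np.shares_memory(src, dst)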
I am confused why we are discussing this, though, when DLPack is basically a superset of the buffer protocol. They do the same thing, we just didn't allow DLPack to make a copy before (see a recent user question at https://github.com/dmlc/dlpack/issues/132).
Since xp.from_dlpack() is considered an array constructor, I feel this seems to be the easiest solution:

- add a copy keyword (default to False, the status quo) to from_dlpack(), and
- add a dl_device keyword (default to None, the status quo) to __dlpack__()

Specifically, this allows the following usage:
import numpy as np
import cupy as cp
a_cp = cp.empty((10, 20), dtype=cp.float32)
b_np = np.from_dlpack(a_cp, copy=True)
and similarly
a_np = np.empty((10, 20), dtype=np.float32)
b_cp = cp.from_dlpack(a_np, copy=True)
The reason we need copy/dl_device is to hint the producer that we need to cross the library boundary, and so it needs to help us perform necessary actions (i.e., copy) before the consumer can take over. This is needed because a consumer like NumPy is ignorant about GPUs or accelerator devices, and only an accelerator library like dpctl or CuPy can help facilitate this conversion.
def from_dlpack(x, *, copy=False):
    dev_type, dev_id = x.__dlpack_device__()
    if not copy:
        # the existing code path
        if dev_type not in what_I_expected:
            raise BufferError(...)
        else:
            # zero-copy conversion
            capsule = x.__dlpack__(stream=my_stream(dev_id))
            # wrap capsule and return a consumer array
            ...
    else:
        # this proposal
        if dev_type not in what_I_expected:
            capsule = x.__dlpack__(stream=my_stream(dev_id), dl_device=my_dev_type)  # e.g. dl_device=kDLCPU
            # wrap capsule and return a consumer array
            ...
        else:
            # I think we should just raise here, as no copy is needed and this is a perf hit
            raise BufferError(...)
This hint can then be taken by the producer to target the right device and make a copy. If the consumer dl_device is not recognized (ex: we pass kDLMetal to CuPy), the producer raises an error as usual.
This proposal also clarifies whether from_dlpack() can make a copy.
(If we don't like the names copy/dl_device, we can discuss. The "hint" concept is what's essential here.)
I am slightly in disfavor of adding to_numpy()/asnumpy() to the standard. I feel it's a setback to the very nice story that we presented at SciPy about "NumPy is just another array library, it enjoys no special status."
I like the proposal very much; my first thought about from_dlpack before trying to use it was actually that it already filled this need (I thought that the from_dlpack spec hints at that with this note:
The returned array may be either a copy or a view. See Data interchange mechanisms for details.
which does already suggest that it can make a copy. Then I became confused about the usage scope that is really intended).
Minor suggestion: copy=True seems to say "force a copy regardless of whether it is necessary". For xp.asarray, copy can take the values True (force a copy), False (don't copy, raise an error if impossible), and None (don't copy, except if it's necessary). Could the same logic apply here?
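For reference, this is how the three values behave for np.asarray (NumPy >= 2.0 spells them out explicitly; shown only to illustrate the suggested semantics):

import numpy as np

x = np.arange(3)
a = np.asarray(x, copy=True)    # force a copy
b = np.asarray(x, copy=None)    # copy only if needed; here it is a no-copy pass-through
assert not np.shares_memory(x, a)
assert np.shares_memory(x, b)
# np.asarray(x, dtype=np.float64, copy=False) would raise, since a copy is unavoidable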
capsule = x.__dlpack__(stream=my_stream(dev_id), dl_device=my_dev_type)
But what if the source namespace (the library for x) does not understand how to handle my_dev_type? Will it crash?
What would be the public Array API to go through the host as an intermediate step in cases where direct device to device transfer is not possible?
The to_memoryview solution might be a good way to address the original case in the description of this issue, which is about defining a standard way to compare the results of two array API libraries in tests by moving everything to the host and using numpy.testing.assert_allclose or similar to do the comparison.
I like @leofang's proposal. The key missing bit though is a cross-library way to identify devices. Leo proposes dl_device, meaning a simple integer/enum (see https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h#L74). This does not yet connect to the .device attribute and existing device objects in libraries. And it's hard to understand for users. I think to get this right, we have to be able to glue this together.
See also the device syntax for torch.device and jax.Device, which are already misaligned. And I'm guessing the SYCL version of that is different again?
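To make the misalignment concrete (each line uses that library's own public API; the objects are not interchangeable):

import torch
import jax

dev_t = torch.device("cuda", 0)   # torch: a lightweight object built from a type string plus index
dev_j = jax.devices()[0]          # jax: Device instances are obtained by querying the backend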
Maybe a way out is for the device object of each library to gain the same __dlpack_device__ attribute that we came up with for arrays? That way the signature could be
from_dlpack(x, /, *, device=None)
# or with a copy keyword (seems less urgent, the default now is copy=None)
from_dlpack(x, /, *, copy=None, device=None)
and this would allow users to write things like
x2 = from_dlpack(y_lib2, device=x1.device)
There's still no way to write device='cpu' then, but that isn't a hard necessity probably and would be harder to add (at least, it would take more API surface).
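A sketch of what that could look like on the library side (the class is illustrative; only the (device_type, device_id) tuple mirrors what arrays already return from __dlpack_device__):

kDLCPU = 1  # DLPack device-type enum value for CPU

class Device:
    # Hypothetical library device object gaining the array-style protocol.
    def __init__(self, dl_device_type=kDLCPU, index=0):
        self.dl_device_type = dl_device_type
        self.index = index

    def __dlpack_device__(self):
        # same (device_type, device_id) tuple that arrays return today,
        # so from_dlpack(x, device=some_array.device) can interpret it
        return (self.dl_device_type, self.index)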
@rgommers Note that it's __dlpack__() accepting dl_device, not from_dlpack(). So, this is intended to be managed by the libraries, not by users (it's not user visible, by design).
But, you're right that since from_dlpack() is an array constructor, it's best to accept a device keyword. Then it's the consumer library's job to translate the user-provided device to a dl_device value that it supports.
For NumPy (or any CPU library), device can be None if CPU is the default, or something like "cpu" that the library already supports (ex: PyTorch).
If device is added, then it can take over the role of copy in my proposal, and whether to add copy or not is orthogonal to the present mission. Personally, though, I am in favor of adding it. The quote pointed out by @fcharras above
The returned array may be either a copy or a view. See Data interchange mechanisms for details.
is not a very satisfying semantics IMHO. It's better if we can be sure whether the returned array is a fresh copy or a view, and without a copy keyword it's not possible to disambiguate. We can fork this issue to discuss further.
capsule = x.__dlpack__(stream=my_stream(dev_id), dl_device=my_dev_type)
But what if the source namespace (the library for x) does not understand how to handle my_dev_type? Will it crash?
No, nothing will crash. As I already noted, the producer should raise BufferError if it doesn't know how to handle the consumer's dl_device.
A BufferError can already be raised before this discussion by the consumer (when checking the input array's __dlpack_device__ and not understanding the returned dl_device from the producer). Now we make it symmetric.
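For context, the consumer-side check that exists today looks roughly like this (a sketch; the set of supported device types is consumer-specific):

SUPPORTED_DL_DEVICE_TYPES = {1}   # e.g. a CPU-only consumer; kDLCPU == 1

def check_producer_device(x):
    dev_type, dev_id = x.__dlpack_device__()
    if dev_type not in SUPPORTED_DL_DEVICE_TYPES:
        raise BufferError(f"cannot import data on DLPack device type {dev_type}")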
What would be the public Array API to go through the host as an intermediate step in cases where direct device to device transfer is not possible?
I think with Ralf's addendum (adding device to from_dlpack()) this is not needed anymore? A user can/should specify a proper device in this case.
But in case it's not addressed, let's consider a hypothetical, extreme case, where the producer holds data on an NVIDIA GPU, and the consumer only supports Intel GPU (for the record I am not aware of such cases 🙂). What you're asking is a double copy (NV -> CPU -> Intel). I feel this should be fully controlled by the user. I would be very reluctant to act smart here, especially when all accelerator libraries try very hard to avoid unnecessary copies.
Note that it's dlpack() accepting dl_device, not from_dlpack(). So, this is intended to be managed by the libraries, not by users (it's not user visible, by design).
Ah yes, I think this is converging - should work like this.
As I already noted, the producer should raise BufferError if it doesn't know how to handle the consumer's dl_device.
Not sure about this one - I think it should export the data on the current device instead (i.e., no change from today - leave the raising to the consumer). Think about a situation where CuPy (CUDA-only) sees a CPU array and decides to ask the producer for dl_device=CUDA. If the producer is NumPy, it cannot do that. But if NumPy exports CPU memory, CuPy is still able to consume that (it can do the device transfer on its side). There shouldn't be a need to raise, if dl_device is considered a hint like you said.
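That consumer-side transfer is already routine today; for example, CuPy can pull host data onto the GPU itself (shown with CuPy's own asarray rather than DLPack, purely for illustration):

import numpy as np
import cupy as cp

a_np = np.arange(10, dtype=np.float32)
a_cp = cp.asarray(a_np)   # host-to-device copy performed by the consumer (CuPy)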
And +1 to not allowing double transfers, that'd be taking it a bit too far.
Not sure about this one - I think it should export the data on the current device instead (i.e., no change from today - leave the raising to the consumer). Think about a situation where CuPy (CUDA-only) sees a CPU array and decided to ask the producer for dl_device=CUDA. If the producer is NumPy, it cannot do that. But if NumPy exports CPU memory, CuPy is still able to consume that (it can do the device transfer on its side). There shouldn't be a need to raise, if dl_device is considered as a hint like you said.
Yes, this afternoon I was staring at the snippet in my earlier comment
a_np = np.empty((10, 20), dtype=np.float32)
b_cp = cp.from_dlpack(a_np, copy=True)
and realized this loophole too. Good catch, @rgommers! I blame early mornings...🙂
API-wise, I think nothing that we discussed so far needs to change. Semantics-wise, we just need a minor correction to help with this case. Instead of
to hint the producer that we need to cross the library boundary
it should be "to hint the producer and consumer that we need to cross the library boundary."
Whoever is capable of handling it should take care of it, but since the producer usually is the one doing most of the work (it takes the stream and now dl_device arguments from the consumer and returns a capsule), it should give it a shot first. If the producer fails, then the consumer takes over and retries. We raise BufferError only if both fail.
Happy to see we're converging! Let us summarize the behavior of from_dlpack() based on the above discussion. (No pseudo code this time, as the logic becomes a bit convoluted. It's easier if I just use bullet points to illustrate.)

For def from_dlpack(x, *, device=None):

1. device is None: Consumer uses its default/current device. (status quo)
2. device is not None: Use device to determine Consumer device type/ID. (new)
3. Producer device type != Consumer device type:
   a. Producer tries to perform copy, if not working then raise BufferError, then (new)
   b. Consumer catches BufferError, and tries to perform copy, if not working then (new)
   c. raise BufferError to user (new)

Note that for Step 2, while it's handy to have a __dlpack_device__() method defined for the device instance, it is not required, because it's supposed to be a Consumer device (or something that it recognizes), and Consumer knows how to retrieve the device type/ID from it.
As a thought exercise, let's also see how adding copy affects the logic. For def from_dlpack(x, *, device=None, copy=False):

1. device is None: Consumer uses its default/current device. (status quo)
2. device is not None: Use device to determine Consumer device type/ID. (new)
3. copy is False, Producer device type == Consumer device type, Producer device ID == Consumer device ID:
   a. zero-copy conversion as today (status quo)
4. copy is False, Producer device type == Consumer device type, Producer device ID != Consumer device ID:
   a. always raise (~new?)
5. copy is False, Producer device type != Consumer device type:
   a. always raise (status quo)
6. copy is True, Producer device type == Consumer device type, Producer device ID == Consumer device ID:
   a. make a copy (new)
7. copy is True, Producer device type == Consumer device type, Producer device ID != Consumer device ID:
   a. make a copy (new)
8. copy is True, Producer device type != Consumer device type:
   a. Producer tries to perform copy, if not working then raise BufferError, then (new)
   b. Consumer catches BufferError, and tries to perform copy, if not working then (new)
   c. raise BufferError to user (new)

This is partly why I feel @fcharras's suggestion https://github.com/data-apis/array-api/issues/626#issuecomment-1852333058 for allowing more than two possible copy values is a bit challenging to maintain: the logic gets quite complicated.
If everyone is happy, I can create a PR later this week or next.
I would be happy with both of those. I slightly prefer the version with copy. Regarding copy=None, now wouldn't it simply amount to wrapping the call with this logic in a try/except block at the python level:
try:
    array_out = from_dlpack(array_in, device=device, copy=False)
except BufferError:
    array_out = from_dlpack(array_in, device=device, copy=True)
There are two ways. Either a try/except to have "if needed" copy logic, or allow copy=None/if_needed with requests, but add a flag for the producer to indicate that the returned array is a copy (which still can mean that with copy=True it must be a copy and be indicated as such).
To me, the only "device" request anyone should (maybe actually must, since I doubt it is a big burden) support is plain CPU. Beyond that, I don't think there is a point (i.e. up to the implementor, if they want to support moving between GPU and odd hardware, that is their choice).
Another question is whether there is a need/gain for such a request to support being a set, so that I can say that any of CPU, cuda_managed, ... is OK (all CPU available).
The alternative would be to be able to query which exports are supported.
Unfortunately, all of this is bloating the spec. It now has multiple request APIs and queries in addition to the original exchange; something it originally did not have and did not want to have.
Indeed, at the moment I do not see a use case for direct, heterogeneous device to device transfers without going through the host either. So, if specializing the particular case of "transfer from device to host" can make it simpler to implement and adopt then personally I am fine with this.
3. Producer device type != Consumer device type: a. Producer tries to perform copy, if not working then raise BufferError, then (new) b. Consumer catches BufferError, and tries to perform copy, if not working then (new) c. raise BufferError to user (new)
This does not look quite right to me. If it's a hint and the producer cannot satisfy the hint, it should simply export the data on its current device (just like now) - no need to raise an exception or have any extra logic here. And step (c) here is already the status quo.
As a thought exercise, let's also see how adding copy affects the logic. For def from_dlpack(x, *, device=None, copy=False):
That seems like a bit too much. copy should be reserved for its current meaning I think, namely a memory copy on the same device. Data transfer between devices is much more expensive typically, and so far we have treated that as an orthogonal concept.
To me, the only "device" request anyone should (maybe actually must, since I doubt it is a big burden) support is plain CPU. Beyond that, I don't think there is a point (i.e. up to the implementor, if they want to support moving between GPU and odd-hardware, that is their choice).
It's actually also helpful for safety when using stream=, because right now the producer must assume that the consumer did everything correctly and that the stream number is correct for the device that the data is already on. Receiving both stream number and device type is more robust.
Regarding copy=None, now wouldn't it simply amount to wrapping the call with this logic in a try/except block at the python level
@fcharras See also #721. Maybe let's move the copy discussion there?
- Producer device type != Consumer device type: a. Producer tries to perform copy, if not working then raise BufferError, then (new) b. Consumer catches BufferError, and tries to perform copy, if not working then (new) c. raise BufferError to user (new)
This does not look quite right to me. If it's a hint and producer cannot satisfy the hint, it should simply export the data on its current device (just like now) - no need to raise an exception or have any extra logic here. And step (c) here is already the status quo.
Sorry, @rgommers, this is a limitation of not writing actual code to deliver the concept...
What I meant by "tries to perform copy" for step b is: Consumer asks Producer to again return a capsule, but this time without passing its dl_device to __dlpack__(). If you think about it, this is the only way to achieve what you said ("simply export the data just like now"), right? Without getting a capsule from Producer, Consumer can't do anything. Catching an exception from Producer is merely a way to signal Consumer to take action.
In terms of code:
try:
    capsule = x.__dlpack__(stream=stream, dl_device=dl_device)
except BufferError:
    # this should work as it does today
    # also, since Producer can't handle dl_device, we may not have a valid stream here?
    capsule = x.__dlpack__(stream=stream)
Unfortunately, all of this is bloating the spec. It now has multiple request APIs and queries in addition to the original exchange; something it originally did not have and did not want to have.
@seberg yes and no.
So I wouldn't say the spec is bloated, it's a gap being filled 😄
That seems like a bit too much. copy should be reserved for its current meaning I think, namely a memory copy on the same device. Data transfer between devices is much more expensive typically, and so far we have treated that as an orthogonal concept.
@rgommers, as I said, it was just an exercise. But I disagree that copy is only for the same device. Recall the position that from_dlpack() is an array creation function, so in terms of semantics for device and copy it should stay the same as those of asarray(). Not honoring that is more confusing IMHO.
It's actually also helpful for safety when using stream=, because right now the producer must assume that the consumer did everything correctly and that the stream number is correct for the device that the data is already on. Receiving both stream number and device type is more robust.
Agreed. Even if zero-copy is not feasible and Producer has to perform a copy, we want it to do it asynchronously if possible, and to do so we need a stream/queue.
I still think that there should/must(?) be a flag to indicate if the return is a copy or not. I also think it would make sense to have a nocopy "request" API since asarray at least has the ambition to have that. (nocopy, since apparently copies were never strictly forbidden.)
I still think that there should/must(?) be a flag to indicate if the return is a copy or not.
Why? It semantically does not matter, it is only a performance issue. Remember there's no concept of a view.
@leofang's and @rgommers' proposal to add a device keyword to from_dlpack also opens up the possibility of interoperability between frameworks, e.g. a oneAPI implementation may import data with kDLCUDA, provided the appropriate plugin is available.
This would be the case where oneAPI can deal with the capsule as originally given by the producer. For CuPy to import a oneAPI capsule, DPC++ calls must be made to unbox SYCL objects into native objects, and hence CuPy would need to ask the oneAPI library to give a kDLCUDA DLPack capsule after recognizing that it cannot deal with the kDLOneAPI one directly. Provided that succeeds, it would still be a zero-copy transfer.
We should give the user control over where it is OK to perform data transfer via host (is that the copy keyword?). It is better to be explicit here. We also need to realize that the copy via host may not be readily available across implementations for some time.
Why? It semantically does not matter, it is only a performance issue.
Because we have copy=True/False/None in asarray() and without that you cannot mirror it for from_dlpack(). At least not without defensive copies. And to me it seems like a very reasonable request to add such a kwarg.
So even if pure array-api use may never care, other users (which may not be classical array libraries at all) may care more.
That's a good argument I think.
Because we have copy=True/False/None in asarray() and without that you cannot mirror it for from_dlpack(). At least not without defensive copies. And to me it seems like a very reasonable request to add such a kwarg.
Ah wait, I think I misread your comment about a "flag". If you meant "+1 to adding a copy keyword", then sure. What I understood was a request to let the producer pass back a new flag to say if a copy was actually made if the consumer asked for None (as-needed).
No, I didn't just mean +1 for the copy kwarg. If we add a nocopy "request" (which may be encoded as a copy kwarg on __dlpack__) then I additionally want a new flag (additional to readonly, etc.) that encodes copy or not (or maybe it should be view or not?).
It is just such a trivial addition, and in NumPy we are annoyed about missing it: with asarray(copy=True/False/None) one has to be very defensive in both erroring and doing additional copies.
E.g. also think of out= kwargs, which must know that what they got is a view! Thus, I want to make sure this flag isn't forgotten when the rest of this is added.
Some discussion today about whether it is useful to add a device ID in the hint, rather than only a device type. Wasn't entirely clear whether there was a strong enough use case - something that would need a bit more thought perhaps.
then I additionally want a new flag (additional to readonly, etc.) that encodes copy or not (or maybe it should be view or not?).
Okay - seems like that would be a DLPack C level change first, right next to the readonly flag.
Since Leo asked if we can push forward with this, I created: https://github.com/dmlc/dlpack/issues/134. I would strongly prefer pushing both hand-in-hand, because otherwise things tend to get lost and bite us later. But it is a trivial change in DLPack now that we have flags.
Since Leo asked if we can push forward with this, I created: dmlc/dlpack#134. I would strongly prefer pushing both hand-in-hand, because otherwise things tend to get lost and bite us later. But it is a trivial change in DLPack now that we have flags.
Yes, I agreed with the assessment. Created #741 and https://github.com/dmlc/dlpack/pull/136 to address this issue.
This came up in https://github.com/data-apis/array-api-compat/pull/40.
My particular use case is something like this:
x.to_device("cpu") to allow downstream library test suites to guarantee that an array is returned to host (or no-op if already there) for usage in e.g. assert_allclose(x, expected), where expected is a NumPy array (always on host) but x could be on some other device depending on the array lib/backend.
I think specification to that level of detail is left to the libs at the moment. CuPy doesn't even have to_device() at the moment, let alone "cpu", instead requiring usage of e.g. arr.get() to return to host; but even if they did, I'm not so sure the current array API spec would mandate that they need to allow "cpu" as a valid argument.
Maybe the decision is that we can't really require that level of detail, in which case I can at least point to this issue when justifying shims to get around it when testing. If it were possible to mandate a common call signature for returning to host someday, hopefully the benefit of that would be fairly obvious re: testing multiple backends.