[RFC] Adopt DLPack as cross-language C ABI stable data structure for array exchange

tqchen commented 4 years ago

In order for an ndarray system to interact with a variety of frameworks, a stable in-memory data structure is needed.

DLPack is one such data structure that allows exchange between major frameworks. It is developed with inputs from many deep learning system core developers. Highlights include:

Minimum and stable: simple header
- The spec has stayed roughly unchanged for more than four years.
Designed for cross hardware: CPU, CUDA, OpenCL, Vulkan, ROCm, Hexagon
Already a "standard" with wide community adoption and support, ones that I am aware of:
- Frameworks: tensorflow/jax, pytorch, mxnet
- Libraries: dgl, spaCy etc.
- Compilers: TVM
Clean C ABI compatible
- Means you can create and access it from any language
- It is also essential for building JIT and AOT compilers to support these data types.
High performance consideration
- The data field must be aligned to 256 bytes (for aligned loads); byte_offset can be used to offset into the array if necessary

The main design rationale of DLPack is minimalism. DLPack drops the consideration of allocator, device API and focuses on the minimum data structure, while still considering the need for cross hardware support (e.g. the data field is opaque for platforms that do not support normal addressing).

It also simplifies some of the design to remove legacy issues (e.g. everything is assumed to be row major, and strides can be used to support other cases, which avoids the complexity of considering more layouts).
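To make the "minimum data structure" concrete, here is a rough ctypes mirror of the tensor struct. It is only a sketch based on my reading of dlpack.h at the time of this thread (when the device field was still called ctx); the numeric enum values used below (1 for CPU, 2 for float) are assumptions that should be checked against the header.

```python
import ctypes


class DLContext(ctypes.Structure):
    # device_type is a C enum in the header; device_id selects one device.
    _fields_ = [("device_type", ctypes.c_int),
                ("device_id", ctypes.c_int)]


class DLDataType(ctypes.Structure):
    # Parametric dtype: type code, bits per lane, number of lanes.
    _fields_ = [("code", ctypes.c_uint8),
                ("bits", ctypes.c_uint8),
                ("lanes", ctypes.c_uint16)]


class DLTensor(ctypes.Structure):
    _fields_ = [("data", ctypes.c_void_p),
                ("ctx", DLContext),
                ("ndim", ctypes.c_int),
                ("dtype", DLDataType),
                ("shape", ctypes.POINTER(ctypes.c_int64)),
                ("strides", ctypes.POINTER(ctypes.c_int64)),
                ("byte_offset", ctypes.c_uint64)]


# Describe an existing 2x3 float32 host buffer without copying it.
# buf and shape must outlive the DLTensor, which only stores pointers.
buf = (ctypes.c_float * 6)(*range(6))
shape = (ctypes.c_int64 * 2)(2, 3)

t = DLTensor()
t.data = ctypes.addressof(buf)
t.ctx = DLContext(1, 0)        # assumed: 1 == CPU
t.ndim = 2
t.dtype = DLDataType(2, 32, 1)  # assumed: 2 == float; 32 bits, 1 lane
t.shape = ctypes.cast(shape, ctypes.POINTER(ctypes.c_int64))
t.byte_offset = 0
# strides is left NULL, which means compact row-major layout.

# A consumer reads element [1][1] (flat index 4) straight from the shared buffer.
element = ctypes.cast(t.data, ctypes.POINTER(ctypes.c_float))[4]
```

Note that nothing here allocates or frees device memory: the struct only describes a buffer that something else owns, which is the scope restriction discussed above.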

After building frameworks around the related data structures for a while, and seeing the ecosystem grow around it, I am quite convinced that DLPack should be one important candidate, if not the best one, for the C ABI array data structure.

Given that array exchange is one goal of the consortium, it would be great to see if dlpack can be used as the stable C ABI structure for array exchange.

If the proposal receives a positive response from the community, we would be more than happy to explore options to build a neutral, open governance (e.g. like the apache model) that continues to oversee the evolution of the spec -- for example, donate dlpack to the data-api consortium or host it in a clean github org.

shoyer commented 4 years ago

Thanks for bringing up this suggestion! DLPack does look quite promising for array interoperability.

From the perspective of Python data APIs, one aspect of DLPack that is not clear to me is how to use it at the level of Python/CPython objects, i.e., the equivalent of __array_interface__, __cuda_array_interface__ and/or Python's buffer protocol.

Does DLPack even expose a Python object for wrapping DLPack tensors? It looks like right now JAX and PyTorch just use PyCapsule objects? That's probably fine but worth standardizing.

szha commented 4 years ago

Does DLPack even expose a Python object for wrapping DLPack tensors?

not yet, though it's fairly straightforward to expose one. here's one through ctypes

https://github.com/apache/incubator-mxnet/blob/2610c10701c2b8155dbf094aaecba37ebbf67d0f/python/mxnet/dlpack.py#L63-L81

the equivalent of __array_interface__, __cuda_array_interface__ and/or Python's buffer protocol.

For dlpack, there are two main differences from array interfaces (see https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h#L132-L148):

coordination of writing
the data descriptor for complex data types

I believe the former is intentional so that it's easier to conform. The latter can (and I think should) be extended in dlpack.

tqchen commented 4 years ago

Right now most of the frameworks we know of already conform to the convention used by PyTorch/JAX/TF/TVM (these APIs are in Python). For example:

A framework can export a DLPack object in a PyCapsule
The PyCapsule can be consumed exactly once (think of move semantics in C++ and Rust)
If the PyCapsule is not consumed, the deleter of DLPack will be called during destruction of the PyCapsule
If the PyCapsule is consumed
- the consumer will mark the PyCapsule as "used_dltensor" (This is the current convention used by most frameworks)
- Alternatively, we can also directly change the deleter of the consumed PyCapsule to None
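The consume-once convention in the list above can be sketched with ctypes and the CPython capsule API. This is a toy, assuming CPython: the payload is a plain integer rather than a real DLManagedTensor, no destructor is installed, and the helper names export and consume are made up for illustration. The capsule names "dltensor" and "used_dltensor" follow the convention described above.

```python
import ctypes

api = ctypes.pythonapi
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_IsValid.restype = ctypes.c_int
api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]
api.PyCapsule_SetName.restype = ctypes.c_int
api.PyCapsule_SetName.argtypes = [ctypes.py_object, ctypes.c_char_p]

# A capsule stores the name pointer, not a copy, so these bytes must stay alive.
FRESH = b"dltensor"
USED = b"used_dltensor"

payload = ctypes.c_int64(42)  # stand-in for a DLManagedTensor


def export(obj):
    """Hypothetical exporter: wrap a pointer in a fresh capsule.

    No destructor is passed here (None -> NULL). A real exporter would pass
    one that calls the DLPack deleter if the capsule was never consumed.
    """
    return api.PyCapsule_New(ctypes.addressof(obj), FRESH, None)


def consume(capsule):
    """Hypothetical consumer: take the pointer and mark the capsule as used."""
    if not api.PyCapsule_IsValid(capsule, FRESH):
        raise RuntimeError("not a fresh DLPack capsule (already consumed?)")
    ptr = api.PyCapsule_GetPointer(capsule, FRESH)
    api.PyCapsule_SetName(capsule, USED)  # moved-from marker
    return ptr


capsule = export(payload)
first = consume(capsule)  # succeeds and returns the payload address

try:
    consume(capsule)      # same capsule again
    second_failed = False
except RuntimeError:
    second_failed = True
```

Renaming the capsule is what lets a destructor tell "never consumed, so I must call the deleter" apart from "consumed, so the consumer now owns that call".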

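Example APIs

https://pytorch.org/docs/stable/dlpack.html

https://tvm.apache.org/docs/api/python/ndarray.html?highlight=to_dlpack#tvm.nd.NDArray.to_dlpack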

Complex Number

Thanks @szha for the comment. We could certainly fold the complex data type into DLDataType. However, that might be an interesting topic that could need a bit more discussion.

The main reason is that there are quite a few ways complex numbers can be stored (e.g. array of struct vs struct of array) for performance reasons, and different frameworks/HW might choose different approaches. A more thorough discussion might be necessary.

aregm commented 4 years ago

@tqchen for 1-dim arrays what is the difference between DLPack and Arrow format?

byronyi commented 4 years ago

DLPack drops the consideration of allocator, device API and focuses on the minimum data structure.

I would suggest to integrate DLPack with the stream executor API, including async malloc/dealloc, fine-grained read/write barriers, etc., which is a de-facto standard in high performance training frameworks.

Without proper compute/transfer stream synchronization between frameworks, pretending accessing the array in device memory space is the same as accessing host memory causes either overhead of global barriers or memory inconsistency for DLPack arrays.

tqchen commented 4 years ago

Thanks @byronyi for the comment about async device stream support, this is something that we have thought very carefully about.

This is a design tradeoff in terms of how many parts people want to standardize vs how many parts are left to the frameworks themselves.

Most deep learning frameworks have their own internal mechanism for managing async device computations: for example MXNet has the dependency scheduler, TF-RT has its own scheduler (that relies on its internal future system).

While it is totally possible to introduce a broader scope API standardization by incorporating the stream/executor scheduling, the result is the cost of more standardization and harder adoption by the frameworks -- what if framework A comes up with another graph scheduler that is faster than the vanilla stream executor? (This is totally possible.)

So the rationale is given that the allocator / async scheduler part is a bulk piece that is still evolving, we take a more conservative approach by only standardizing the part we can agree on -- namely the data structures.

This does not prevent frameworks from agreeing on additional conventions during exchange. For example, if PyTorch and TVM use the same CUDA stream during exchange, there is no need for barriers for synchronization. In many cases, agreeing on a default convention is good enough as a compromise -- for example, syncing to the default CUDA stream is usually not a bad choice.

Now, it is certainly possible to introduce additional layers of standardization of allocator, or scheduler on top of DLPack -- since scheduling and data structure are orthogonal. But based on my experience, this part is still in flux and it is relatively harder to get frameworks' agreement.

szha commented 4 years ago

I'd agree with the assessment and we can regard scheduling coordination out of scope for now.

tqchen commented 4 years ago

@aregm wrt Arrow and DLPack: I believe they are designed with different goals in mind.

Based on my understanding, Arrow is a good format for dataframe exchange. The key rationale is to represent the data in a compact in-memory format that is also friendly to common dataframe related processing. From that perspective, the meta-data is defined with considerations including things like support for non-POD data types, variable length encoding, etc.

DLPack focuses more on the computation, and support for more hardware variations(due to the background in deep learning system). As a result there are several key design choices that may not be present in arrow's array. Note that these are all subtle but important design decisions (since the representation of POD-type Array can be as simple as the data pointer plus length). Most of the rationales are documented in the DLPack header file as well, I list some of the choices here:

Besides the data pointer field, there is a byte_offset to represent the offset from the data pointer. This is to accommodate array slicing when the device data pointer is opaque (does not support host side addressing), as is the case for common accelerators, vulkan and opencl.
Instead of having a plain type code that enumerates over the types (e.g. int8, int32, int64, float32), the data type field is parametric (supports bits, type code and lanes), which allows us to represent vector types like int4x2. This is important for representing basic vector types, especially those in the sub-byte category.
- Right now supported base types include 'float', 'int', 'uint', 'bfloat' (bfloat16 for deep learning acceleration)
A context field to represent the device context (including CPU, CUDA, AMDGPU, vulkan, opencl).
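As an illustration of the first two choices, here is a small pure-Python sketch of a parametric dtype and of slicing through byte_offset. The DType class and slice_offset helper are made up for this example, and the type code is a string only for readability (the header uses a small integer code).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DType:
    code: str       # "int", "uint", "float" or "bfloat"
    bits: int       # width of a single lane
    lanes: int = 1  # lanes packed into one element

    def __str__(self):
        base = f"{self.code}{self.bits}"
        return base if self.lanes == 1 else f"{base}x{self.lanes}"

    def nbytes(self, count):
        """Bytes needed for count elements, rounding sub-byte totals up."""
        return (count * self.bits * self.lanes + 7) // 8


def slice_offset(byte_offset, start, dtype):
    """Return the byte_offset of a 1-D slice that begins at element start.

    The opaque data handle is never touched; only the offset changes.
    """
    total_bits = start * dtype.bits * dtype.lanes
    if total_bits % 8:
        raise ValueError("slice does not start on a byte boundary")
    return byte_offset + total_bits // 8


int4x2 = DType("int", 4, 2)
float32 = DType("float", 32)

name = str(int4x2)                    # "int4x2"
packed = int4x2.nbytes(3)             # 3 elements of 8 bits each
offset = slice_offset(0, 4, float32)  # skip 4 float32 values
```

Because the consumer adds byte_offset itself, an exporter can hand out a slice of an OpenCL or Vulkan buffer even though it cannot do pointer arithmetic on the handle.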

rgommers commented 4 years ago

Given that array exchange is one goal of the consortium, it would be great to see if dlpack can be used as the stable C ABI structure for array exchange.

Agreed, this is an interesting topic and fits well with the goals of this consortium. We're starting with Python API standards docs, and I think this would be separate, but makes a lot of sense to treat it in a very similar way.

One of the things DLPack doesn't yet seem to have is docs (except for the README and info in the header) - the content of the conversation in this issue tells me more about purpose, scope, use cases and semantics than what I can find in the DLPack repo.

If the proposal receives a positive response from the community, we would be more than happy to explore options to build a neutral, open governance (e.g. like the apache model) that continues to oversee the evolution of the spec -- for example, donate dlpack to the data-api consortium or host it in a clean github org.

Thanks for mentioning that. It looks to me like the repo with the reference implementation for DLPack is in good hands today, so I wouldn't be in a hurry to move it. If we get consensus on DLPack being standardized, I'd be more inclined to do the docs (including purpose etc. I mentioned above) here, and reference the current repo for implementation.

.... So the rationale is given that the allocator / async scheduler part is a bulk piece that is still evolving, we take a more conservative approach by only standardizing the part we can agree on -- namely the data structures.

This makes perfect sense to me, and is how we approach the Python API standardization as well.

Example APIs

https://pytorch.org/docs/stable/dlpack.html

https://tvm.apache.org/docs/api/python/ndarray.html?highlight=to_dlpack#tvm.nd.NDArray.to_dlpack

I have to say the Python API looks a little awkward to me. Referencing dlpack as a name assumes a level of knowledge from the user that really would be better hidden. Compare with the buffer protocol in Python, which "just works" but is invisible to users - they just call a constructor function like numpy.asarray.

The "consume exactly once" is something that doesn't commonly exist in Python usage right now. I'm thinking of:

In [1]: import numpy as np                                                     

In [2]: import torch                                                           

In [3]: x = np.arange(3)                                                       

In [4]: t = torch.tensor(x)  # copies data                                     

In [5]: t2 = torch.as_tensor(x)  # shares memory                               

In [6]: x[0] = 9                                                               

In [7]: x                                                                      
Out[7]: array([9, 1, 2])

In [8]: t                                                                      
Out[8]: tensor([0, 1, 2])

In [9]: t2                                                                     
Out[9]: tensor([9, 1, 2])

Now we have a third type of construction, which doesn't copy but also doesn't share - instead it consumes. So what comes to mind at the Python level is something like a __dlpack__ method plus a constructor name similar to as_tensor for this behaviour.

tqchen commented 4 years ago

Thanks @rgommers, re docs: agree, the original purpose of dlpack is to specify the C data structure, where most of the rationales are documented in the C header file. On the other hand, it would be useful to write down the python API calling conventions, and provide more docs in that area.

Clarification wrt "consume exactly once": It does not mean that we are moving the memory from numpy to torch. Instead, the convention means that the PyCapsule can only be consumed exactly once. The exporter (that calls to_dlpack) still retains the memory.

To rephrase your example using the to_dlpack/from_dlpack in the PyTorch API convention

import numpy as np                                                     
import torch                                                           

x = np.arange(3)                                                       
capsule = np.to_dlpack(x)  # hypothetical exporter, not an existing NumPy function
# consumes the capsule
t2 = torch.from_dlpack(capsule)

x[0] = 9
print(t2)  # tensor([9, 1, 2])

# The following code throws because capsule is already consumed.
t3 = torch.from_dlpack(capsule)

The way things work is that when the consumer chooses to de-allocate later, it will call into the deleter in the DLManagedTensor. A common implementation of a deleter will then decrease the refcount of the array object.

For example, in order to implement np.to_dlpack, we will call PyIncRef on the numpy object, and put the object pointer into the manager_ctx field. Then the deleter will call into PyDecRef.

The memory will be released only after both x and t2 go out of scope. Notably, one can choose to not consume the capsule at all. In that case, the PyCapsule will call the deleter instead, and there won't be any memory leak.

So to sum up, the above mechanism should be aligned with the example you provide. For example, we could just redirect __dlpack__ to to_dlpack. And call from_dlpack from the as_tensor function.
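The refcounting scheme described above can be tried out without any array library. The following is a toy, assuming CPython (it relies on id() being the object address): Owner stands in for the numpy array and ToyManagedTensor models only the manager_ctx and deleter parts of DLManagedTensor. Both class names are made up for this sketch.

```python
import ctypes
import weakref

incref = ctypes.pythonapi.Py_IncRef
decref = ctypes.pythonapi.Py_DecRef
for fn in (incref, decref):
    fn.argtypes = [ctypes.py_object]
    fn.restype = None


class Owner:
    """Stand-in for the exporting array object."""


class ToyManagedTensor:
    """Toy model of DLManagedTensor: just manager_ctx plus deleter."""

    def __init__(self, owner):
        incref(owner)                 # the export keeps the owner alive
        self.manager_ctx = id(owner)  # raw address, like a void* field
        self.deleted = False

    def deleter(self):
        # Called exactly once by whoever ends up owning the tensor.
        if self.deleted:
            raise RuntimeError("deleter called twice (double free)")
        self.deleted = True
        owner = ctypes.cast(self.manager_ctx, ctypes.py_object).value
        decref(owner)                 # undo the incref from export


x = Owner()
alive = weakref.ref(x)

managed = ToyManagedTensor(x)       # roughly what to_dlpack would build
del x                               # the exporter's own name goes away
still_alive = alive() is not None   # the exported reference keeps it alive

managed.deleter()                   # the consumer is done with the memory
freed = alive() is None
```

The guard in deleter also shows why a single consumer matters: a second call would decrement a reference that no longer exists.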

rgommers commented 4 years ago

Thanks @tqchen, makes sense. Is there a reason then for the consume-once? Maybe it is related to how device support or memory management works?

It's a little confusing, for example this works fine:

import jax
import jax.dlpack

import torch
import torch.utils.dlpack

j = jax.numpy.arange(3)
capsule = jax.dlpack.to_dlpack(j)
t = torch.utils.dlpack.from_dlpack(capsule)

But run the exact same code interactively, and you get a RuntimeError (presumably because the interpreter makes a call to __repr__ or something similar):

In [2]: %paste
import jax
import jax.dlpack

import torch
import torch.utils.dlpack
## -- End pasted text --

In [3]: j = jax.numpy.arange(3)

In [4]: capsule = jax.dlpack.to_dlpack(j)

In [5]: t = torch.utils.dlpack.from_dlpack(j)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-60aa16dd583b> in <module>
----> 1 t = torch.utils.dlpack.from_dlpack(j)

RuntimeError: from_dlpack received an invalid capsule. Note that DLTensor capsules can be consumed only once, so you might have already constructed a tensor from it once.

tqchen commented 4 years ago

The consume once requirement comes from how the memory management is done in the DLPack -- we will need a language agnostic way to signal memory-recycling.

In particular, the DLManagedTensor contains a deleter that allows the consumer to signal that the tensor is no longer needed. Because of the way the signature is designed, we need to make sure that there is a sole consumer of the DLManagedTensor, so that the deleter is only called once when the consumer no longer needs the memory (otherwise it will cause a double free).

Of course, we can also change the signature to include refcounting(e.g. call IncRef when there is a copy) in DLManagedTensor, however, that means additional requirement that not every exporter might support.

Your particular repr code contains a typo t = torch.utils.dlpack.from_dlpack(j) => t = torch.utils.dlpack.from_dlpack(capsule)

rgommers commented 4 years ago

Your particular repr code contains a typo

Oops, sorry for the noise - doing too many things at once.

Of course, we can also change the signature to include refcounting .... however, that means additional requirement that not every exporter might support.

Yes, I'm not trying to suggest changes, just trying to wrap my head around how things work and the Python API. There's view-vs-copy semantics there as well, e.g. if I construct a torch.Tensor from a numpy.ndarray, they share memory and mutating the torch.Tensor affects both (in your example). Doing the same with PyTorch + JAX one can still mutate the torch.Tensor, but that doesn't affect the (immutable) JAX array.

tqchen commented 4 years ago

In the specific case of DLPack, the data content should be mutable (as in the numpy example) from the consumer's PoV. I do not know what is happening in the JAX case; perhaps they generate a copy (to preserve immutability) instead.

szha commented 4 years ago

they share memory and mutating the torch.Tensor affects both (in your example)

This would require coordination in asynchronous setting, and I'm not sure if we'd want to make the explicit requirement that this data exchange solves the coordination on writing to the shared space. Also, regarding view, I think requiring anything beyond a read-only view may be troublesome as it takes extra care to deal with effect in a compiler. It might be better to leave that decision to each framework.

tqchen commented 4 years ago

The read-only view is fine. My take is that a generalization of read-only (move of ownership) also makes sense. In terms of async write, if both use the same stream, the behavior will still be correct. But I agree that it is something that can be defined as per-framework behavior.

rgommers commented 4 years ago

@kkraus14 gave the feedback that for RAPIDS the Python level usage of DLPack has been troublesome to support, due to the semantics of "delete on consumption". And that regular Python refcounting behavior (e.g. like __cuda_array_interface__) is easier to support. @kkraus14 if you have specific issues you can link to here, that would be helpful.

@honnibal was positive about the spaCy/Thinc interop with PyTorch via DLPack on Twitter. @honnibal do you have any more thoughts on this? Anything you would add/change?

kkraus14 commented 4 years ago

I think it's more that there isn't an official Python container / spec anywhere, but everyone has followed suit in using a PyCapsule object and changing its name on use: https://github.com/rapidsai/cudf/blob/branch-0.16/python/cudf/cudf/_lib/dlpack.pyx#L32-L34

Then the deletion behavior is controlled based on the name: https://github.com/rapidsai/cudf/blob/branch-0.16/python/cudf/cudf/_lib/dlpack.pyx#L84-L93

On the other hand for __cuda_array_interface__ everything is just based on Python refcounting and garbage collection. That being said, this does leave issues for when users want to hand the lifetime management down to a C/C++ layer.

tqchen commented 4 years ago

To summarize, the support for deletion via PyCapsule is fine, except that the dependency on the "used_dltensor" name is a bit twisted. The deletion code needs to check that name, as per the Cython code linked by @kkraus14; functionality-wise, however, it works fine.

The way we use PyCapsule itself can also be changed (however that would also require PRs to the frameworks). For example, another possibly cleaner way is to simply consume the capsule and set its deleter to None.

oleksandr-pavlyk commented 4 years ago

It would be very useful to see an OpenCL implementation of DLPack interoperability, specifically the use of DLContext.device_id.

Suppose a SYCL application would like to share data allocated via SYCL Unified Shared Memory. The USM shared memory is bound to a SYCL context, which a receiver needs in order to make sense of the pointer. The only way for the exporter to pass it along in the DLTensor is to agree that DLTensor.data is a pointer to a struct that holds the USM pointer and the associated SYCL context.

Is this going against the grain of intended DLPack usage?

tqchen commented 4 years ago

@oleksandr-pavlyk We will then need to define a common way to refer to a device context. For example, in the case of CUDA, the devices can be simply referred to by numbers. If there is an additional convention that is agreed upon between the applications (e.g. what SYCL context 0 means), then such exchange is possible, as in the case of CUDA.

My understanding is that some level of standardization is necessary. If each application wants to hold its own SYCL context, then such exchange is harder than in the case of CUDA, as it is not very realistic for an application to understand the device context of another application.

oleksandr-pavlyk commented 4 years ago

@tqchen The sycl context is not an int, (see https://developer.codeplay.com/products/computecpp/ce/api-reference/classcl_1_1sycl_1_1context). It may encapsulate a sequence of devices on a common platform (using the same driver). In SYCL data transfer between devices in the same context can be optimized by SYCL runtime to be done directly avoiding the host.

Here is a table comparing CUDA-world entities to SYCL-world ones: https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/migration

In the case of OpenCL, the DLTensor.data is documented to point to cl_mem object which encapsulates the OpenCL context.

One way of using DLPack to share data referenced by USM pointers is for the receiver and the exporter to agree that DLTensor.data will point to a struct with two void* members, one being the USM pointer, the other being a reference to the cl::sycl::context.
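For concreteness, the two-pointer struct suggested here could look like the ctypes sketch below. The struct name UsmHandle, its field names and the addresses are all hypothetical; nothing like this is defined by DLPack or SYCL, it would be a private agreement between exporter and receiver.

```python
import ctypes


class UsmHandle(ctypes.Structure):
    """Hypothetical payload that DLTensor.data would point to."""
    _fields_ = [("usm_ptr", ctypes.c_void_p),       # the USM allocation
                ("sycl_context", ctypes.c_void_p)]  # the owning SYCL context


# Exporter side: fill the struct (fake addresses) and publish its address.
handle = UsmHandle(usm_ptr=0x1000, sycl_context=0x2000)
data_field = ctypes.addressof(handle)  # value stored in DLTensor.data

# Receiver side: reinterpret DLTensor.data as the agreed struct.
received = ctypes.cast(data_field, ctypes.POINTER(UsmHandle)).contents
```

A receiver that does not know about this agreement would treat data as the buffer itself, so the convention has to be negotiated out of band.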

tqchen commented 4 years ago

Thanks @oleksandr-pavlyk . I understand your proposal(and I made that remark in the last comment) and how SYCL works.

However, as in my last comment, passing cl::sycl::context around would require the consumer to make use of the sycl context being passed from another application.

From the application developer's PoV, such additional flexibility on the data structure side can increase the overhead of application development (I am speaking from my past experience developing deep learning systems), since most applications would like to manage their own context, and may not be ready to directly use a context passed in externally (e.g. due to the need for synchronization with other internal data under an internal context).

So in this case a programming model like CUDA's is still desirable: SYCL or the applications could agree on a set of contexts (e.g. put them in a table) beforehand, and use integers to refer to these contexts. Of course there is no standardization around this area yet.

honnibal commented 4 years ago

@honnibal was positive about the spaCy/Thinc interop with PyTorch via DLPack on Twitter. @honnibal do you have any more thoughts on this? Anything you would add/change?

To flesh out a little what we're doing:

We use DLPack to exchange arrays between CuPy and PyTorch, which is allowing us to backprop through layers implemented in different frameworks. We're also using DLPack inside a custom allocator for CuPy. Instead of having CuPy ask for memory directly from the device, the allocator gets memory from PyTorch, and passes it to CuPy via DLPack. This prevents memory contention between the libraries. I haven't tested the MXNet integration very heavily, but we expect the MXNet interoperation to work the same. We've been eagerly awaiting TensorFlow support for DLPack. Heck, we'd settle for even a way to get a buffer with a device copy. Currently we can't communicate with TensorFlow without copying data via the CPU, which I find quite unbelievable.

So far things are working fine. However, I understand that the DLPack standard may introduce complexities that I'm not seeing, as I'm relying on other people's implementations. We would have no problem adopting a different standard instead.

rgommers commented 4 years ago

Thanks @honnibal, that's very useful detail.

Currently we can't communicate with TensorFlow without copying data via the CPU, which I find quite unbelievable.

https://www.tensorflow.org/api_docs/python/tf/experimental/dlpack just landed it seems? The TensorFlow devs participating in this Consortium seem also in favour of standardizing on DLPack.

Tried with a few NumPy devs too, they will need some more docs and context. The main pain point is probably complex number support. With CuPy, JAX, TensorFlow and PyTorch all supporting or in the middle of implementing such support, it seems essential to at least have agreement on it being added to DLPack in the future. Only MXNet doesn't have it as far as I can tell (it has "complex (for FFT)" on its 2.0 roadmap though).

tqchen commented 4 years ago

I personally also agree that complex should be part of the DLPack in the future and would love to see possible thoughts. The main question though is whether we want to standardize around array of struct vs struct of arrays.

array of struct is closer to other data types.
struct of arrays might be better for vectorization purposes.
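The two candidate layouts can be pictured with plain double-precision buffers from the standard library. This sketch only illustrates the memory order and does not propose an encoding for DLDataType.

```python
from array import array

values = [1 + 2j, 3 + 4j, 5 + 6j]

# Array of structs: (real, imag) pairs interleaved in a single buffer.
aos = array("d", [part for z in values for part in (z.real, z.imag)])

# Struct of arrays: one contiguous buffer per component.
soa_real = array("d", [z.real for z in values])
soa_imag = array("d", [z.imag for z in values])


def get_aos(i):
    """Element i of the interleaved layout."""
    return complex(aos[2 * i], aos[2 * i + 1])


def get_soa(i):
    """Element i of the planar layout."""
    return complex(soa_real[i], soa_imag[i])
```

The interleaved form needs one data pointer and behaves like any other fixed-size element, while the planar form needs two buffers (or an agreed offset between them), which is part of why the choice affects the struct and not only the dtype.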

rgommers commented 4 years ago

Getting back to this after a bit too much delay to get the first draft of the full standard published for review. Overall most array/tensor library maintainers seemed to be enthusiastic about standardizing on DLPack (perhaps with some tweaks depending on detailed specification of ownership semantics). Here is the relevant section of the API standard doc: https://data-apis.github.io/array-api/latest/design_topics/data_interchange.html

We've been eagerly awaiting TensorFlow support for DLPack. Heck, we'd settle for even a way to get a buffer with a device copy. Currently we can't communicate with TensorFlow without copying data via the CPU, which I find quite unbelievable.

This has landed in tf.experimental earlier this year:

import tensorflow as tf

# Roundtrip
# ---------
x = tf.range(3)
capsule = tf.experimental.dlpack.to_dlpack(x)
x2 = tf.experimental.dlpack.from_dlpack(capsule)

# TensorFlow - PyTorch interop
# ----------------------------
import tensorflow as tf
import torch.utils.dlpack

x = tf.range(3)
capsule = tf.experimental.dlpack.to_dlpack(x)
x2 = torch.utils.dlpack.from_dlpack(capsule)

x2 += 1
assert x2[2] == 3  # sanity check we got the data

capsule2 = torch.utils.dlpack.to_dlpack(x2)
x3 = tf.experimental.dlpack.from_dlpack(capsule2)

assert x3[2] == 3

rgommers commented 4 years ago

This is what the Python API for each library that supports DLPack looks like:

# Single library round-trip
# -------------------------

# JAX
import jax
import jax.dlpack

x = jax.numpy.arange(3)
# Note: take_ownership=False (default) requires jaxlib 0.1.57, released 11 Nov 2020
#       this is a mode where the user guarantees not to mutate the buffer
#       see https://github.com/google/jax/issues/4636
capsule = jax.dlpack.to_dlpack(x, take_ownership=True)
x2 = jax.dlpack.from_dlpack(capsule)

# PyTorch
import torch
import torch.utils.dlpack

x = torch.arange(3)
capsule = torch.utils.dlpack.to_dlpack(x)
x2 = torch.utils.dlpack.from_dlpack(capsule)

# CuPy
import cupy as cp

x = cp.arange(3)
capsule = x.toDlpack()
x2 = cp.fromDlpack(capsule)

# TensorFlow
import tensorflow as tf

x = tf.range(3)
capsule = tf.experimental.dlpack.to_dlpack(x)
x2 = tf.experimental.dlpack.from_dlpack(capsule)

# MXNet
import mxnet

x = mxnet.nd.arange(3)
# MXNet also has to_dlpack_for_write(), with identical docs (?)
# Looks like the same idea as JAX: keep ownership if _for_read(),
#                                  consume if _for_write().
capsule = x.to_dlpack_for_read()
x2 = mxnet.nd.from_dlpack(capsule)

The interesting split seems to be that JAX and MXNet provide two export mechanisms: one which keeps ownership, one which transfers it. The JAX "keep ownership" implementation is new in the version of jaxlib released yesterday.

CuPy, PyTorch and TensorFlow just have the default semantics that @tqchen described in this issue it looks like: the capsule consumer calls the deleter.

EDIT: copying the relevant comment from the JAX issue: This is by design: because DLPack has no concept of a read-only tensor, so for safety, converting a buffer to DLPack consumes it. We could probably add a mode where you promise faithfully that you will not touch the buffer and then exporting it to dlpack would not necessarily need to consume it.

rgommers commented 4 years ago

Links to all the implementations for reference:

DLPack itself: https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h
CuPy: https://github.com/cupy/cupy/blob/master/cupy/core/dlpack.pyx
JAX: https://github.com/google/jax/blob/master/jax/_src/dlpack.py
TensorFlow: Python code, C++ code, DLPack in XLA
PyTorch: Python code, C++ code
MXNet: Python code, C++ code
TVM: Python code, Cython code, C++ code

rgommers commented 4 years ago

The JAX take_ownership thing is interesting, its default of invalidating its own buffer on capsule export for safety makes sense (given immutable data structure), but one would expect the same to apply for TensorFlow. TensorFlow is actually happy to give the user the responsibility - this lets you do things like item assignment for eager tensors, even though TF itself doesn't support that:

In [2]: x
Out[2]: <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>

In [3]: x2   # obtained from `x` via DLPack
Out[3]: tensor([1, 2, 3], dtype=torch.int32)

In [4]: x[0] = 9
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b79ee06b5385> in <module>
----> 1 x[0] = 9

TypeError: 'tensorflow.python.framework.ops.EagerTensor' object does not support item assignment

In [5]: x2[0] = 9

In [6]: x
Out[6]: <tf.Tensor: shape=(3,), dtype=int32, numpy=array([9, 2, 3], dtype=int32)>

EDIT: re JAX default, that's actually changing, the new default will be take_ownership=False - making behaviour similar to other libraries.

rgommers commented 4 years ago

My current thoughts on the Python API are:

Keep the explicit from_dlpack, helps code readers understand what is happening and is a sensible API. asarray / as_tensor is a bit too forgiving, often makes copies, etc. so is less ideal.
Exposing the concept of a capsule to the user seems unnecessary. If from_dlpack would take an object that advertises it supports DLPack (e.g. via __dlpack__), the capsule creation + consumption always go together. That saves code and one does not have to bother with scenarios like the capsule not being consumed or being consumed twice. A capsule itself at the Python level isn't useful for anything. I don't see any downsides, other than having to pick a new name for from_dlpack for backwards compat reasons.

EDIT:

The from_dlpack function belongs in the main namespace (the array API namespace, which is likely not the package main namespace for existing libraries) rather than hidden away in .utils.dlpack or .experimental.dlpack or some such thing.

tqchen commented 4 years ago

I agree that from interface's pov it makes sense (from_dlpack and __dlpack__ combo).

From the implementation's pov, it might still make sense to introduce the capsule as part of the standard (and hide it from the user). Mainly, there is a transition period between creation and consumption. Having a capsule as the intermediate exchange object would avoid memory leak in the rare case when an error happens during ingestion. It also standardizes the way C pointers are stored.

rgommers commented 4 years ago

Having a capsule as the intermediate exchange object would avoid a memory leak in the rare case when an error happens during ingestion. It also standardizes the way C pointers are stored.

I agree, that makes sense. And is the easiest thing to do, current implementations wouldn't need any change except a small Python API tweak.

kkraus14 commented 3 years ago

Raising a point for discussion here: dlpack is a C header and generally requires having a C library underneath the Python library in order to use it. This would prevent a pure Python or PyPy array library, for example, from implementing this exchange protocol in a straightforward way.

Take Numba as an example. Historically, it has had a pure python device array class which uses ctypes to allocate / deallocate memory. In order to implement an exchange protocol based on dlpack, they'd need to write some very non-trivial ctypes code interacting with the Python C-API to increment / decrement reference counts to make the deleter function work as expected. NOTE: There's ongoing work in introducing a small backend C class for device arrays so this likely isn't a real issue in practice for Numba, but wanted to use it as an example.

I'm all for using dlpack for C/C++ libraries, but I'm -1 on using it for a Python interchange protocol.

tqchen commented 3 years ago

Thanks @kkraus14 for the great point

To address some of the concerns about the cost of implementing the exchange via DLPack: given that the C FFI in DLPack is relatively simple, it is generally possible to implement a variant. Additionally, we could provide examples that might help with this kind of Python wrapping in a generic fashion:

Example ctypes impl for numpy https://github.com/dmlc/dlpack/blob/main/apps/from_numpy/main.py
Cython https://github.com/apache/tvm/blob/main/python/tvm/_ffi/_cython/ndarray.pxi

Additionally, for a generic Python array API, we might be able to create an effective Cython-based adapter that exposes arrays as DLPack objects, which might help alleviate the concern (assuming there is such a need, but also see the point below).

One could also argue that any Python array library would need to interact with a C-based interface (assuming the need to call into BLAS or other cases), whether via ctypes, cffi, Cython or Python's C API. So the additional DLPack C dependency may not be huge if we have good reference implementations for these cases, like the examples above.

Finally, the main advantage of such an exchange is the ability to reuse the compilation flow (which usually needs code that targets the C ABI), and it lets the frameworks themselves make use of solutions that come both from Python land and from a more language-agnostic front.

kkraus14 commented 3 years ago

To address some of the concerns about the cost of implementing the exchange via DLPack: given that the C FFI in DLPack is relatively simple, it is generally possible to implement a variant. Additionally, we could provide examples that might help with this kind of Python wrapping in a generic fashion:

Example ctypes impl for numpy https://github.com/dmlc/dlpack/blob/main/apps/from_numpy/main.py

Cython https://github.com/apache/tvm/blob/main/python/tvm/_ffi/_cython/ndarray.pxi

One could also argue that any Python array library would need to interact with a C-based interface (assuming the need to call into BLAS or other cases), whether via ctypes, cffi, Cython or Python's C API. So the additional DLPack C dependency may not be huge if we have good reference implementations for these cases, like the examples above.

Not every array library needs to use BLAS. E.g., the Numba device array does not use BLAS. It does call into libcuda via ctypes to allocate the memory underneath the array.

Making a Python library that has the Cython headers / ctypes classes, where someone can just plug in their class(es) info and it returns a PyCapsule or something similar that handles refcounting in the deleter function, would be very nice.

Finally, the main advantage of such an exchange is the ability to reuse the compilation flow (which usually needs code that targets the C ABI), and it lets the frameworks themselves make use of solutions that come both from Python land and from a more language-agnostic front.

I agree completely that dlpack is great for going between various languages that can talk to C/C++ under the hood; it's just that the current state of how a Python library implements dlpack is quite burdensome.

tqchen commented 3 years ago

Not every array library needs to use BLAS. E.g., the Numba device array does not use BLAS. It does call into libcuda via ctypes to allocate the memory underneath the array.

Right, what I meant is that most array libraries would need to interact with the C ABI (but not necessarily by programming in C/C++). For example, Numba interfaces with the C ABI via ctypes. We certainly do not want to force libraries to use a C compiler though, given there are multiple paths to interface with the C ABI, such as ctypes, Cython, or pybind. So having examples and libraries for each path would also help.

rgommers commented 3 years ago

I'm trying to update my PR for device support, where the initial feedback was that specifying a device object in the API, plus a way to identify specific physical devices (e.g. 'gpu:1'), was too detailed. However, DLPack does have a device index:

typedef struct {
  /*! \brief The device type used in the device. */
  DLDeviceType device_type;
  /*! \brief The device index */
  int device_id;
} DLContext;

Is this device_id guaranteed to be consistent between libraries for all device types, corresponding to the way the ~OS~ driver (e.g. CUDA) labels them?

tqchen commented 3 years ago

In the particular case of CUDA/ROCm, yes; in other cases, not necessarily.

Notably, not all driver APIs have a standard mapping from id to context (e.g. OpenCL). While the ability to create contexts seems more flexible, it usually creates more problems for cross-app collaboration, since different applications usually create and manage their own contexts and won't otherwise work with the contexts that other applications use. Having some form of standardization probably would be useful.

szha commented 3 years ago

In the particular case of CUDA/Rocm yes

Even in the GPU case, knobs like CUDA_VISIBLE_DEVICES can tweak what GPU:1 refers to, so it doesn't always refer to the same physical device in different processes. Within a process, since that variable is read only once at initialization, it is guaranteed that GPU:1 always refers to the same device.
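
The remapping described above can be illustrated with a small pure-Python model. It is a sketch only: `physical_device` is a hypothetical helper that imitates how a comma-separated list of integer indices in CUDA_VISIBLE_DEVICES renumbers devices, and it ignores UUID entries and invalid indices.

```python
def physical_device(logical_id, visible_devices=None):
    """Map a process-local device index to a physical device index.

    `visible_devices` mimics the value of CUDA_VISIBLE_DEVICES (None = unset).
    Assumes a comma-separated list of valid integer indices.
    """
    if visible_devices is None:
        return logical_id
    visible = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    if not 0 <= logical_id < len(visible):
        raise IndexError(f"device {logical_id} is not visible to this process")
    return visible[logical_id]


# Two processes both report device_id == 1, yet refer to different hardware.
print(physical_device(1, "0,1"))
print(physical_device(1, "2,3"))
```

Within one process the mapping is fixed, so comparing device_id values is only risky when arrays cross a process boundary.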

rgommers commented 3 years ago

That's what I suspected. So in practice this will typically work pretty well (cause CUDA is normally consistent, in the absence of messing with env vars), but it may fail. When a library does from_dlpack(...), it will create an array with an associated device, and it may be incorrect. Since libraries will typically have a device associated with each array, you may have cases where devices compare as equal but the data doesn't actually reside on the same device.

rgommers commented 3 years ago

Having some form of standardization probably would be useful.

I'm inclined to add a note of caution to the API standard doc now. Agreed it would be useful and is probably going to become more important over time.

tqchen commented 3 years ago

I agree, just call it out to clarify the status

leofang commented 3 years ago

Sorry to bring up a question if this was already discussed somewhere 😅 I am a newcomer here trying to catch up with the massive discussion:

Is this device_id guaranteed to be consistent between libraries for all device types, corresponding to the way the ~OS~ driver (e.g. CUDA) labels them?

Why don't we look up which device it is through the cudaPointerAttributes/hipPointerAttribute_t struct associated with the device pointer? This would be guaranteed to work on NVIDIA/AMD GPUs, at the driver level, so in theory DLPack doesn't even need to contain this information, just the pointer address and the array metadata. At least this is what CuPy does when encountering unowned memory (allocated from other libraries).

I imagine OpenCL/SYCL might have a similar look-up capability, but I am not familiar enough with them and need to do my homework.

tqchen commented 3 years ago

@leofang The specific property really depends on how the driver is implemented. While it can hold for a unified memory model (the CUDA and ROCm case), such an API is not guaranteed for opaque memory addresses (as with OpenCL, Vulkan, Metal).

leofang commented 3 years ago

Ah I missed it, sorry @tqchen!

Such an API is not guaranteed for opaque memory addresses (as with OpenCL, Vulkan, Metal).

Thanks, it's good to confirm. I suppose OpenCL is the most important player for the purpose of Array API.

rgommers commented 3 years ago

I opened https://github.com/data-apis/array-api/pull/106 to add relevant content from this discussion to the API standard document.

rgommers commented 3 years ago

There's still the issue of where to put DLPack docs; right now it's mostly in the C header. High-level docs like purpose, scope, semantics and Python API are missing in the dmlc/dlpack repo - as discussed higher up at https://github.com/data-apis/consortium-feedback/issues/1#issuecomment-675565223.

Links to implementations and helpful content, like how to put together a ctypes or cffi interface, are mostly contained in this discussion. We could put them in a separate Sphinx-generated site and host it from this org, using the same theme as https://data-apis.github.io/array-api/latest/ and making it a similar API.

Or we could just add more docs to https://github.com/dmlc/dlpack, either in its README or with html docs. It's mostly up to your preference I think @tqchen, what do you think? I'm happy to help either way.

tqchen commented 3 years ago

We are open to both options. Given it is simple enough we agree that we could work to improve https://github.com/data-apis/array-api/pull/106 and cross reference.

rgommers commented 3 years ago

Given it is simple enough we agree that we could work to improve data-apis/array-api#106 and cross reference.

That sounds good to me.
