dmlc / dlpack

common in-memory tensor structure
https://dmlc.github.io/dlpack/latest
Apache License 2.0
905 stars 133 forks source link

Clarify alignment requirement #99

Open tirthasheshpatel opened 2 years ago

tirthasheshpatel commented 2 years ago

DLPack has this comment in the header file:

  /*!
   * \brief The data pointer points to the allocated data. This will be CUDA
   * device pointer or cl_mem handle in OpenCL. It may be opaque on some device
   * types. This pointer is always aligned to 256 bytes as in CUDA. The
   * `byte_offset` field should be used to point to the beginning of the data.
   *
   * Note that as of Nov 2021, multiply libraries (CuPy, PyTorch, TensorFlow,
   * TVM, perhaps others) do not adhere to this 256 byte aligment requirement
   * on CPU/CUDA/ROCm, and always use `byte_offset=0`.  This must be fixed
   * (after which this note will be updated); at the moment it is recommended
   * to not rely on the data pointer being correctly aligned.
   * ...
   */

This was discussed in https://github.com/data-apis/array-api/issues/293 and came up in NumPy https://github.com/numpy/numpy/issues/20338.

This comment by @rgommers summarizes the issue very well. Quoting the options in the comment:

These are the options:

  1. A1: required alignment. Require the data pointer to always be aligned (using nonzero byte_offset), and do the gradual evolution plan in my comment above.

  2. A2: no alignment. remove the allocation requirement completely from dlpack.h. no library needs to make any changes (except if current handling of byte_offset is buggy, like @seberg pointed out for PyTorch). NumPy and other new implementers then just use byte_offset=0 always (easiest), and we're done.

  3. A3: optional alignment. Do not require alignment, but add a way to communicate from the producer to the consumer what the alignment of the data is.

The current status is that the fine print in dlpack.h requires alignment (option A1), but no one adheres to it or enforces it. This state is not very useful: it requires a >1 year evolution plan, and apparently there's no gain because of the third bullet above. So it looks like the best choices are either A2 or A3. A3 seems strictly better than A2, and most of the work it requires (versioning/extensibility) is work we wanted to do for other reasons already.

So here's a new proposal:

  • Decide that the long-term desired state is A3: optional alignment

  • NumPy and other new implementers to do whatever is simplest, i.e. to use byte_offset = 0 and data pointing to the first element in memory.

  • Update the comment in dlpack.h about this topic to reflect: current state, desired future state, and a link to a new issue on the DLPack repo with more info (outcome of this discussion to be summarized on that issue).

I agree with @rgommers that A3 is the best option because: most libraries don't care about alignment and we can communicate the alignment to some that do using the __dlpack_info__ dunder or a request API under discussion in #34. If others agree, then let's add that to the spec and remove the comment from the header.

I noticed Tensorflow 2.8.0 crashed with NumPy 1.22.3 (on Ubuntu 20.04) due to an alignment issue:

import numpy, tensorflow

x = numpy.arange(5, dtype='int64')
tf_x = tensorflow.experimental.dlpack.from_dlpack(x.__dlpack__())
tf_x[1]  # Fatal Failure
# 2022-03-15 18:29:10.587129: F ./tensorflow/core/framework/tensor.h:776] Check failed: IsAligned()
# Aborted

This happens for float16, float32, and int64.

(This crash doesn't happen for Tensorflow 2.7.0 and NumPy 1.22.3 but for all other versions, it does)

So, it'd be good to communicate alignment to the importing library. I don't think it'd be difficult to calculate that at importer side but it could be useful for opaque pointers.