I have a vague memory of this alignment requirement being pointed out before and the answer being that it's unnecessary and can be ignored. Some digging turned up https://github.com/numpy/numpy/issues/19013#issuecomment-841745443, where you (@leofang) said: "please ignore the alignment stuff. If we are talking about host-device copies, it's an implementation detail (of CUDA, HIP, or any sane memory pool implementation sitting on top). CUDA/HIP's copy APIs do not care about alignments. DLPack does not. CUDA Array Interface does not. No one does."
Thanks, Ralf, yes now I recall, I guess it's a rediscovery then 😅 Maybe I just need to add a note to this page for better visibility? https://data-apis.org/array-api/latest/design_topics/data_interchange.html
This is certainly something that is worth discussing.

To give some background on the original rationale in DLPack:

Such a requirement is not necessary for host/device copies of any kind, as was rightfully pointed out above: copy APIs do not care about alignment. The alignment spec comes in handy when we build kernels that leverage those data structures, since vectorization usually comes with stricter alignment requirements.

Of course there is a need to support sliced arrays, which do not have the proper alignment. A similar problem arises in the context of OpenCL, where the data pointer is opaque (you cannot add an offset directly to that field). The `byte_offset` field was meant to enable such an offset in the context of OpenCL, and in other cases where offsetting can produce an unaligned array.
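To make that concrete, here is a minimal sketch (my own illustration, assuming the standard `dlpack.h` definitions and a CPU tensor; the function name and the fixed slice start are made up) of exporting a sliced 1-D float64 view while keeping `data` at the aligned allocation base and expressing the slice start through `byte_offset`:

```c
#include <dlpack/dlpack.h>
#include <stdint.h>

/* Sketch only: describe the view x[2:] of a 1-D float64 array while keeping
 * `data` pointing at the (aligned) allocation base; the view's start is
 * expressed via `byte_offset` instead of bumping the pointer. */
static void fill_sliced_view(DLTensor *t, void *alloc_base,
                             int64_t *shape, int64_t *strides) {
    t->data = alloc_base;                 /* aligned base of the allocation */
    t->byte_offset = 2 * sizeof(double);  /* view starts at element 2       */
    t->device.device_type = kDLCPU;
    t->device.device_id = 0;
    t->ndim = 1;
    t->dtype.code = kDLFloat;
    t->dtype.bits = 64;
    t->dtype.lanes = 1;
    t->shape = shape;     /* e.g. {n - 2}: length of the view      */
    t->strides = strides; /* e.g. {1}: strides given in elements   */
}
```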
> The alignment spec comes in handy when we build kernels that leverage those data structures, since vectorization usually comes with stricter alignment requirements.
Naive question - is there any advantage of enforcing 256-byte alignment plus always checking `byte_offset`, over not requiring alignment at all, or only requiring alignment on a more "normal" boundary? 256 bytes is pretty unusual I think - the linked issue for example explains that Arrow requires 8-byte and recommends 64-byte alignment. I can't remember seeing something >64 bytes before elsewhere.
And if there is such an advantage, I guess you're proposing to start checking it in each implementation? This may require a slow path, with emitting warnings for at least a year before starting to raise an exception.
Larger alignment generally has the advantage of making vectorization simpler (in the presence of AVX2 or AVX-512) on CPU. Similarly, on GPU it can enable more aligned loads, e.g. via float4. The particular number was picked mainly because it was CUDA's malloc default alignment.

I agree starting with a warning would be useful, given that most implementations do not start with such an assumption.
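As a small illustration of why kernels care (not taken from any particular library), a dispatcher might only select a vectorized path when the effective pointer meets a wider alignment, e.g. 16 bytes for float4-style loads:

```c
#include <stdint.h>

/* Illustrative helper: true if (data + byte_offset) meets `alignment`,
 * which is what a kernel could check before choosing a vectorized
 * (float4 / AVX) code path instead of a scalar fallback. */
static int meets_alignment(const void *data, uint64_t byte_offset,
                           uint64_t alignment) {
    return (((uintptr_t)data + byte_offset) % alignment) == 0;
}
```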
> Of course there is a need to support sliced arrays, which do not have the proper alignment.
Yes, so why even discuss enforcing any alignment beyond the itemsize? Slicing is a central feature, so almost all libraries have to anticipate data that is at most itemsize-aligned. I don't understand how you can possibly do any more than recommend allocating data with a larger alignment (e.g. because certain compute libs may not accept the data otherwise – which is fair).
Calculating alignment, and rejecting data that is not aligned, is not hard! You could also ask the exporter to provide the information (e.g. as a power-of-two alignment – many exporters likely already store it in some form; I would like to for NumPy) or you expect the importer to calculate it.

To be nice: provide a helper for it (which also helps to correctly handle length `<= 1` dimensions, as their stride does not matter for alignment/contiguity).
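A sketch of what such a helper could look like, written against the DLPack struct (my own illustration, not an existing API): it reports the alignment guaranteed for every element, and skips length `<= 1` dimensions since their stride never contributes an offset.

```c
#include <dlpack/dlpack.h>
#include <stdint.h>

/* Sketch: largest power of two dividing the address of every element.
 * Length-0/1 dimensions are skipped because their stride never adds an
 * offset; NULL `strides` means compact row-major, i.e. element offsets
 * are multiples of the itemsize. */
static uint64_t element_alignment(const DLTensor *t) {
    uint64_t itemsize = ((uint64_t)t->dtype.bits / 8) * t->dtype.lanes;
    uint64_t bits = (uint64_t)(uintptr_t)t->data + t->byte_offset;
    for (int32_t i = 0; i < t->ndim; ++i) {
        if (t->shape[i] <= 1)
            continue;
        bits |= t->strides ? (uint64_t)t->strides[i] * itemsize : itemsize;
    }
    return bits ? (bits & (~bits + 1)) : UINT64_MAX;  /* lowest set bit */
}
```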
Thanks @seberg. In the specific case of DLPack, there is also a need to support an opaque data pointer (in the case of Vulkan and OpenCL); in those cases `byte_offset` was used to support slicing in a native way (while the data pointer itself remained aligned to the original allocation).

Just trying to bring up context and background for both cases here, not necessarily arguing one versus the other.
The use of `byte_offset` for opaque data pointers is interesting. Do you suggest to also use it for non-opaque data pointers to enforce/allow a bigger allocation alignment guarantee than data alignment? Is that useful? Scanning that initial link @leofang posted, it sounded like it might be.

How is the current header comment/standard meant? Are we supposed to use `byte_offset` for non-opaque CPU data pointers or not?
Based on the plain-text interpretation of the header comment, if the data is not aligned, using `byte_offset` is certainly encouraged, to unify that with the opaque-pointer situation. Of course, on CPU there is no real advantage, as checking the pointer itself is equally simple.

In reality, I think a lot of the current exchange implementations ignore that constraint, so it becomes an open question whether we encourage such a convention, or allow people to pass in their own alignment preference.
> The use of `byte_offset` for opaque data pointers is interesting. Do you suggest to also use it for non-opaque data pointers to enforce/allow a bigger allocation alignment guarantee than data alignment? Is that useful? Scanning that initial link @leofang posted, it sounded like it might be.
cc @shwina @jrhemstad
Yes this would be really helpful. Knowing the base allocation pointer, offset, and alignment often allows for much more efficient code.
@kkraus14 OK, that makes sense to me. I am still not sure if you would want to expect the importer to calculate both base and non-base pointer alignment? Or should DLPack grow a field to indicate alignment (e.g. for opaque pointers)? If it is padding and alignment, would that mean that the exporter must indicate it anyway (i.e. for padding)?

I suppose alignment could be device specific; maybe some devices' `malloc` provides those guarantees always?
Would you suggest changing the dlpack header to explicitly say:

- All exported data should be at least aligned to (device alignment for the datatype OR `itemsize`; `itemsize` alignment is good, but could be slightly "tricky" e.g. for complex longdoubles on CPU)
- If the data is a view, `byte_offset` should be used, so that the data pointer represents the full allocation alignment
- Adding an indicator for the alignment (padding?) of the data pointer could be useful?

> I am still not sure if you would want to expect the importer to calculate both base and non-base pointer alignment?
I would assume that allocations are aligned, which means the base pointer is aligned, but the non-base pointer is not necessarily aligned and is determined by the offset.
> Or should DLPack grow a field to indicate alignment (e.g. for opaque pointers)? If it is padding and alignment, would that mean that the exporter must indicate it anyway (i.e. for padding)?
I would love to specify both padding and alignment, but beggars can't be choosers 😄.
> I suppose alignment could be device specific; maybe some devices' `malloc` provides those guarantees always?
Memory pools that don't necessarily follow the alignment / padding guarantees of the base `malloc` implementation are super common.
> Would you suggest changing the dlpack header to explicitly say:
> - All exported data should be at least aligned to (device alignment for the datatype OR itemsize)
I don't know if we want to go to the level of device alignment given the number of memory pools out there. Requiring the alignment to be a multiple of the itemsize makes sense to me though.
> - If the data is a view, `byte_offset` should be used, so that the data pointer represents the full allocation alignment
Yes
> - Adding an indicator for the alignment (padding?) of the data pointer could be useful?
Yes, adding both would be ideal and would allow more optimized code
Ah, so `padding` actually means "you are allowed to write over that data" and not just "you can safely read more"? The only thing that makes me a bit unsure about itemsize alignment is that I realized NumPy would currently have some trouble guaranteeing it for `complex longdouble`.
> Ah, so `padding` actually means "you are allowed to write over that data" and not just "you can safely read more"?
Ideally, yes. I imagine the read side is more important, but generally being able to leverage vector loads / stores is typically important for optimizations.
> The only thing that makes me a bit unsure about itemsize alignment is that I realized NumPy would currently have some trouble guaranteeing it for `complex longdouble`.
Maybe there's an upper limit above which alignment relative to itemsize doesn't matter anymore? I.e. anything above 64 bits or 128 bits only needs 64-bit or 128-bit alignment?
At this point, all I care about is having a definite answer on which exports NumPy is supposed to reject due to insufficient alignment (or read-only data, for that matter), and that this answer is added to `dlpack.h`.
PyTorch has this:

```c++
atDLMTensor->tensor.dl_tensor.byte_offset = 0;
```
And it supports slicing, right? So, torch can't possibly have export guarantees beyond the `itemsize`. It seems safe to assume that nobody else cared about it enough to provide alignment beyond what their library guarantees (otherwise, this question would have come up more often).
Now, I don't know what their allocation strategy is, so I don't know whether they even guarantee itemsize-alignment for `complex128` on 32-bit platforms. Since GCC has this to say:

> The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems). (link)

which guarantees 8 bytes while a complex double has 16. So even itemsize-aligned is probably broken (albeit only for complex doubles).
As for their `fromDLPack` function just below: it doesn't even check if `byte_offset != 0`, so maybe we should just always keep it 0 and not think about it for CPU data... (unless you want to open a bug report at pytorch after clarifying this).
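For reference, handling `byte_offset` on the import side is a one-liner; this is just a sketch of the address arithmetic the header describes, not PyTorch's actual code:

```c
#include <dlpack/dlpack.h>

/* The first element lives at data + byte_offset, not at data itself; an
 * importer that ignores byte_offset reads the wrong memory whenever an
 * exporter uses a nonzero offset. */
static void *first_element_ptr(const DLTensor *t) {
    return (char *)t->data + t->byte_offset;
}
```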
As far as I am aware there is no versioning for `__dlpack__` or `__dlpack_device__` and no way to move the API/ABI forward without a dance using some kind of `try/except`. So, while I find it interesting, I will give up discussing what may be nice to have until that is clarified. (Yes, I am cross. I had explicitly warned about this.)
To be painfully clear: I very much still think that DLPack in its current form needs both extension and better extensibility to be the good implicit protocol that data-apis should in my opinion aim for. (I really don't hate DLPack, but I don't see that was ever tuned or vetted for this task.)
With that: Sorry, I am going to try and leave you all in peace now... If you want to consider extending DLPack liberally and with versioning, I will be interested (but also happy to stay away to leave you in peace – seriously, just tell me). Until then, I will simply try to stop caring about it.
> As far as I am aware there is no versioning for `__dlpack__` or `__dlpack_device__` and no way to move the API/ABI forward without a dance using some kind of `try/except`.
Actually this is not my first time hearing the very same comment. I do believe we should get this and other DLPack issues (tracked in the meta-issue https://github.com/dmlc/dlpack/issues/74) addressed for the v2021 milestone.
> Now, I don't know what their allocation strategy is, so I don't know whether they even guarantee itemsize-alignment for `complex128` on 32-bit platforms.
PyTorch doesn't support 32-bit platforms AFAIK. It's not impossible that there's a custom armv7 mobile build or some such thing floating around, but certainly there's no wheels, conda packages or CI for 32-bit platforms.
EDIT: confirmed, trying to build PyTorch for a 32-bit platform will raise an exception immediately. Only `libtorch` (C++) has a 32-bit build.
> As for their `fromDLPack` function just below: it doesn't even check if `byte_offset != 0`, so maybe we should just always keep it 0 and not think about it for CPU data...
This may be true for other libraries as well. It seems clear from the discussion above that there's a desire to have alignment better specified and in the future enforced. It looks to me like this will need gradual evolution:

1. Clarify in `dlpack.h` how to deal with alignment. I opened a PR for that: https://github.com/dmlc/dlpack/pull/83
2. Update the import functions (`from_dlpack`) in all known implementations to correctly handle byte offsets, and give warnings for non-aligned input (a sketch of this is below).
3. Update the export functions (`__dlpack__` / `to_dlpack`) to have an aligned base pointer and nonzero `byte_offset`.
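A rough sketch of what step 2 could look like on the importer side (the function name and warning text are made up; 256 matches the current `dlpack.h` wording):

```c
#include <dlpack/dlpack.h>
#include <stdint.h>
#include <stdio.h>

/* Step 2 sketch: always apply byte_offset, and only warn (for now) when
 * the resulting pointer does not meet the alignment the header asks for. */
static void *import_data_ptr(const DLTensor *t) {
    char *p = (char *)t->data + t->byte_offset;
    if (((uintptr_t)p % 256) != 0)
        fprintf(stderr,
                "warning: DLPack data is not 256-byte aligned; "
                "this may become an error in a future release\n");
    return p;
}
```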
> > The use of `byte_offset` for opaque data pointers is interesting. Do you suggest to also use it for non-opaque data pointers to enforce/allow a bigger allocation alignment guarantee than data alignment? Is that useful? Scanning that initial link @leofang posted, it sounded like it might be.
>
> Yes this would be really helpful. Knowing the base allocation pointer, offset, and alignment often allows for much more efficient code.
It looks like the answer changed here (or at least my reading of it): (xref https://github.com/dmlc/dlpack/pull/83#issuecomment-964382536). Let me try to summarize:

- In practice, importers today cannot rely on any particular alignment of `data` and `data + byte_offset`. Hence aligned loads are not possible today unless `data + byte_offset` happens to be aligned already.

These are the options:

- (A1) Require the `data` pointer to always be aligned (using nonzero `byte_offset`), and do the gradual evolution plan in my comment above.
- (A2) Drop the alignment requirement from `dlpack.h`; no library needs to make any changes (except if current handling of `byte_offset` is buggy, like @seberg pointed out for PyTorch). NumPy and other new implementers then just use `byte_offset=0` always (easiest), and we're done.
- (A3) Like A2, but additionally add an optional way for the exporter to communicate alignment (this needs DLPack versioning/extensibility work).

The current status is that the fine print in `dlpack.h` requires alignment (option A1), but no one adheres to it or enforces it. This state is not very useful: it requires a >1 year evolution plan, and apparently there's no gain because of the bullet above. So it looks like the best choices are either A2 or A3. A3 seems strictly better than A2, and most of the work it requires (versioning/extensibility) is work we wanted to do for other reasons already.
So here's a new proposal:

1. Don't require any particular alignment; exporters keep using `byte_offset = 0` and `data` pointing to the first element in memory.
2. Add a note in `dlpack.h` about this topic to reflect: current state, desired future state, and a link to a new issue on the DLPack repo with more info (outcome of this discussion to be summarized on that issue).

If we're making longer term changes surrounding optional alignment in A3, it would be nice to have optional padding as well, so that we know how many bytes past `data + byte_offset + size` can be safely read, to allow further optimization as well.
@leofang Does it make sense to keep this issue open? I'm not sure of the current status or whether things/perspectives have changed since 2021.
Recently we found that DLPack has this requirement noted in the header: https://github.com/dmlc/dlpack/blob/a02bce20e2dfa0044b9b2ef438e8eaa7b0f95e96/include/dlpack/dlpack.h#L139-L141. Would this be an issue for all adopting libraries? As far as I know, CuPy doesn't do any alignment check and takes the pointer (`DLTensor.data`) as is, and IIRC quite a few other libraries are also doing this.

cc: @seberg @rgommers @tqchen @kmaehashi @oleksandr-pavlyk @jakirkham @edloper @szha