JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Change type restrictions in cuTENSOR operations #2356

Closed lkdvos closed 5 months ago

lkdvos commented 5 months ago

This PR migrates the type restrictions for the cuTENSOR operations from the definitions of *operation*_execute! and plan_*operation* to the constructor of CuTensorDescriptor. With this implementation, custom types that wrap CuArray objects can specialize just the CuTensorDescriptor constructor while still reusing most of the codebase.
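To illustrate the mechanism described above, here is a minimal sketch of how a downstream package might hook into the exposed constructor. The wrapper type `MyWrapper` is hypothetical, and the exact constructor signature may differ from what this PR merges; this only shows the intended extension point.

```julia
# Hypothetical sketch: a custom wrapper around CuArray that reuses the
# cuTENSOR code paths by specializing only the CuTensorDescriptor constructor.
using CUDA, cuTENSOR

struct MyWrapper{T,N} <: AbstractArray{T,N}
    parent::CuArray{T,N}
end

# Assumed extension point exposed by this PR; a real wrapper could adjust
# sizes, strides, or alignment here instead of forwarding unchanged.
function cuTENSOR.CuTensorDescriptor(a::MyWrapper; kwargs...)
    return cuTENSOR.CuTensorDescriptor(a.parent; kwargs...)
end
```

With this single method, the plan_*operation* and *operation*_execute! machinery would see `MyWrapper` through its descriptor without any further specialization.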

Some questions/remarks I still have:

Note that the current implementation changes the CuTensorDescriptor function quite drastically in order to expose the internal constructor. An alternative way to achieve this goal would be to rename the inner constructor to _CuTensorDescriptor, which would handle only the finalizer etc.

This change also allows cuTENSOR to work almost trivially with StridedCuArrays. I played around with this a bit, and the only issue that shows up when changing DenseCuArray to StridedCuArray is that the data alignment can no longer be guaranteed. I have to admit that I do not have enough experience with this myself, but simply changing alignment=sizeof(eltype(A)) seems to work. Does anyone know of issues associated with this approach? Should I be mindful of severe performance pitfalls? For these reasons I have not included that change here, but if anyone can reassure me that it is fine, I can open a separate PR implementing StridedCuArray support as well.

Any suggestions are definitely more than welcome.

maleadt commented 5 months ago

Note that the current implementation changes the CuTensorDescriptor function quite drastically, in order to expose the internal constructor.

It's a simple enough change that we should carry it, to enable better integration with external packages.

simply changing alignment=sizeof(eltype(A))

That's an underestimation, which could hurt performance. It is probably also better to use Base.datatype_alignment for the minimal guaranteed alignment. However, for performance it's better to track the alignment of the buffer from where it was allocated, or to determine it at run time. CUDA.jl doesn't currently do this either, which could result in misaligned inputs (e.g., when taking a view and passing it to cuTENSOR), since we hard-code an alignment of 128 bytes right now.
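The run-time alternative mentioned above could be sketched as follows. The helper `runtime_alignment` is hypothetical (not part of CUDA.jl), and simply derives the largest power-of-two alignment from the actual device pointer instead of hard-coding 128 bytes or underestimating with sizeof(eltype):

```julia
# Hypothetical helper: determine a safe alignment at run time from the
# device pointer itself, capped at the 128 bytes cuTENSOR cares about.
using CUDA

function runtime_alignment(a::CuArray; max_align::Int=128)
    addr = UInt(pointer(a))
    # Largest power of two dividing the address; pointers are assumed
    # non-null here, so trailing_zeros is well-defined for this purpose.
    return min(1 << trailing_zeros(addr), max_align)
end
```

A buffer allocated by the CUDA allocator would typically report the full 128 bytes, while a view with an odd element offset would report something smaller, which is exactly the case where the current hard-coded value can go wrong.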

Let me know if you still want to make changes to this PR, otherwise I'd be OK with merging it as is.

lkdvos commented 5 months ago

For me this is good to go. I am going to have to experiment with the alignment anyway for the StridedView implementation, but that is now accessible through the CuTensorDescriptor constructor, so it should work. Thanks!