Initializing a TMA descriptor through the driver APIs (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html) is tedious and error-prone. We need a way to abstract it away, which aligns well with the mission of cuda.core. It would also make it easier for JIT compilers to consume TMA descriptors and incorporate them into their compilation pipelines.
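For context, here is a rough sketch of what the raw driver call looks like today through cuda-python's `cuda.bindings` layer. This is illustrative only: the shapes, strides, and the device pointer `d_ptr` are made-up assumptions, and the call requires a GPU and an allocated, aligned buffer to actually succeed.

```python
# Illustrative sketch of encoding a 2D tiled TMA descriptor with the raw
# driver API. All shapes/strides are example values; `d_ptr` is assumed
# to be a suitably aligned device pointer to a 256x256 float32 array
# allocated elsewhere.
from cuda.bindings import driver

GMEM_W, GMEM_H = 256, 256   # global tensor extent, in elements
SMEM_W, SMEM_H = 64, 64     # shared-memory box (tile) extent, in elements

res, tensor_map = driver.cuTensorMapEncodeTiled(
    driver.CUtensorMapDataType.CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
    2,                        # tensorRank
    d_ptr,                    # globalAddress
    [GMEM_W, GMEM_H],         # globalDim, in elements
    [GMEM_W * 4],             # globalStrides, in bytes (rank - 1 entries)
    [SMEM_W, SMEM_H],         # boxDim: the tile copied per TMA operation
    [1, 1],                   # elementStrides
    driver.CUtensorMapInterleave.CU_TENSOR_MAP_INTERLEAVE_NONE,
    driver.CUtensorMapSwizzle.CU_TENSOR_MAP_SWIZZLE_NONE,
    driver.CUtensorMapL2promotion.CU_TENSOR_MAP_L2_PROMOTION_NONE,
    driver.CUtensorMapFloatOOBfill.CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE,
)
assert res == driver.CUresult.CUDA_SUCCESS
```

Each of these arguments carries alignment and validity constraints that are only checked at encode (or worse, at use) time, which is exactly the error-proneness described above.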
In my understanding, there are two (implicit?) requirements for this to be useful:

1. Creating/initializing a TMA object on the host
2. Passing the object to the cuda.core.launch() API as a kernel argument
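Concretely, the two requirements above might look like the following from the user's side. Everything here is hypothetical: the `TensorMap` name, its constructor parameters, and having `launch()` accept it directly as a kernel argument are proposals sketched for discussion, not existing cuda.core API.

```python
# Hypothetical usage sketch -- `TensorMap` does not exist in cuda.core today.
from cuda.core.experimental import Device, LaunchConfig, launch
from cuda.core.experimental import TensorMap  # proposed abstraction

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Requirement 1: create/initialize the TMA object on the host.
# `buf` is assumed to be a device buffer allocated elsewhere.
tma = TensorMap.tiled(
    buf,
    dtype="float32",
    global_dims=(256, 256),
    box_dims=(64, 64),   # tile shape staged into shared memory
)

# Requirement 2: pass it to launch() as an ordinary kernel argument;
# cuda.core would marshal it to the kernel as a CUtensorMap by value.
config = LaunchConfig(grid=16, block=128)
launch(stream, config, kernel, tma)
```

A design along these lines would also give JIT compilers a single well-defined object to lower, rather than eleven loosely coupled driver-call arguments.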