Closed: mehmetyusufoglu closed this 1 week ago
Discussed offline in the developer meeting: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaFuncAttributes.html#structcudaFuncAttributes provides more properties than just the maximum number of threads per block. We should at least design the API so that we can get more than one property.

`maxThreadsPerBlock` and `maxDynamicSharedSizeBytes` are what all of our backends can provide. I am not fully sure whether, for CPU backends, `maxDynamicSharedSizeBytes` must be reduced by the compile-time shared memory; if so, it will be hard to provide the correct amount of dynamic shared memory.
Done. A structure is returned for GPUs and CPUs. For CPUs only the `maxThreadsPerBlock` field is used and it is 1; for GPU backends all the values accessible via the CUDA and HIP APIs are set and usable.
I checked our current code. All implementations of `TaskKernel*` use a tuple to store the arguments internally, so using the `KernelBundle` object to unify this approach will reduce the possibility of implementation issues and should not hurt performance compared to the current state of the code.
IMHO the current approach is not correct: the values returned by `getFunctionAttributes`, and so in turn the values returned by `getValidWorkDivForKernel`, may depend on the current device, not just on the kernel. `getValidWorkDivForKernel` and `getFunctionAttributes` should take an extra argument: the device to be queried.
Please rebase against dev to get the fix from #2294 in, and remove draft from the PR if it is ready.
Done, thanks.
`getValidWorkDiv` has a bug: it only considers the device's hardware properties (`TApi::getDeviceProperties()`) and does not consider the kernel function. Kernel function properties can also limit the number of threads per block. In the case of CUDA and ROCm, `getValidWorkDiv` should use `TApi::funcGetAttributes(&funcAttrs, kernelName)` together with `TApi::getDeviceProperties()`.

I have added 3 functions: `getFunctionAttributes`, `getValidWorkDivKernel` and `isValidWorkDivKernel`. One function intrinsic to alpaka changed: `subDivideGridElems` now has an additional argument, "threads per block"; if the value is zero, the function uses the device hardware properties as before.

Next PR: the two functions `getValidWorkDivKernel` and `isValidWorkDivKernel` will be used in the examples and tests instead of `getValidWorkDiv` and `isValidWorkDiv`. The names can still be determined.

Credits to @psychocoderHPC for the PR. Tests: tests for 1D and 2D work divisions were added.