Fix getValidWorkDiv for CUDA and ROCm [I] #2222

alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:

https://alpaka.readthedocs.io

Mozilla Public License 2.0

337 stars 69 forks source link

Fix getValidWorkDiv for CUDA and ROCm [I] #2222 #2251

Closed mehmetyusufoglu closed 1 week ago

mehmetyusufoglu commented 3 months ago

getValidWorkDiv has a bug. It only considers device hard properties ( TApi::getDeviceProperties()); does not consider the kernel function. Actually kernel function properties can limit number of threads per block.

In the case of CUDA and ROCm, the function getValidWorkDiv should use TApi::funcGetAttributes(&funcAttrs, kernelName) together with TApi::getDeviceProperties().

I have added 3 functions. getFunctionAttributes,getValidWorkDivKernel and isValidWorkDivKernel. One function intrinsic to Alpaka changed: subDivideGridElems. Now has an additional argument; "threads per block", if the value is zero the function uses device hard properties as before.

Next PR: The 2 functions getValidWorkDivKernel and isValidWorkDivKernel will be used in Examples and Tests instead of getValidWorkDiv and isValidWorkDiv at the next PR. The names can be determined.

Credits to @psychocoderHPC for the PR. Tests: The tests for 1D and 2D workdivs were added.

psychocoderHPC commented 2 months ago

offline discussed in the developer meeting: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaFuncAttributes.html#structcudaFuncAttributes provides more properties than the maximum number of threads per block. We should at least design the API so that we can get more than one property.

maxThreadsPerBlock and maxDynamicSharedSizeBytes are what all our backends can provide. I am not fully sure if for CPU backends maxDynamicSharedSizeBytes must be reduced by the compile time shared memory, if so it will be hard to provide the correct amount of dynamic shared memory.

mehmetyusufoglu commented 2 months ago

offline discussed in the developer meeting: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaFuncAttributes.html#structcudaFuncAttributes provides more properties than the maximum number of threads per block. We should at least design the API so that we can get more than one property.

maxThreadsPerBlock and maxDynamicSharedSizeBytes are what all our backends can provide. I am not fully sure if for CPU backends maxDynamicSharedSizeBytes must be reduced by the compile time shared memory, if so it will be hard to provide the correct amount of dynamic shared memory.

Done. A structure is returned for GPU and CPUs. For CPUs only maxthreadsperblock field is used and it is 1, for GPU backends all the values accessible by the Cuda and Hip APIs are set and usable.

psychocoderHPC commented 1 month ago

I check our current code. All implementations of TaskKernel* using a tuple to store the arguments internally therefore using the "kernelBoundle" object which is unifying this approche will reduce the possibility of implementation issues and should not be negative for the performance compared to the current state of the code.

fwyzard commented 1 month ago

IMHO the current approach is not correct: the values returned by getFunctionAttributes, and so in turn the values returned by getValidWorkDivForKernel, may depend on the current device, not just on the kernel.

getValidWorkDivForKernel and getFunctionAttributes should take an extra argument that is the Device to be queried.

psychocoderHPC commented 2 weeks ago

please rebase against dev to get the fix from #2294 in and remove draft from the PR if it is ready

mehmetyusufoglu commented 2 weeks ago

please rebase against dev to get the fix from #2294 in and remove draft from the PR if it is ready

Done, thanks.