Open ancahamuraru opened 9 years ago
As this will increase register pressure, I suggest trying 128 threads/block too. Additionally, reduction will become tricky without the lane-shuffle ops.
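For illustration, a common fallback when lane-shuffle ops are unavailable is a tree reduction through local (shared) memory with a barrier after each step. The following is only a minimal host-side sketch in plain C that emulates that pattern, not the GROMACS kernel code. `WG_SIZE`, `scratch`, and `workgroup_reduce_sum` are hypothetical names, and the inner loop stands in for work-items that would run in parallel on the device.

```c
#include <stddef.h>

/* Hypothetical work-group size; 128 is one of the sizes discussed in this issue. */
#define WG_SIZE 128

/*
 * Host-side emulation of a local-memory tree reduction.
 * scratch[] stands in for OpenCL __local memory. Each pass of the outer loop
 * corresponds to one reduction step, which on the device would be followed by
 * barrier(CLK_LOCAL_MEM_FENCE).
 */
static float workgroup_reduce_sum(float scratch[WG_SIZE])
{
    for (size_t stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        /* On a GPU, the work-items with lid < stride would execute this body in parallel. */
        for (size_t lid = 0; lid < stride; ++lid) {
            scratch[lid] += scratch[lid + stride];
        }
        /* Device code would synchronize here: barrier(CLK_LOCAL_MEM_FENCE); */
    }
    return scratch[0]; /* the total ends up in element 0 */
}
```

Compared with a shuffle-based reduction, this variant needs log2(WG_SIZE) steps, each with a work-group barrier, plus one local-memory slot per work-item, which is part of why the reduction gets trickier at larger block sizes.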
Thanks for the comment. That was my mistake: I forgot to mention 128 threads/block. The issue title and description are now updated.
Update the OpenCL kernel for 128/256 threads/block based on the equivalent CUDA kernel; see commit f2b9db2 on the GROMACS master branch: https://github.com/gromacs/gromacs/commit/f2b9db2
Evaluate the performance of the new kernel for AMD and NVIDIA GPUs and decide on the final version or versions of the OpenCL kernel that will be used.