Closed PhilipFackler closed 6 months ago
See #57. This is a problem in 1d also. It's not actually "wrong"; it's just that if your N is greater than the number of threads per block and not divisible by it, you pass indices to the kernel that are out of bounds. Adding the conditional doesn't seem to affect performance, which makes sense, since all threads in every block except the last one would follow the true branch.
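To make the index arithmetic concrete, here is a minimal sketch in plain Python that simulates a 1d launch on the CPU. It is not the code from this PR: `launch_config`, `run_kernel`, and the limit of 512 threads per block are hypothetical names and values chosen for illustration. It also assumes the block count is rounded up and that global indices are 1-based, as they are in Julia.

```python
import math

def launch_config(n, max_threads):
    """Mimic a 1d launch: cap threads per block, round the block count up."""
    threads = min(n, max_threads)
    blocks = math.ceil(n / threads)
    return threads, blocks

def run_kernel(n, max_threads, guarded):
    """Return the 1-based global indices on which the kernel body would run."""
    threads, blocks = launch_config(n, max_threads)
    touched = []
    for block in range(1, blocks + 1):
        for thread in range(1, threads + 1):
            i = (block - 1) * threads + thread  # global index, 1-based
            if guarded and i > n:
                continue  # the added conditional: surplus threads do nothing
            touched.append(i)
    return touched

# Assumed example values: N is larger than the block size and not a multiple of it.
n, max_threads = 1000, 512
unguarded = run_kernel(n, max_threads, guarded=False)
guarded = run_kernel(n, max_threads, guarded=True)
out_of_bounds = [i for i in unguarded if i > n]
```

With these assumed values the launch uses 2 blocks of 512 threads, so 1024 threads are started for 1000 elements. Every thread in the first block passes the check, and only the tail of the last block is filtered out, which is consistent with the conditional having little effect on performance.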
Thanks @PhilipFackler. We will eventually revisit this when doing performance testing with the apps we are targeting.
Test this please
Also, are we missing the same for AMDGPU.jl?
Yes, but I don't have a way to check that out locally
This only applies to 1d parallel_for in CUDA. If this is acceptable to you all, a similar change should be made wherever else it's applicable.