JuliaORNL / JACC.jl

CPU/GPU parallel performance portable layer in Julia via functions as arguments
MIT License

Improve block and thread calculations and invoke only if in range #76

Closed PhilipFackler closed 6 months ago

PhilipFackler commented 6 months ago

This only applies to the 1-D `parallel_for` in CUDA. If this is acceptable to you all, the same change should be made wherever else it applies.

PhilipFackler commented 6 months ago

See #57. This is a problem in 1-D also. It's actually not "wrong": it's just that if your N is greater than the threads per block and not divisible by it, you pass indices to the kernel that are out of bounds. Adding the conditional doesn't seem to affect performance, which makes sense, since in every block except the last all threads follow the true branch.
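
A minimal sketch of the pattern being described (hypothetical names, not JACC.jl's exact implementation): compute the block count with ceiling division so all N elements are covered, then guard the kernel body so the surplus threads in the last block do nothing.

```julia
using CUDA

# Hypothetical 1-D kernel wrapper: invoke the user function only if the
# global index is in range, since blocks * threads may exceed N.
function _parallel_for_kernel(N, f, x...)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= N          # "invoke only if in range"
        f(i, x...)
    end
    return nothing
end

function parallel_for_1d(N::Integer, f, x...)
    threads = min(N, 256)      # assumed threads-per-block cap
    blocks = cld(N, threads)   # ceiling division: enough blocks for all N
    @cuda threads = threads blocks = blocks _parallel_for_kernel(N, f, x...)
end
```

For example, with N = 1000 and 256 threads per block, `cld(1000, 256)` launches 4 blocks, i.e. 1024 threads; without the `i <= N` check, indices 1001 through 1024 would reach the kernel body out of bounds.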

williamfgc commented 6 months ago

Thanks @PhilipFackler eventually we will revisit this when doing performance testing, with the apps we are targeting.

williamfgc commented 6 months ago

Test this please

williamfgc commented 6 months ago

Also, are we missing the same for AMDGPU.jl?

PhilipFackler commented 6 months ago

> Also, are we missing the same for AMDGPU.jl?

Yes, but I don't have a way to check that out locally