Closed luraess closed 6 months ago
This is an important point, thank you @luraess !! We have to make a few modifications to AMDGPU usage according to the "new" syntax. Apart from changing `groupsize` and `gridsize`, I saw that there are some changes for device synchronization. I am not sure if kernel launches are still assumed to be synchronous by default; that is another test to do before using the new version of AMDGPU.jl.
> We have to make a few modifications on AMDGPU according to the "new" syntax.
Yes, and I suspect this may address part of the perf issue you reported wrt tuning kernel launch params.
> changes for device synchronization
The behaviour of AMDGPU wrt synchronisation should be fairly similar to CUDA's. Unless specified otherwise, kernels are launched on the task-local default device and normal-priority stream. On the default device and default queue, kernel execution is ordered (and does not need explicit synchronisation). `AMDGPU.synchronize()`, as in CUDA, syncs the stream and not the device. A lower-level device sync is also available but is a heavier operation.
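To make the distinction concrete, here is a minimal sketch of stream-level vs device-level synchronisation with AMDGPU.jl (function names taken from the current AMDGPU.jl API; array sizes are illustrative):

```julia
using AMDGPU

a = ROCArray(zeros(Float32, 1024))
a .= 1f0                       # runs asynchronously on the task-local default stream

AMDGPU.synchronize()           # waits on the current stream only, like CUDA.synchronize()
AMDGPU.device_synchronize()    # heavier: waits for all outstanding work on the device
```

In most host-side code (e.g. copying results back with `Array(a)`), the stream-level sync is all that is needed.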
Looking into the scripts, I see you are using AMDGPU v0.8, which now follows the same "convention" as CUDA wrt using threads and blocks as kernel launch params;
`gridsize = blocks`
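Concretely, under the new convention `gridsize` counts workgroups (CUDA's "blocks") rather than total work items, so a launch might be sketched as follows (the kernel and sizes are illustrative, not taken from JACC):

```julia
using AMDGPU

# Simple elementwise kernel; 1-based indexing per the AMDGPU.jl convention.
function vadd!(c, a, b)
    i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

N = 1024
a = ROCArray(ones(Float32, N)); b = ROCArray(ones(Float32, N)); c = similar(a)

threads = 256
blocks  = cld(N, threads)   # number of workgroups, NOT total work items
@roc groupsize=threads gridsize=blocks vadd!(c, a, b)
```

With the old convention, `gridsize` would have been the total work-item count (`blocks * threads`), so code computing it that way needs updating.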
and thus https://github.com/JuliaORNL/JACC.jl/blob/041c271a8711cbf02fb9ce3f5d7043f60daade8c/ext/JACCAMDGPU/JACCAMDGPU.jl#L7-L9 should be updated accordingly.

Regarding the issue https://github.com/JuliaGPU/AMDGPU.jl/issues/614, it could be that using weakdeps (adding a Project.toml to `/test`) would solve it: since JACC is using extensions, one could make sure to rely not on conditional loading but truly on the extension mechanism.

Also, it looks like you are running AMDGPU CI on Julia 1.9. There used to be issues because of LLVM on Julia 1.9, so Julia 1.10 could globally be preferred (although depending on the GPU, 1.9 would work fine).