JuliaORNL / JACC.jl

CPU/GPU parallel performance portable layer in Julia via functions as arguments
MIT License

AMDGPU tweaks #63

Closed. luraess closed this 6 months ago

luraess commented 6 months ago

Looking into the scripts, I see you are using AMDGPU v0.8, which now follows the same "convention" as CUDA wrt using threads and blocks as kernel launch params: gridsize = blocks (the number of workgroups, not the total number of work items). Thus https://github.com/JuliaORNL/JACC.jl/blob/041c271a8711cbf02fb9ce3f5d7043f60daade8c/ext/JACCAMDGPU/JACCAMDGPU.jl#L7-L9 should be:

@roc groupsize = threads gridsize = blocks _parallel_for_amdgpu(f, x...) 
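For context, here is a minimal sketch of what the updated extension wrapper could look like under the v0.8 convention. The wrapper shape, the bounds guard, and the 512-thread default are illustrative assumptions, not the exact JACC source:

```julia
using AMDGPU

# Kernel: map the flat 1-based work-item index to a call of the user function.
function _parallel_for_amdgpu(N, f, x...)
    i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    i > N && return nothing   # guard the ragged last workgroup (assumption)
    f(i, x...)
    return nothing
end

function parallel_for(N::I, f::F, x...) where {I <: Integer, F <: Function}
    numThreads = 512              # illustrative default
    threads = min(N, numThreads)
    blocks = cld(N, threads)
    # AMDGPU v0.8+: gridsize is the number of workgroups (as in CUDA),
    # no longer the total number of work items (threads * blocks).
    @roc groupsize = threads gridsize = blocks _parallel_for_amdgpu(N, f, x...)
end
```

Usage would then be e.g. `parallel_for(length(a), (i, a) -> (a[i] += 1.0), a)` for a `ROCArray` `a`.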

Regarding the issue https://github.com/JuliaGPU/AMDGPU.jl/issues/614, using weakdeps (adding a Project.toml to /test) might solve it: since JACC uses extensions, one could make sure to rely not on conditional loading but truly on the extension mechanism.
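One way to set that up (a sketch; the exact package set is an assumption) is to give /test its own environment with AMDGPU as a direct test dependency, so that loading AMDGPU in the test suite triggers the JACCAMDGPU extension itself:

```julia
# Run once from the JACC.jl repo root; Pkg creates/extends test/Project.toml
# and records the proper UUIDs for you.
using Pkg
Pkg.activate("test")
Pkg.add(["AMDGPU", "Test"])
```

The tests then exercise the real extension-loading path rather than conditional loading.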

Also, it looks like you are running AMDGPU CI on Julia 1.9. There used to be issues with LLVM on Julia 1.9, so Julia 1.10 may be the better default overall (although, depending on the GPU, 1.9 can work fine).

pedrovalerolara commented 6 months ago

This is an important point, thank you @luraess !! We have to make a few modifications on the AMDGPU side to follow the "new" syntax. Apart from changing groupsize and gridsize, I saw that there are some changes to device synchronization. I am not sure whether kernel launches are still assumed to be synchronous by default; that is another thing to test before moving to the new version of AMDGPU.jl.
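One way to probe this (a hedged sketch; the kernel, sizes, and values are illustrative) is to launch a kernel and read results back only after an explicit synchronize, which stays correct whether or not launches are synchronous:

```julia
using AMDGPU

# Simple axpy kernel: y[i] += a * x[i]
function axpy!(y, x, a)
    i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

N = 2^20
x = ROCArray(ones(Float64, N))
y = ROCArray(zeros(Float64, N))
threads = 256
blocks = cld(N, threads)

@roc groupsize = threads gridsize = blocks axpy!(y, x, 2.0)
AMDGPU.synchronize()              # explicit stream sync before reading back
@assert Array(y) == fill(2.0, N)  # holds regardless of launch synchrony
```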

luraess commented 6 months ago

> We have to make a few modifications on the AMDGPU side to follow the "new" syntax.

Yes, and I suspect this may address part of the perf issue you reported wrt tuning kernel launch params.

> changes to device synchronization

The behaviour of AMDGPU wrt synchronisation should be fairly similar to CUDA. Unless specified otherwise, kernels are launched on the task-local default device and normal-priority stream. On the default device and default stream, kernel execution is ordered (and does not need explicit synchronisation). AMDGPU.synchronize(), as in CUDA, syncs the stream and not the device. A lower-level device sync is also available but is a heavier operation.
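A hedged sketch of the two levels (API names as in AMDGPU.jl v0.8; treat the exact calls as assumptions to verify):

```julia
using AMDGPU

a = ROCArray(rand(Float64, 1024))
b = a .* 2.0                 # enqueued on the task-local default stream
c = b .+ 1.0                 # ordered after the previous op on that stream

AMDGPU.synchronize()         # waits on the current stream only (as in CUDA)
AMDGPU.device_synchronize()  # heavier: waits for all streams on the device
```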