The message you see after login appears after loading the specific ROCm module combination that exposes 5.3.3 instead of 5.2. According to LUMI support there should not be any issue with this configuration, other than that it may not be fully tested by HPE. I don't think this relates in any way to the slowdown you are experiencing with atomic operations in AMDGPU.jl.
I will chime in on this issue. I have the same problem (tested on LUMI as well) with AMDGPU and atomics. If it helps, this is my MWE using ParallelStencil.jl instead of KernelAbstractions.jl:
using ParallelStencil, StaticArrays, Atomix, BenchmarkTools

const use_AMD = true

const TA = @static if use_AMD
    @init_parallel_stencil(AMDGPU, Float64, 2)
    macro myatomic(expr)
        return esc(quote
            AMDGPU.@atomic $expr
        end)
    end
    ParallelStencil.ROCArray
else
    @init_parallel_stencil(Threads, Float64, 2)
    macro myatomic(expr)
        return esc(quote
            Atomix.@atomic $expr
        end)
    end
    Array
end
function init_particles(nxcell, x, y, dx, dy, nx, ny)
    ncells = nx * ny
    np = nxcell * ncells
    # particles start at random offsets within their cell
    pcoords = TA([SA[(rand()*0.9 + 0.05)*dx, (rand()*0.9 + 0.05)*dy] for _ in 1:np])
    parent_cell = TA([(0, 0) for _ in 1:np])
    nx2, ny2 = length(x)-1, length(y)-1
    LinInd = LinearIndices((1:nx2, 1:ny2))

    @parallel_indices (i, j) function fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)
        k = LinInd[i, j]
        x0, y0 = x[i], y[j] # lower-left corner of the cell
        for l in (1 + nxcell * (k-1)):(nxcell * k)
            pcoords[l] = @SVector [x0 + pcoords[l][1], y0 + pcoords[l][2]]
            parent_cell[l] = (i, j)
        end
        return nothing
    end

    @parallel (1:nx, 1:ny) fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)

    return pcoords, parent_cell
end
function foo!(F, Fp, buffer, parent_cell)
    np = length(Fp)
    @parallel (1:np) _foo!(F, Fp, buffer, parent_cell)
    return nothing
end

@parallel_indices (ipart) function _foo!(F, Fp, buffer, parent_cell)
    inode, jnode = parent_cell[ipart]
    # iterate over the cells around the ipart-th node
    for joffset in 0:1
        jvertex = joffset + jnode
        for ioffset in 0:1
            ivertex = ioffset + inode
            @myatomic F[ivertex, jvertex] += ω_i = rand() # a weight function is actually called here
            @myatomic buffer[ivertex, jvertex] += Fp[ipart] * ω_i
        end
    end
    return nothing
end
let
    np = 24
    n = 128
    nx = ny = n-1
    ni = nx, ny
    Lx = Ly = 1.0
    xvi = xv, yv = range(0, Lx, length=n), range(0, Ly, length=n)
    dxi = dx, dy = xv[2] - xv[1], yv[2] - yv[1]
    pcoords, parent_cell = init_particles(np, xvi..., dx, dy, nx, ny)
    F = @zeros(ni.+1...)
    buffer = similar(F)
    Fp = @rand(length(pcoords))
    @btime foo!($(F, Fp, buffer, parent_cell)...)
    # AMDGPU with atomics:    412.973 ms (45 allocations: 2.80 KiB)
    # AMDGPU without atomics:  82.009 μs (45 allocations: 2.80 KiB)
end
I will note that you are using floating-point atomics while the OP uses integer atomics. There was a recommendation from LUMI to try out unsafe atomics.

How is this option added to the Julia call?
With @albert-de-montserrat we worked on an MWE comparing AMDGPU, ROCBackend(), and HIP C++ performance on the following micro-kernel, which may be somewhat representative of what happens in @albert-de-montserrat's more complete example 👆
function amd_atomic_add!(target1, target2, source, indices)
i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
v = source[i]
AMDGPU.@atomic target1[i1] += v
AMDGPU.@atomic target1[i2] += v
AMDGPU.@atomic target1[i3] += v
AMDGPU.@atomic target1[i4] += v
AMDGPU.@atomic target2[i1] += v
AMDGPU.@atomic target2[i2] += v
AMDGPU.@atomic target2[i3] += v
AMDGPU.@atomic target2[i4] += v
return
end
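For context, a minimal sketch of how this micro-kernel might be launched and timed from Julia; the array sizes, group size, and random index layout below are assumptions, not the exact harness from HPCBenchmarks.jl:

using AMDGPU, BenchmarkTools

n, groupsize = 1024^2, 256
target1 = AMDGPU.zeros(Float64, n)
target2 = AMDGPU.zeros(Float64, n)
source  = AMDGPU.rand(Float64, n)
indices = ROCArray(rand(1:n, n, 4))  # random gather pattern (assumption)

# Note: the meaning of gridsize (workitems vs. workgroups) has differed across
# AMDGPU.jl versions; adjust to the version you are running.
@btime AMDGPU.@sync @roc groupsize=$groupsize gridsize=$(cld(n, groupsize)) amd_atomic_add!($target1, $target2, $source, $indices)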
The HIP C++ version would be:
__global__ void atomic_kernel(DAT *_target1, DAT *_target2, DAT *_source, int *_indices, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int i1 = indices(i, 0);
    int i2 = indices(i, 1);
    int i3 = indices(i, 2);
    int i4 = indices(i, 3);
    DAT v = source(i);
    atomicAdd(&target1(i1), v);
    atomicAdd(&target1(i2), v);
    atomicAdd(&target1(i3), v);
    atomicAdd(&target1(i4), v);
    atomicAdd(&target2(i1), v);
    atomicAdd(&target2(i2), v);
    atomicAdd(&target2(i3), v);
    atomicAdd(&target2(i4), v);
}
Running these codes on LUMI (MI250x, ROCm 5.3.3) and a Radeon 7800 XT (Navi 3, ROCm 5.6.1) gives the following timings:
Radeon 7800 XT
"julia" => Trial(43.370 μs)
"reference" => Trial(38.450 μs)
"julia-ka" => Trial(45.190 μs)
On LUMI
"julia" => Trial(345.640 μs)
"reference" => Trial(49.636 μs)
"julia-ka" => Trial(350.620 μs)
On LUMI with -xhip -munsafe-fp-atomics (for the C++ code only)
"julia" => Trial(351.731 μs)
"reference" => Trial(12.635 μs)
"julia-ka" => Trial(357.743 μs)
Similar behaviour is observed for Float32:
Radeon 7800 XT
"julia" => Trial(23.400 μs)
"reference" => Trial(18.530 μs)
"julia-ka" => Trial(25.320 μs)
LUMI
"julia" => Trial(39.395 μs)
"reference" => Trial(12.714 μs)
"julia-ka" => Trial(42.992 μs)
LUMI with -xhip -munsafe-fp-atomics
"julia" => Trial(39.085 μs)
"reference" => Trial(12.624 μs)
"julia-ka" => Trial(42.531 μs)
Similar behaviour is observed for UInt32.
Files can be found at https://github.com/PTsolvers/HPCBenchmarks.jl/pull/1 (a CUDA version will be added soon).
Okay, so on LUMI the C++ code is already doing something quite different from Julia, even without unsafe-fp-atomics. Can you get the LLVM IR from hipcc on LUMI with something like -emit-llvm -S --offload-device-only? Also, @device_code dir="ir" for Julia on LUMI would be helpful.
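For reference, a rough sketch of the Julia side of that request (a guess at usage, reusing the arrays from the launch sketch above; @device_code comes from GPUCompiler.jl and is re-exported by AMDGPU.jl):

using AMDGPU

# Dump all IR forms for every kernel compiled inside the block into ./ir
AMDGPU.@device_code dir="ir" begin
    @roc groupsize=256 gridsize=cld(length(source), 256) amd_atomic_add!(target1, target2, source, indices)
end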
Getting the following warnings and a compilation failure when adding the suggested flags to https://github.com/PTsolvers/HPCBenchmarks.jl/blob/8b8488231a0d8279233d52866f549d2302e81b82/AMDGPU/atomic.jl#L91
clang-15: warning: -lgcc_s: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lgcc: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lpthread: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lm: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lrt: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,--enable-new-dtags: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,-rpath=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/hip/lib:/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/lib: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lamdhip64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lclang_rt.builtins-x86_64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: argument unused during compilation: '--shared' [-Wunused-command-line-argument]
That should be okay. I think you might also need -c, since you don't want to link.
The --offload-device-only flag still doesn't work for me:
clang-14: error: unsupported option '--offload-device-only'
The @device_code output for Julia is here: https://gist.github.com/albert-de-montserrat/ab4a66b1e5a1b547673041e4c5f190e2
Still getting the error

ERROR: LoadError: could not load library "./atomic.so"
./atomic.so: invalid ELF header

in

run(`hipcc -emit-llvm -S --offload-device-only -c -xhip -munsafe-fp-atomics -O3 -o $libname --shared -fPIC atomic.cu`)
Libdl.dlopen("./$libname") do lib
You don't want --shared -fPIC, since your -o should be a .ll: hipcc -xhip -munsafe-fp-atomics -O3 -o out.ll atomic.cu
So we are just emitting

atomicrmw fadd double addrspace(1)* %73, double %51 seq_cst, align 8, !dbg !374

So the issue might be that we emit seq_cst instead of acq_rel.
If I remove it and change the -o from .so to .ll, I am then getting
ld.lld: error: undefined symbol: main
>>> referenced by start.S:104 (../sysdeps/x86_64/start.S:104)
>>> /usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64/crt1.o:(_start)
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
ERROR: LoadError: failed process: Process(`hipcc -xhip -munsafe-fp-atomics -O3 -o atomic.ll atomic.cu`, ProcessExited(1)) [1]
The "main" function of the external cpp code being extern "C" EXPORT_API void run_benchmark(){...}
But now you are missing -emit-llvm -S --offload-device-only -c?
Ok, got it now, I was confused. The generated out.ll is attached: out.ll.zip
Can you also send me the one without unsafe-fp-atomics? But this is already promising:
%41 = load i32, i32 addrspace(1)* %23, align 4, !tbaa !11, !amdgpu.noclobber !5
%42 = load i32, i32 addrspace(1)* %27, align 4, !tbaa !11, !amdgpu.noclobber !5
%43 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %37, double %29) #4
%44 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %40, double %29) #4
Here it is without unsafe atomics: out_nounsafe.ll.zip
Oh wow, that is even more interesting.
%41 = atomicrmw fadd double addrspace(1)* %36, double %29 syncscope("agent-one-as") monotonic, align 8
%42 = atomicrmw fadd double addrspace(1)* %38, double %29 syncscope("agent-one-as") monotonic, align 8
That is easier. It means AMDGPU.jl is emitting atomics with a stronger ordering than needed. Try:
using Atomix: @atomic, @atomicswap, @atomicreplace

function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    @atomic :monotonic target1[i1] += v
    @atomic :monotonic target1[i2] += v
    @atomic :monotonic target1[i3] += v
    @atomic :monotonic target1[i4] += v
    @atomic :monotonic target2[i1] += v
    @atomic :monotonic target2[i2] += v
    @atomic :monotonic target2[i3] += v
    @atomic :monotonic target2[i4] += v
    return
end
The only thing we don't handle there is syncscope, so we might need to add that.
(Atomix is the device-portable interface for atomic array accesses, which is why I recommend it for KA.)
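For anyone unfamiliar with it, a minimal host-side sketch of the Atomix interface (the ordering argument mirrors the Base.@atomic syntax used in the kernel above):

using Atomix

A = zeros(Float64, 8)
Atomix.@atomic A[1] += 1.0              # defaults to sequentially consistent
Atomix.@atomic :monotonic A[1] += 1.0   # relaxed (monotonic) ordering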
This is consistent with Base.@atomic:
help?> @atomic
@atomic var
@atomic order ex
Mark var or ex as being performed atomically, if ex is a supported expression. If no
order is specified it defaults to :sequentially_consistent.
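For example, in plain Julia (nothing GPU-specific), the ordering is an optional first argument:

mutable struct Counter
    @atomic n::Int
end

c = Counter(0)
@atomic c.n += 1             # defaults to :sequentially_consistent
@atomic :monotonic c.n += 1  # explicitly relaxed ordering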
AMDGPU uses Atomix.jl under the hood.
The unsafe-fp-atomics flag may be something to explore, but I would bet that the biggest issue is the different default choice of atomic ordering.
Thanks - will check the benchmarks and report
Adding the :monotonic keyword to the @atomic macro in Atomix (used in the KA kernel) and in AMDGPU gives the following results (NOT using unsafe-fp-atomics in the C++ reference):
On LUMI (Float64)
"julia" => Trial(53.444 μs)
"reference" => Trial(49.606 μs)
"julia-ka" => Trial(58.644 μs)
No effect/change on the Radeon 7800 XT.
Is there a way to have this selected automatically for some use cases, or to make it the default? Any better strategy than the current one?
I was thinking about this, but not really. The default is correct and needs to be consistent across architectures. The docs could probably be improved to call this out.
Fundamentally, the problem here is that we get lowered to a cmpxchg loop, so we might also need to look at syncscope.
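For illustration, a host-side Julia sketch of the kind of compare-and-swap loop a floating-point atomic add gets lowered to when no native instruction is used (an analogy only; on the GPU the compiler emits the corresponding LLVM cmpxchg loop, and each retry carries the chosen ordering and syncscope):

function cas_fadd!(x::Threads.Atomic{Float64}, v::Float64)
    old = x[]
    while true
        # atomic_cas! returns the value actually seen in memory;
        # the swap succeeded iff it is bit-identical to `old`.
        seen = Threads.atomic_cas!(x, old, old + v)
        seen === old && return old
        old = seen  # another thread won the race; retry with the fresh value
    end
end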
Thanks @vchuravy ! This also speeds up my MWE by 400 times :)
Thanks for the effort. My code speeds up, but it is still 3 times slower than on an A100.
Indeed, I also see a difference in the benchmarks that @luraess shared when running on a P100 (on Piz Daint):
"julia" => Trial(17.575 μs)
"reference" => Trial(9.815 μs)
"julia-ka" => Trial(25.806 μs)
What are the timings without the flag?
Those are without the unsafe flag; I don't know what the equivalent flag for CUDA is. I get the same timings with and without :monotonic here.
Sorry could we keep the conversation focused here? If there is an issue with atomic operations in CUDA.jl please open an issue there.
I assume that:

My code speeds up, but is yet 3 times slower than on an A100.

means there is a performance difference between the A100 and the MI250X? It is hard to tell what the cause of this is from afar. The first thing I would try is to write the code in HIP and make sure that the performance difference is apparent there as well; if it isn't, then we can investigate further.
I think with #604 we can close this. Feel free to re-open though, if needed.
I have noticed that with KernelAbstractions the use of @atomic is very slow on AMDGPU compared to CUDA.

I have a test example at https://github.com/CliMA/CGDycore.jl/tree/main/TestAMD with results for a kernel with @atomic and the same kernel with the @atomic macros removed. You will find timing results in InfoCUDAAMD. There is an additional data file with the global indices for the atomic gather operation, which is input for the testExample. Before running the code you have to select the right backend by commenting out the other one.

All tests are done on the LUMI supercomputer (https://lumi-supercomputer.eu/).

After login you get the following message:

rocm/5.3.3: This module is not installed the way it is intended by AMD. Hence some parts of ROCm may be broken or offer slower than expected performance, and there is nothing the LUMI User Support Team can do about that. As the module is also not an official HPE Cray module, there is no guarantee that it will integrate properly with HPE Cray PE modules or work correctly in that environment. Note that HPE Cray PE programming environments up to at least 23.02 have only been tested with ROCm versions prior to 5.3 by HPE.