JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia

@atomic is slow within AMDGPU.jl #569

Closed OsKnoth closed 8 months ago

OsKnoth commented 11 months ago

I have noticed that with KernelAbstractions the use of @atomic is very slow on AMDGPU compared to CUDA.

I have a test example at https://github.com/CliMA/CGDycore.jl/tree/main/TestAMD with results for a kernel with @atomic and the same kernel with the @atomic macros removed. You will find timing results in InfoCUDAAMD. There is an additional data file with the global indices for the atomic gather operation, which is the input for the test example. Before running the code you have to select the right backend (by commenting the other one out).
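For readers without access to the repository, the pattern in question is roughly the following (a minimal sketch of the atomic gather, not the actual CGDycore.jl kernel; the array names are hypothetical, with glob standing in for the global-index data mentioned above):

using KernelAbstractions, Atomix

# Each work-item adds its contribution Fp[ip] into a destination chosen by a
# precomputed global index glob[ip]; concurrent writes to the same entry need @atomic.
@kernel function gather_add!(F, Fp, glob)
    ip = @index(Global)
    Atomix.@atomic F[glob[ip]] += Fp[ip]
end

# launch, e.g.: gather_add!(backend, 256)(F, Fp, glob; ndrange = length(Fp))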

All tests were done on the LUMI supercomputer (https://lumi-supercomputer.eu/)

After login you get the following message: rocm/5.3.3: This module is not installed the way it is intended by AMD. Hence some parts of ROCm may be broken or offer slower than expected performance and there is nothing the LUMI User Support Team can do about that. As the module is also not an official HPE Cray module there is no guarantee that it will integrate properly with HPE Cray PE modules or work correctly in that environment. Note that HPE Cray PE programming environments up to at least 23.02 have only been tested with ROCm version prior to 5.3 by HPE.

luraess commented 11 months ago

After login you get the following message:

You get this message after loading the specific ROCm module combination that exposes 5.3.3 instead of 5.2. According to LUMI support, there should not be any issue with this config, other than it not being fully tested by HPE. I don't think this relates in any way to the slowdown you are experiencing with atomic operations in AMDGPU.jl.

albert-de-montserrat commented 11 months ago

I will chip in on this issue. I have the same problem (also tested on LUMI) with AMDGPU and atomics. If it helps, this is my MWE using ParallelStencil.jl instead of KernelAbstractions.jl:

using ParallelStencil, StaticArrays, Atomix, BenchmarkTools
const use_AMD = true

const TA = @static if use_AMD 
    @init_parallel_stencil(AMDGPU, Float64, 2)

    macro myatomic(expr)
        return esc(quote
            AMDGPU.@atomic $expr
        end)
    end

    ParallelStencil.ROCArray
else
    @init_parallel_stencil(Threads, Float64, 2)

    macro myatomic(expr)
        return esc(quote
            Atomix.@atomic $expr
        end)
    end

    Array
end

function init_particles(nxcell, x, y, dx, dy, nx, ny)
    ncells = nx * ny
    np = nxcell * ncells
    pcoords = TA([SA[(rand()*0.9 + 0.05)*dx, (rand()*0.9 + 0.05)*dy] for _ in 1:np])
    parent_cell = TA([(0, 0) for _ in 1:np])

    nx2, ny2 = length(x)-1, length(y)-1
    LinInd = LinearIndices((1:nx2, 1:ny2))

    @parallel_indices (i, j) function fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)
        k = LinInd[i, j] 
        x0, y0 = x[i], y[j] # lower-left corner of the cell
        for l in (1 + nxcell * (k-1)):(nxcell * k)
            pcoords[l] = @SVector [x0 + pcoords[l][1], y0 + pcoords[l][2]]
            parent_cell[l] = (i, j)
        end
        return nothing
    end

    @parallel (1:nx, 1:ny) fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)

    return pcoords, parent_cell
end

function foo!(F, Fp, buffer, parent_cell)
    np = length(Fp)
    @parallel (1:np) _foo!(F, Fp, buffer, parent_cell)
    return nothing
end

@parallel_indices (ipart) function _foo!(F, Fp, buffer, parent_cell)
    inode, jnode = parent_cell[ipart]
    # iterate over cells around ipart-th node
    for joffset in 0:1
        jvertex = joffset + jnode
        for ioffset in 0:1
            ivertex = ioffset + inode
            @myatomic F[ivertex, jvertex] += ω_i = rand() # a weight function is actually called here
            @myatomic buffer[ivertex, jvertex] += Fp[ipart] * ω_i 
        end
    end

    return nothing
end

let 
    np = 24
    n = 128
    nx = ny = n-1
    ni = nx , ny
    Lx = Ly = 1.0
    xvi = xv, yv = range(0, Lx, length=n), range(0, Ly, length=n)
    dxi = dx, dy = xv[2] - xv[1], yv[2] - yv[1]

    pcoords, parent_cell = init_particles(np, xvi..., dx, dy, nx, ny)

    F = @zeros(ni.+1...)
    buffer = similar(F)
    Fp = @rand(length(pcoords))

    @btime foo!($(F, Fp, buffer, parent_cell)...)
    # AMDGPU with atomics:    412.973 ms (45 allocations: 2.80 KiB)
    # AMDGPU without atomics:  82.009 μs (45 allocations: 2.80 KiB)
end
vchuravy commented 11 months ago

I will note that you are using floating-point atomics while the OP uses integer atomics. There was a recommendation from LUMI to try out unsafe atomics.

E.g. https://reviews.llvm.org/D91546?id=305522
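(For context, -munsafe-fp-atomics is a hipcc/clang compiler flag rather than a Julia option; a sketch of how it is passed when building the C++ reference as a shared library, assuming a standalone atomic.cu as used later in this thread:)

# Build the HIP reference with relaxed FP atomics enabled (compiler flag, not a Julia option):
run(`hipcc -xhip -munsafe-fp-atomics -O3 --shared -fPIC -o atomic.so atomic.cu`)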

OsKnoth commented 11 months ago

How is this option added to the Julia call?

luraess commented 11 months ago

Together with @albert-de-montserrat we worked on an MWE comparing AMDGPU.jl, KernelAbstractions' ROCBackend() and HIP C++ performance on the following micro-kernel, which may be somewhat representative of what happens in @albert-de-montserrat's more complete example above 👆

function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    AMDGPU.@atomic target1[i1] += v
    AMDGPU.@atomic target1[i2] += v
    AMDGPU.@atomic target1[i3] += v
    AMDGPU.@atomic target1[i4] += v
    AMDGPU.@atomic target2[i1] += v
    AMDGPU.@atomic target2[i2] += v
    AMDGPU.@atomic target2[i3] += v
    AMDGPU.@atomic target2[i4] += v
    return
end
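The "julia-ka" timings below come from a KernelAbstractions port of the same kernel; a minimal sketch of what that variant looks like (using Atomix with its default ordering, mirroring the AMDGPU.jl version above):

using KernelAbstractions, Atomix

@kernel function ka_atomic_add!(target1, target2, source, indices)
    i = @index(Global, Linear)
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    Atomix.@atomic target1[i1] += v
    Atomix.@atomic target1[i2] += v
    Atomix.@atomic target1[i3] += v
    Atomix.@atomic target1[i4] += v
    Atomix.@atomic target2[i1] += v
    Atomix.@atomic target2[i2] += v
    Atomix.@atomic target2[i3] += v
    Atomix.@atomic target2[i4] += v
end

# launch, e.g.: ka_atomic_add!(ROCBackend(), 256)(target1, target2, source, indices; ndrange = length(source))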

the HIP C++ version would be

__global__ void atomic_kernel(DAT *_target1, DAT *_target2, DAT *_source, int *_indices, const int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int i1 = indices(i, 0);
  int i2 = indices(i, 1);
  int i3 = indices(i, 2);
  int i4 = indices(i, 3);
  DAT v = source(i);
  atomicAdd(&target1(i1), v);
  atomicAdd(&target1(i2), v);
  atomicAdd(&target1(i3), v);
  atomicAdd(&target1(i4), v);
  atomicAdd(&target2(i1), v);
  atomicAdd(&target2(i2), v);
  atomicAdd(&target2(i3), v);
  atomicAdd(&target2(i4), v);
}

Running these codes on LUMI (MI250X, ROCm 5.3.3) and a Radeon 7800 XT (Navi 3, ROCm 5.6.1) gives the following timings:

Float64

Radeon 7800 XT

  "julia" => Trial(43.370 μs)
  "reference" => Trial(38.450 μs)
  "julia-ka" => Trial(45.190 μs)

On LUMI

  "julia" => Trial(345.640 μs)
  "reference" => Trial(49.636 μs)
  "julia-ka" => Trial(350.620 μs)

On LUMI with -xhip -munsafe-fp-atomics (for C++ code only)

  "julia" => Trial(351.731 μs)
  "reference" => Trial(12.635 μs)
  "julia-ka" => Trial(357.743 μs)

Similar behaviour is observed for Float32

UInt64

Radeon 7800

  "julia" => Trial(23.400 μs)
  "reference" => Trial(18.530 μs)
  "julia-ka" => Trial(25.320 μs)

LUMI

  "julia" => Trial(39.395 μs)
  "reference" => Trial(12.714 μs)
  "julia-ka" => Trial(42.992 μs)

LUMI with -xhip -munsafe-fp-atomics

  "julia" => Trial(39.085 μs)
  "reference" => Trial(12.624 μs)
  "julia-ka" => Trial(42.531 μs)

Similar behaviour is observed for UInt32

Files can be found here https://github.com/PTsolvers/HPCBenchmarks.jl/pull/1 (CUDA version added soon)

vchuravy commented 11 months ago

Okay, so on LUMI the C++ code is already doing something quite different from Julia, even without unsafe-fp-atomics.

Can you get the LLVM IR from hipcc on LUMI with something like -emit-llvm -S --offload-device-only? Also @device_code dir="ir" for Julia on LUMI would be helpful.
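On the Julia side, something along these lines should dump the device IR into an ir/ directory, using the amd_atomic_add! kernel from above (a sketch; the sizes and launch configuration are assumptions, and the meaning of gridsize has changed between AMDGPU.jl versions):

using AMDGPU

n = 2^16
target1 = AMDGPU.zeros(Float64, n)
target2 = AMDGPU.zeros(Float64, n)
source  = AMDGPU.rand(Float64, n)
indices = ROCArray(rand(1:n, n, 4))

groupsize = 256
gridsize  = cld(n, groupsize)   # assuming gridsize counts workgroups here

# Write the device LLVM IR for the compiled kernel into ./ir/ while launching it.
AMDGPU.@device_code dir="ir" @roc groupsize=groupsize gridsize=gridsize amd_atomic_add!(
    target1, target2, source, indices)
AMDGPU.synchronize()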

luraess commented 11 months ago

Okay so on LUMI the C++ code is already doing something quite different from Julia even without unsafe-fp-atomics

Can you get the LLVM IR from hipcc on LUMI with something like -emit-llvm -S --offload-device-only? Also @device_code dir="ir" for Julia on LUMI would be helpful.

Getting the following warnings and a compilation failure when adding the suggested flags to https://github.com/PTsolvers/HPCBenchmarks.jl/blob/8b8488231a0d8279233d52866f549d2302e81b82/AMDGPU/atomic.jl#L91

clang-15: warning: -lgcc_s: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lgcc: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lpthread: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lm: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lrt: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,--enable-new-dtags: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,-rpath=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/hip/lib:/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/lib: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lamdhip64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lclang_rt.builtins-x86_64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: argument unused during compilation: '--shared' [-Wunused-command-line-argument]
vchuravy commented 11 months ago

That should be okay. I think you might also need -c since you don't want to link.

albert-de-montserrat commented 11 months ago

The --offload-device-only flag still doesn't work for me

clang-14: error: unsupported option '--offload-device-only'

The @device_code for Julia is here https://gist.github.com/albert-de-montserrat/ab4a66b1e5a1b547673041e4c5f190e2

luraess commented 11 months ago

Still getting the error

ERROR: LoadError: could not load library "./atomic.so"
./atomic.so: invalid ELF header

in

run(`hipcc -emit-llvm -S --offload-device-only -c  -xhip -munsafe-fp-atomics -O3 -o $libname --shared -fPIC atomic.cu`)

Libdl.dlopen("./$libname") do lib
vchuravy commented 11 months ago

You don't want --shared -fPIC since your -o should be a .ll

vchuravy commented 11 months ago

So we are just emitting atomicrmw fadd double addrspace(1)* %73, double %51 seq_cst, align 8, !dbg !374

So the issue might be that we emit seq_cst instead of acq_rel

luraess commented 11 months ago

You don't want --shared -fPIC since your -o should be a .ll, hipcc -xhip -munsafe-fp-atomics -O3 -o out.ll atomic.cu

If I remove it and change the -o from .so to .ll I am then getting

ld.lld: error: undefined symbol: main
>>> referenced by start.S:104 (../sysdeps/x86_64/start.S:104)
>>>               /usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64/crt1.o:(_start)
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
ERROR: LoadError: failed process: Process(`hipcc -xhip -munsafe-fp-atomics -O3 -o atomic.ll atomic.cu`, ProcessExited(1)) [1]

The "main" function of the external cpp code being extern "C" EXPORT_API void run_benchmark(){...}

vchuravy commented 11 months ago

But now you are missing -emit-llvm -S --offload-device-only -c?

luraess commented 11 months ago

But now you are missing -emit-llvm -S --offload-device-only -c?

Ok, got it now. Was confused. The generated out.ll is out.ll.zip

vchuravy commented 11 months ago

Can you also send me the one without unsafe-fp-atomics? But this is already promising:

  %41 = load i32, i32 addrspace(1)* %23, align 4, !tbaa !11, !amdgpu.noclobber !5
  %42 = load i32, i32 addrspace(1)* %27, align 4, !tbaa !11, !amdgpu.noclobber !5
  %43 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %37, double %29) #4
  %44 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %40, double %29) #4
luraess commented 11 months ago

Here without unsafe atomics out_nounsafe.ll.zip

vchuravy commented 11 months ago

Oh wow that is even more interesting.

  %41 = atomicrmw fadd double addrspace(1)* %36, double %29 syncscope("agent-one-as") monotonic, align 8
  %42 = atomicrmw fadd double addrspace(1)* %38, double %29 syncscope("agent-one-as") monotonic, align 8

That is easier. That means AMDGPU.jl is emitting atomics with a stronger ordering than needed.

Try:

using Atomix: @atomic, @atomicswap, @atomicreplace

function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    @atomic :monotonic target1[i1] += v
    @atomic :monotonic target1[i2] += v
    @atomic :monotonic target1[i3] += v
    @atomic :monotonic target1[i4] += v
    @atomic :monotonic target2[i1] += v
    @atomic :monotonic target2[i2] += v
    @atomic :monotonic target2[i3] += v
    @atomic :monotonic target2[i4] += v
    return
end

The only thing we don't handle there is syncscope, so we might need to add that. (Atomix is the device-portable interface for atomic array accesses, which is why I recommend it for KA.)

vchuravy commented 11 months ago

This is consistent with Base.@atomic:

help?> @atomic
  @atomic var
  @atomic order ex

  Mark var or ex as being performed atomically, if ex is a supported expression. If no
  order is specified it defaults to :sequentially_consistent.

AMDGPU uses Atomix.jl under the hood.

The unsafe-fp-atomics flag may be something to explore, but I would bet that the biggest issue is the different default choice of atomic ordering.
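For illustration, the same default-versus-explicit-ordering behaviour can be seen with Base's field atomics on the CPU (a minimal sketch; the Counter struct is purely illustrative):

mutable struct Counter
    @atomic x::Float64
end

c = Counter(0.0)
@atomic c.x += 1.0              # no order given: defaults to :sequentially_consistent
@atomic :monotonic c.x += 1.0   # relaxed ordering, analogous to the HIP reference above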

luraess commented 11 months ago

Thanks - will check the benchmarks and report

luraess commented 11 months ago

Adding the :monotonic keyword to the @atomic macro in Atomix (used in the KA kernel) and in AMDGPU gives the following results (NOT using unsafe-fp-atomics in the C++ reference):

On LUMI (Float64)

  "julia" => Trial(53.444 μs)
  "reference" => Trial(49.606 μs)
  "julia-ka" => Trial(58.644 μs)

No effect/change on the Radeon 7800XT

luraess commented 11 months ago

Is there a way to have this selected automatically for some use cases, or to make it the default? Any better strategy than the current one?

vchuravy commented 11 months ago

Is there a way to have this selected automatically for some use cases, or to make it the default? Any better strategy than the current one?

I was thinking about this, but not really. The default is correct and needs to be consistent across architectures. The docs could probably be improved to call this out.

Fundamentally the problem here is that we get lowered to a cmpxchg loop and so we might need to look at syncscope as well.
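To make the last point concrete: when no native atomic instruction matches the requested operation and ordering, the += gets lowered to a compare-and-swap retry loop, conceptually like the following sketch written against Atomix's @atomicreplace (the helper name is made up for illustration):

using Atomix: @atomicreplace

# Conceptual retry loop behind a floating-point atomic `+=` when it cannot map to a
# single hardware instruction: read, compute, and attempt the swap until no other
# thread has modified the slot in between.
function atomic_add_via_cas!(A, i, v)
    old = A[i]
    while true
        got, success = @atomicreplace A[i] old => (old + v)
        success && return old + v   # swap succeeded
        old = got                   # another thread got there first; retry with the new value
    end
end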

albert-de-montserrat commented 11 months ago

Thanks @vchuravy ! This also speeds up my MWE by 400 times :)

OsKnoth commented 11 months ago

Thanks for the effort. My code speeds up, but it is still 3 times slower than on an A100.

albert-de-montserrat commented 11 months ago

Indeed, I also see a difference in the benchmarks that @luraess shared when running on a P100 (on Piz Daint):

  "julia" => Trial(17.575 μs)
  "reference" => Trial(9.815 μs)
  "julia-ka" => Trial(25.806 μs)
luraess commented 11 months ago

Indeed, I also see a difference in the benchmarks that @luraess shared when running on a P100 (on Piz Daint):

  "julia" => Trial(17.575 μs)
  "reference" => Trial(9.815 μs)
  "julia-ka" => Trial(25.806 μs)

What are the timings without the flag?

albert-de-montserrat commented 11 months ago

Those are without the unsafe flag; I don't know what the equivalent flag is for CUDA. I get the same timings with and without :monotonic here.

vchuravy commented 11 months ago

Sorry could we keep the conversation focused here? If there is an issue with atomic operations in CUDA.jl please open an issue there.

I assume that:

My code speeds up, but it is still 3 times slower than on an A100.

Means there is a performance difference between the A100 and the MI250X? It is hard to tell what the cause of this is from afar. The first thing I would try is to write the code in HIP and make sure the performance difference is apparent there as well; if it isn't, then we can investigate further.

pxl-th commented 8 months ago

I think with #604 we can close this. Feel free to re-open though, if needed.