joelandman opened this issue 5 months ago
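Simple reproducer; I'm not sure whether this specific use case is supposed to be supported. CPU and GPU versions are shown for comparison. MI300X GPU, Ubuntu 22.04, ROCm 6.1 pre-release. The a_h and z_h results are as expected, and a_d and b_d are set properly, but on the MI300X the broadcast subtraction fails.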
Worth noting that this works on an MI50 and on the integrated GPU of a 7950X.
MI50
julia> using AMDGPU
julia> AMDGPU.devices()
┌────┬────────────────────┬────────────────────────┬───────────┬────────────┐
│ Id │ Name               │ GCN arch               │ Wavefront │ Memory     │
├────┼────────────────────┼────────────────────────┼───────────┼────────────┤
│  1 │ AMD Radeon VII     │ gfx906:sramecc+:xnack- │        64 │ 15.984 GiB │
│  2 │ AMD Radeon RX 6600 │ gfx1032                │        32 │  7.984 GiB │
└────┴────────────────────┴────────────────────────┴───────────┴────────────┘
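Two devices are listed, and AMDGPU.jl runs on device 1 by default. A minimal sketch for switching to the second device, assuming the device!-style selection API from the AMDGPU.jl docs:

```julia
using AMDGPU

# Make the RX 6600 (Id 2 in the table above) the current device for this task.
AMDGPU.device!(AMDGPU.devices()[2])
AMDGPU.device()   # confirm which device is now active
```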
julia> # CPU version
a_h = rand(Float16,5,5)
5×5 Matrix{Float16}:
0.1758 0.2559 0.8525 0.0625 0.987
0.0957 0.4429 0.949 0.593 0.4824
0.46 0.945 0.9917 0.738 0.010254
0.779 0.7344 0.9824 0.544 0.0332
0.503 0.977 0.31 0.3086 0.523
julia> z_h = a_h .- Float16(0.5)
5×5 Matrix{Float16}:
-0.3242 -0.2441 0.3525 -0.4375 0.4868
-0.4043 -0.05713 0.4492 0.0928 -0.01758
-0.04004 0.4448 0.4917 0.2378 -0.4897
0.2788 0.2344 0.4824 0.04395 -0.4668
0.00293 0.477 -0.19 -0.1914 0.02295
julia> # GPU version 1
a_d = ROCMatrix(rand(Float16,5,5))
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.3027 0.502 0.3276 0.0796 0.456
0.1606 0.4282 0.1875 0.816 0.2573
0.5347 0.8003 0.5215 0.103 0.0908
0.7695 0.8228 0.802 0.8037 0.187
0.475 0.1553 0.608 0.8735 0.25
julia> z_d = a_d .- Float16(0.5)
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
-0.1973 0.001953 -0.1724 -0.4204 -0.04395
-0.3394 -0.0718 -0.3125 0.316 -0.2427
0.03467 0.3003 0.02148 -0.397 -0.4092
0.2695 0.3228 0.3018 0.3037 -0.313
-0.0249 -0.3447 0.1079 0.3735 -0.25
julia> # GPU version 2
b_d = AMDGPU.rand(Float16,5,5)
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.674 0.4595 0.624 0.0912 0.821
0.02998 0.4895 0.02676 0.385 0.4805
0.522 0.978 0.4788 0.684 0.8164
0.1853 0.9688 0.39 0.3337 0.5186
0.00983 0.3857 0.4546 0.846 0.3872
julia> y_d = b_d .- Float16(0.5)
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.1738 -0.04053 0.124 -0.4087 0.3208
-0.47 -0.0105 -0.4731 -0.115 -0.01953
0.02197 0.478 -0.02124 0.1841 0.3164
-0.3147 0.4688 -0.1101 -0.1663 0.01855
-0.4902 -0.11426 -0.0454 0.3462 -0.1128
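As a sanity check, a device result can be copied back to the host and compared against the CPU computation. A minimal sketch using only the API already shown above (0.5 is exactly representable in Float16, so the two results should match exactly):

```julia
using AMDGPU

a_h = rand(Float16, 5, 5)      # host input
a_d = ROCMatrix(a_h)           # same data on the device
z_h = a_h .- Float16(0.5)      # CPU broadcast
z_d = a_d .- Float16(0.5)      # GPU broadcast
@assert Array(z_d) == z_h      # copy back and compare elementwise
```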
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8* (2024-03-01 10:14 UTC)
Build Info:
  Note: This is an unofficial build, please report bugs to the project
  responsible for this build and not to the Julia project unless you can
  reproduce the issue using official builds available at https://julialang.org/downloads
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver1)
Threads: 8 default, 0 interactive, 4 GC (on 32 virtual cores)
Environment:
  LD_LIBRARY_PATH = :/opt/rocm-6.1.0-13294/lib:/nvme/home/joe/local/lib
7950X
julia> using AMDGPU
julia> AMDGPU.devices()
┌────┬─────────────────────┬──────────┬───────────┬───────────┐
│ Id │ Name                │ GCN arch │ Wavefront │ Memory    │
├────┼─────────────────────┼──────────┼───────────┼───────────┤
│  1 │ AMD Radeon Graphics │ gfx1030  │        32 │ 8.000 GiB │
└────┴─────────────────────┴──────────┴───────────┴───────────┘
julia> a_h = rand(Float16,5,5)
5×5 Matrix{Float16}:
0.2427 0.2471 0.9004 0.56 0.273
0.5806 0.3276 0.943 0.5425 0.4692
0.267 0.1074 0.5127 0.543 0.418
0.708 0.8306 0.273 0.2222 0.929
0.9204 0.5894 0.561 0.09766 0.1562
julia> z_h = a_h .- Float16(0.5)
5×5 Matrix{Float16}:
-0.2573 -0.253 0.4004 0.06006 -0.227
0.08057 -0.1724 0.4429 0.04248 -0.03076
-0.2329 -0.3926 0.012695 0.04297 -0.08203
0.208 0.3306 -0.227 -0.2778 0.4292
0.4204 0.08936 0.06104 -0.4023 -0.3438
julia> a_d = ROCMatrix(rand(Float16,5,5))
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.6113 0.4038 0.931 0.2935 0.8135
0.02002 0.994 0.3389 0.249 0.508
0.1992 0.5254 0.963 0.4 0.749
0.844 0.709 0.1333 0.3687 0.9595
0.1138 0.4258 0.2104 0.735 0.294
julia> z_d = a_d .- Float16(0.5)
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.1113 -0.0962 0.4312 -0.2065 0.3135
-0.48 0.4941 -0.1611 -0.251 0.007812
-0.3008 0.02539 0.463 -0.1001 0.249
0.3442 0.209 -0.3667 -0.1313 0.4595
-0.3862 -0.0742 -0.2896 0.2349 -0.206
julia> # GPU version 2
b_d = AMDGPU.rand(Float16,5,5)
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.7783 0.3125 0.989 0.4648 0.1595
0.7236 0.7017 0.8687 0.3203 0.914
0.962 0.72 0.03864 0.386 0.156
0.1991 0.754 0.69 0.517 0.9272
0.5283 0.822 0.859 0.2283 0.7993
julia> y_d = b_d .- Float16(0.5)
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.2783 -0.1875 0.4888 -0.03516 -0.3403
0.2236 0.2017 0.3687 -0.1797 0.414
0.462 0.2202 -0.4614 -0.114 -0.344
-0.3008 0.254 0.19 0.01709 0.4272
0.02832 0.3218 0.359 -0.2717 0.2993
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8* (2024-03-01 10:14 UTC)
Build Info:
  Note: This is an unofficial build, please report bugs to the project
  responsible for this build and not to the Julia project unless you can
  reproduce the issue using official builds available at https://julialang.org/downloads
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 7950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 8 default, 0 interactive, 4 GC (on 32 virtual cores)
Environment:
  LD_LIBRARY_PATH = :/usr/local/cuda-12.3/lib64:/nvme/home/joe/local/lib
  JULIA_HOME = /nvme/home/joe/local
Are we missing something needed to support gfx942, @pxl-th?
Note: gfx942 is new and not widely available, so I didn't expect everything to work. I'm happy to work on this with you though.
Probably because of Julia 1.10's LLVM version, which is 15, while gfx942 was only officially added in LLVM 17, IIUC: https://github.com/llvm/llvm-project/commit/9d0572797233857397f3fdc35fffcfb490354f56
You can try the Julia 1.11 early release (which has LLVM 16), but I haven't tested it with AMD GPUs at all yet. In the worst case, we'd have to wait for LLVM 17 to arrive in Julia, which is this PR: https://github.com/JuliaLang/julia/pull/53070
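For reference, the bundled LLVM version can be checked directly from the REPL; Base.libllvm_version is part of Base:

```julia
julia> Base.libllvm_version   # gfx942 needs LLVM >= 17
v"15.0.7"
```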
Julia 1.11:
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-beta1 (2024-04-10)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using AMDGPU
Precompiling AMDGPU...
Info Given AMDGPU was explicitly requested, output will be shown live
ERROR: LoadError: UndefVarError: `CodeCache` not defined in `GPUCompiler`
Stacktrace:
[1] getproperty(x::Module, f::Symbol)
@ Base ./Base.jl:42
[2] top-level scope
@ ~/.julia/packages/AMDGPU/gtxsf/src/AMDGPU.jl:75
[3] include
@ ./Base.jl:558 [inlined]
[4] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::Nothing)
@ Base ./loading.jl:2721
[5] top-level scope
@ stdin:4
in expression starting at ~/.julia/packages/AMDGPU/gtxsf/src/AMDGPU.jl:1
in expression starting at stdin:4
✗ AMDGPU
0 dependencies successfully precompiled in 5 seconds. 108 already precompiled.
ERROR: The following 1 direct dependency failed to precompile:
AMDGPU
Failed to precompile AMDGPU [21141c5a-9bdb-4563-92ae-f87d6854732e] to "~/.julia/compiled/v1.11/AMDGPU/jl_hqPvGn".
ERROR: LoadError: UndefVarError: `CodeCache` not defined in `GPUCompiler`
Stacktrace:
[1] getproperty(x::Module, f::Symbol)
@ Base ./Base.jl:42
[2] top-level scope
@ ~/.julia/packages/AMDGPU/gtxsf/src/AMDGPU.jl:75
[3] include
@ ./Base.jl:558 [inlined]
[4] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::Nothing)
@ Base ./loading.jl:2721
[5] top-level scope
@ stdin:4
in expression starting at ~/.julia/packages/AMDGPU/gtxsf/src/AMDGPU.jl:1
in expression starting at stdin:4
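An UndefVarError for a GPUCompiler internal like CodeCache typically means the resolved AMDGPU/GPUCompiler versions don't match the Julia release. A plausible first step (a sketch only; the right versions depend on the AMDGPU.jl compat entries) is to update the pair and inspect what resolves:

```julia
using Pkg

Pkg.update(["AMDGPU", "GPUCompiler"])   # pull releases compatible with this Julia
Pkg.status(["AMDGPU", "GPUCompiler"])   # check which versions actually resolved
```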
AMDGPU 0.9 now supports Julia 1.11 and maybe the MI300X. Just make sure to launch Julia with the JULIA_LLVM_ARGS="-opaque-pointers" environment variable set, so that it uses the system-wide ROCm device libraries instead of our patched ones.
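Concretely, the variable must be set in the environment before Julia starts, for example:

```
$ JULIA_LLVM_ARGS="-opaque-pointers" julia
```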
Just got a similar issue to the original post with Julia 1.11.0-beta2, ROCm 6.1.2, and AMDGPU 0.9.5, both with and without setting JULIA_LLVM_ARGS="-opaque-pointers".
julia> a_d = ROCMatrix(rand(Float16,5,5))
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.644 0.2002 0.208 0.4048 0.6567
0.774 0.4253 0.667 0.03662 0.1997
0.7725 0.6445 0.95 0.2876 0.715
0.2764 0.4453 0.6836 0.4277 0.1118
0.02197 0.5454 0.3564 0.354 0.8027
julia> z_d = a_d .- Float16(0.5)
'gfx942' is not a recognized processor for this target (ignoring processor)
'gfx942' is not a recognized processor for this target (ignoring processor)
...
'gfx942' is not a recognized processor for this target (ignoring processor)
'gfx942' is not a recognized processor for this target (ignoring processor)
warning: sramecc 'On' was requested for a processor that does not support it!
ERROR: InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#35#37")(::AMDGPU.ROCKernelContext, ::AMDGPU.Device.ROCDeviceMatrix{…}, ::Base.Broadcast.Broadcasted{…}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to pointerset(ptr::Core.LLVMPtr{T, A}, x::T, i::I, ::Val{align}) where {T, A, I, align} @ LLVM.Interop none:0)
Stacktrace:
[1] unsafe_store! (repeats 3 times)
@ /workspace/packages/LLVM/6cDbl/src/interop/pointer.jl:88
[2] malloc_hc
@ /workspace/packages/AMDGPU/OUSjX/src/device/runtime.jl:98
[3] malloc
@ /workspace/packages/AMDGPU/OUSjX/src/device/gcn/memory_dynamic.jl:12
[4] malloc
@ /workspace/packages/GPUCompiler/nWT2N/src/runtime.jl:88
[5] macro expansion
@ /workspace/packages/GPUCompiler/nWT2N/src/runtime.jl:183
[6] macro expansion
@ ./none:0
[7] box
@ ./none:0
[8] box_uint64
@ /workspace/packages/GPUCompiler/nWT2N/src/runtime.jl:212
[9] multiple call sites
@ unknown:0
...
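To narrow down whether the broadcast machinery or basic gfx942 codegen is at fault, a hand-written kernel doing the same elementwise subtraction could be tried. A sketch, assuming AMDGPU.jl's standard indexing intrinsics and @roc launch macro (with gridsize taken as the number of workgroups; check the docs for your version):

```julia
using AMDGPU

# Same elementwise op as the failing broadcast, written as an explicit kernel,
# to separate broadcast-machinery problems from basic gfx942 codegen problems.
function sub_half!(z, a)
    i = workitemIdx().x + (workgroupIdx().x - Int32(1)) * workgroupDim().x
    if i <= length(z)
        @inbounds z[i] = a[i] - Float16(0.5)
    end
    return
end

a_d = ROCArray(rand(Float16, 25))
z_d = similar(a_d)
@roc groupsize=32 gridsize=cld(length(a_d), 32) sub_half!(z_d, a_d)
AMDGPU.synchronize()
Array(z_d)   # copy back to inspect the result
```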
I have been testing on Runpod and built a Julia 1.11-rc AMD ROCm template you can use to deploy an MI300X. I am happy to help with any debugging as well.
We then need Julia 1.12, which has LLVM 17 (1.11 has LLVM 16). I haven't tested it yet, as 1.11 itself is still in beta, but I can take a look shortly.
I just built Julia from source (and added version 17 to the compatible versions of LLD_jll and LLVM_jll for AMDGPU), but got the same issue:
# ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.12.0-DEV.706 (2024-06-11)
 _/ |\__'_|_|_|\__'_|  |  Commit e7893a1fa4 (0 days old master)
|__/                   |
julia> versioninfo()
Julia Version 1.12.0-DEV.706
Commit e7893a1fa4 (2024-06-11 09:53 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 × AMD EPYC 9474F 48-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)
Environment:
  JULIA_DEPOT_PATH = /root/
julia> using AMDGPU
julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬─────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                    │
├───────────┼──────────────────┼───────────┼─────────────────────────────────────────────────────────────────────────┤
│ +         │ LLD              │ -         │ /opt/rocm/llvm/bin/ld.lld                                               │
│ +         │ Device Libraries │ -         │ /root/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode │
│ +         │ HIP              │ 6.1.40093 │ /opt/rocm/lib/libamdhip64.so                                            │
│ +         │ rocBLAS          │ 4.1.2     │ /opt/rocm/lib/librocblas.so.4                                           │
│ +         │ rocSOLVER        │ 3.25.0    │ /opt/rocm/lib/librocsolver.so.0                                         │
│ +         │ rocALUTION       │ -         │ /opt/rocm/lib/librocalution.so.1                                        │
│ +         │ rocSPARSE        │ -         │ /opt/rocm/lib/librocsparse.so.1                                         │
│ +         │ rocRAND          │ 2.10.5    │ /opt/rocm/lib/librocrand.so.1                                           │
│ +         │ rocFFT           │ 1.0.27    │ /opt/rocm/lib/librocfft.so.0                                            │
│ +         │ MIOpen           │ 3.1.0     │ /opt/rocm/lib/libMIOpen.so.1                                            │
└───────────┴──────────────────┴───────────┴─────────────────────────────────────────────────────────────────────────┘
[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬─────────────┐
│ Id │ Name                │ GCN arch               │ Wavefront │ Memory      │
├────┼─────────────────────┼────────────────────────┼───────────┼─────────────┤
│  1 │ AMD Instinct MI300X │ gfx942:sramecc+:xnack- │        64 │ 191.984 GiB │
└────┴─────────────────────┴────────────────────────┴───────────┴─────────────┘
julia> a_d = ROCMatrix(rand(Float16,5,5))
5×5 ROCArray{Float16, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
0.5596 0.292 0.8354 0.3677 0.641
0.1567 0.978 0.4614 0.2144 0.717
0.4023 0.8706 0.9004 0.9033 0.2319
0.3042 0.3652 0.48 0.02197 0.1309
0.7817 0.1909 0.4595 0.3193 0.846
julia> z_d = a_d .- Float16(0.5)
ERROR: InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#35#37")(::AMDGPU.ROCKernelContext, ::AMDGPU.Device.ROCDeviceMatrix{…}, ::Base.Broadcast.Broadcasted{…}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to pointerset(ptr::Core.LLVMPtr{T, A}, x::T, i::I, ::Val{align}) where {T, A, I, align} @ LLVM.Interop none:0)
Stacktrace:
[1] unsafe_store! (repeats 3 times)
@ ~/packages/LLVM/6cDbl/src/interop/pointer.jl:88
...
Notably, the "'gfx942' is not a recognized processor for this target (ignoring processor)" messages are gone now.
AMDGPU.jl needs to account for changes in Julia 1.12; I haven't done that yet.
Can you give an indication of what needs to be done? I can't promise anything, but I may or may not have a chance to look into this (if it doesn't take too long :smiling_face_with_tear:)