maleadt opened 1 year ago
Does Metal support register spilling? What happens when you exceed the total number of registers available (on CUDA that's 255 per thread, IIRC)?
Actually, this isn't a launch failure, it's a compilation failure, so it has nothing to do with the launch configuration. It also means there's a hard limit on how many registers a kernel can use; however, there's no way to query either that limit or the number of registers a kernel uses.
So I guess we can't do anything about this...
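For comparison, this is queryable on the CUDA side. A sketch (the `vadd` kernel and array sizes are my own illustration; requires CUDA hardware, so untested here) of compiling a kernel without launching it and asking the driver for its per-thread register usage via CUDA.jl:

```julia
using CUDA

function vadd(a, b, c)
    i = threadIdx().x
    @inbounds c[i] = a[i] + b[i]
    return
end

a, b, c = (CUDA.zeros(Float32, 32) for _ in 1:3)

# Compile without launching, then introspect the compiled kernel.
k = @cuda launch=false vadd(a, b, c)
CUDA.registers(k)  # per-thread register count, as reported by the driver
```

Nothing equivalent appears to be exposed through the Metal APIs.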
I think it can spill though. Dummy kernel:
using Metal

function kernel(a::AbstractArray{<:NTuple{N, T}}) where {N, T}
    i = thread_position_in_grid_1d()
    @inbounds begin
        # load a large tuple
        x = a[i]
        # force all of the tuple to be available
        s = zero(T)
        for j in 1:N
            s += x[j]
        end
        y = let s = s
            ntuple(j -> x[j] + s, Val(N))
        end
        # write back out
        a[i] = y
    end
    return
end

function main(N=1)
    x = MtlArray{NTuple{N, Int}}(undef, 1)
    @metal threads=length(x) kernel(x)
end
If I have it load a large amount of data (a tuple of 1000 elements), the generated code starts with:
0: f2051004 get_sr r1.cache, sr80 (thread_position_in_grid.x)
4: 62f9000000000030 mov_imm r126, 0
c: 9e07c28610840100 imadd r1_r2.cache, r1.discard.sx, u4l, u2
14: 62fd000000000030 mov_imm r127, 0
1c: 0e09c46218000000 iadd r2, r2.discard, u3
24: 0529020500c8f200 device_load 0, i32, xyzw, r5_r6_r7_r8, r1_r2, 0, signed, lsl 2
2c: 0549124500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 1, signed, lsl 2
34: b500c1052a80000f stack_store i8, 1, 2, 0, 4012, 0
3c: 3801 wait 1
3e: b54a00050cc0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15296, 0
46: 0549224500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 2, signed, lsl 2
4e: 3801 wait 1
50: b54a00050bc0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15280, 0
58: 0549324500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 3, signed, lsl 2
60: 3801 wait 1
62: b54a00050ac0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15264, 0
6a: 0549424500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 4, signed, lsl 2
72: 3801 wait 1
74: b54a000509c0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15248, 0
7c: 0549524500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 5, signed, lsl 2
84: 3801 wait 1
...
i.e. loading device memory into registers, and spilling it immediately after. Computing the sum and storing the resulting tuple then consists of a sequence of:
82c: 352a000506c0f205 stack_load r5_r6_r7_r8, i32, 1, 0, xyzw, 4, 1376, 0
834: 3800 wait 0
836: 8e1986c22c000000 iadd r6.cache, r3.cache, r6.discard
83e: 8e0546022d000000 iadd r1.cache, r3, r8.discard
846: 8e1584a22c000000 iadd r5.cache, r2.cache, r5.discard
84e: 8e1d84e22c000000 iadd r7.cache, r2.cache, r7.discard
856: f2211004 get_sr r8.cache, sr80 (thread_position_in_grid.x)
85a: 9e23d08610000000 imadd r8_r9.cache, r8.discard.sx, u4l, 0
862: 8e2986212d000000 iadd r10.cache, u3, r9.discard
86a: 8e5584012d200000 iadd r85.cache, u2, r8.discard
872: 92104a4228010130 icmpsel ult, r4l.cache, r5, r2.cache, 1, 0
87a: 92a06a42180101300008 icmpsel ult, r8l.cache, r85, u2, 1, 0
884: 92024e4224010130 icmpsel ult, r0h.cache, r7, r2, 1, 0
88c: 0e59d0402d200000 iadd r86, r8l.discard, r10.discard
894: 0e19c8c02c000000 iadd r6, r4l.discard, r6.discard
89c: 0e21c1202c000000 iadd r8, r0h.discard, r1.discard
8a4: 45290a05a0c8f200 device_store 0, i32, xyzw, r5_r6_r7_r8, r85_r86, 0, signed, lsl 2, 0
@ChrisRackauckas @utkarsh530 Do either of you remember where exactly this happened, and how to reproduce? I'd like to investigate, but that's hard without a MWE.
This came from the kernel-generating methods of DiffEqGPU.
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, Metal

function lorenz2(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz2, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
            trajectories = 10_000,
            saveat = 1.0f0)
That's the small case, which should work; if you bump it up to one of the bigger ODEs it should hit this. @utkarsh530 do you remember which ODE hit it?
I'm getting Float64 values with that MWE:
julia> sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
trajectories = 10_000,
saveat = 1.0f0)
ERROR: InvalidIRError: compiling MethodInstance for DiffEqGPU.gpu_ode_asolve_kernel(::KernelAbstractions.CompilerMetadata{…}, ::MtlDeviceVector{…}, ::GPUTsit5, ::MtlDeviceMatrix{…}, ::MtlDeviceMatrix{…}, ::Float32, ::CallbackSet{…}, ::Nothing, ::Float32, ::Float32, ::StepRangeLen{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported use of double value
Reason: unsupported use of double value
Reason: unsupported use of double value
Stacktrace:
[1] Float64
@ ./float.jl:159
[2] convert
@ ./number.jl:7
[3] _promote
@ ./promotion.jl:370
[4] promote
@ ./promotion.jl:393
[5] *
@ ./promotion.jl:423
[6] unsafe_getindex
@ ./range.jl:963
[7] getindex
@ ./range.jl:956
[8] macro expansion
@ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87
[9] gpu_ode_asolve_kernel
@ ~/.julia/packages/KernelAbstractions/zPAn3/src/macros.jl:95
[10] gpu_ode_asolve_kernel
@ ./none:0
@ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87
Is that to be trusted? Because saveat = 1.0f0 and tspan = (0.0f0, 10.0f0), and the line it points at is if tspan[1] == saveat[1], which seems to involve only Float32 values.
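One likely culprit (an assumption on my part, based on the unsafe_getindex frame in the stacktrace) is that Base stores Float32 ranges with Float64 internals, so indexing the saveat range goes through double-precision arithmetic even though every user-visible value is Float32:

```julia
# Constructing a range from Float32 endpoints yields a StepRangeLen whose
# internal reference and step are stored as Float64 for accuracy;
# `getindex` computes the element in Float64, then rounds back to Float32.
r = 0.0f0:1.0f0:10.0f0
typeof(r)  # StepRangeLen{Float32, Float64, Float64, Int64}
eltype(r)  # Float32 -- but indexing goes through Float64 arithmetic
```

If so, the Float64 use would come from the range machinery itself, not from the user's values.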
This might be related to https://github.com/SciML/DiffEqGPU.jl/pull/317.
Any workaround?
As seen on DiffEqGPU.jl:
It's interesting because, IIUC, the dynamic workgroup-size selection there should have used maxTotalThreadsPerThreadgroup, which (at least for CUDA's equivalent query) takes register usage into account. Maybe there are additional limits we need to respect with Metal?
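For reference, a sketch of how that limit can be queried from Metal.jl (the `vadd` kernel and array sizes are my own; requires an Apple GPU, so untested here): compile the kernel without launching it, then read the pipeline's maxTotalThreadsPerThreadgroup. Whether that value already reflects register pressure on Metal is exactly the open question above.

```julia
using Metal

function vadd(a, b, c)
    i = thread_position_in_grid_1d()
    @inbounds c[i] = a[i] + b[i]
    return
end

a = MtlArray(ones(Float32, 4096))
b = MtlArray(ones(Float32, 4096))
c = similar(a)

# Compile without launching, then size the workgroup to what the
# compiled pipeline actually supports.
kernel = @metal launch=false vadd(a, b, c)
threads = min(length(c), kernel.pipeline.maxTotalThreadsPerThreadgroup)
kernel(a, b, c; threads, groups = cld(length(c), threads))
```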