maleadt opened 1 year ago
Does Metal support register spilling? What happens when you exceed the total number of registers available (on CUDA that's 255 per thread, IIRC)?
Actually, this isn't a launch failure, it's a compilation failure, so it has nothing to do with the launch configuration. It also means there's a hard limit on how many registers a kernel can use; however, there's no way to query either that limit or the number of registers a kernel uses.
So I guess we can't do anything about this...
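For comparison, this is queryable on the CUDA side. A sketch (the `vadd` kernel and array sizes are my own illustration; requires CUDA hardware, so untested here) of compiling a kernel without launching it and asking the driver for its per-thread register usage via CUDA.jl:

```julia
using CUDA

function vadd(a, b, c)
    i = threadIdx().x
    @inbounds c[i] = a[i] + b[i]
    return
end

a, b, c = (CUDA.zeros(Float32, 32) for _ in 1:3)

# Compile without launching, then introspect the compiled kernel.
k = @cuda launch=false vadd(a, b, c)
CUDA.registers(k)  # per-thread register count, as reported by the driver
```

Nothing equivalent appears to be exposed through the Metal APIs.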
I think it can spill though. Dummy kernel:
using Metal

function kernel(a::AbstractArray{<:NTuple{N, T}}) where {N, T}
    i = thread_position_in_grid_1d()
    @inbounds begin
        # load a large tuple
        x = a[i]
        # force all of the tuple to be available
        s = zero(T)
        for j in 1:N
            s += x[j]
        end
        y = let s = s
            ntuple(j -> x[j] + s, Val(N))
        end
        # write back out
        a[i] = y
    end
    return
end

function main(N=1)
    x = MtlArray{NTuple{N, Int}}(undef, 1)
    @metal threads=length(x) kernel(x)
end
If I have it load a large amount of data (a tuple of 1000 elements), the generated code starts with:
0: f2051004 get_sr r1.cache, sr80 (thread_position_in_grid.x)
4: 62f9000000000030 mov_imm r126, 0
c: 9e07c28610840100 imadd r1_r2.cache, r1.discard.sx, u4l, u2
14: 62fd000000000030 mov_imm r127, 0
1c: 0e09c46218000000 iadd r2, r2.discard, u3
24: 0529020500c8f200 device_load 0, i32, xyzw, r5_r6_r7_r8, r1_r2, 0, signed, lsl 2
2c: 0549124500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 1, signed, lsl 2
34: b500c1052a80000f stack_store i8, 1, 2, 0, 4012, 0
3c: 3801 wait 1
3e: b54a00050cc0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15296, 0
46: 0549224500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 2, signed, lsl 2
4e: 3801 wait 1
50: b54a00050bc0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15280, 0
58: 0549324500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 3, signed, lsl 2
60: 3801 wait 1
62: b54a00050ac0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15264, 0
6a: 0549424500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 4, signed, lsl 2
72: 3801 wait 1
74: b54a000509c0f23b stack_store i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15248, 0
7c: 0549524500c8f200 device_load 1, i32, xyzw, r9_r10_r11_r12, r1_r2, 5, signed, lsl 2
84: 3801 wait 1
...
i.e. loading device memory into registers, and spilling it immediately after. Computing the sum and storing the resulting tuple then consists of a sequence of:
82c: 352a000506c0f205 stack_load r5_r6_r7_r8, i32, 1, 0, xyzw, 4, 1376, 0
834: 3800 wait 0
836: 8e1986c22c000000 iadd r6.cache, r3.cache, r6.discard
83e: 8e0546022d000000 iadd r1.cache, r3, r8.discard
846: 8e1584a22c000000 iadd r5.cache, r2.cache, r5.discard
84e: 8e1d84e22c000000 iadd r7.cache, r2.cache, r7.discard
856: f2211004 get_sr r8.cache, sr80 (thread_position_in_grid.x)
85a: 9e23d08610000000 imadd r8_r9.cache, r8.discard.sx, u4l, 0
862: 8e2986212d000000 iadd r10.cache, u3, r9.discard
86a: 8e5584012d200000 iadd r85.cache, u2, r8.discard
872: 92104a4228010130 icmpsel ult, r4l.cache, r5, r2.cache, 1, 0
87a: 92a06a42180101300008 icmpsel ult, r8l.cache, r85, u2, 1, 0
884: 92024e4224010130 icmpsel ult, r0h.cache, r7, r2, 1, 0
88c: 0e59d0402d200000 iadd r86, r8l.discard, r10.discard
894: 0e19c8c02c000000 iadd r6, r4l.discard, r6.discard
89c: 0e21c1202c000000 iadd r8, r0h.discard, r1.discard
8a4: 45290a05a0c8f200 device_store 0, i32, xyzw, r5_r6_r7_r8, r85_r86, 0, signed, lsl 2, 0
@ChrisRackauckas @utkarsh530 Do either of you remember where exactly this happened, and how to reproduce? I'd like to investigate, but that's hard without a MWE.
This came from the kernel-generating methods of DiffEqGPU.
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, Metal

function lorenz2(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz2, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
            trajectories = 10_000,
            saveat = 1.0f0)
That's the small case, which should work; if you bump it up to one of the bigger ODEs it should hit this. @utkarsh530 do you remember which ODE hit it?
I'm getting Float64 values with that MWE:
julia> sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
trajectories = 10_000,
saveat = 1.0f0)
ERROR: InvalidIRError: compiling MethodInstance for DiffEqGPU.gpu_ode_asolve_kernel(::KernelAbstractions.CompilerMetadata{…}, ::MtlDeviceVector{…}, ::GPUTsit5, ::MtlDeviceMatrix{…}, ::MtlDeviceMatrix{…}, ::Float32, ::CallbackSet{…}, ::Nothing, ::Float32, ::Float32, ::StepRangeLen{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported use of double value
Reason: unsupported use of double value
Reason: unsupported use of double value
Stacktrace:
[1] Float64
@ ./float.jl:159
[2] convert
@ ./number.jl:7
[3] _promote
@ ./promotion.jl:370
[4] promote
@ ./promotion.jl:393
[5] *
@ ./promotion.jl:423
[6] unsafe_getindex
@ ./range.jl:963
[7] getindex
@ ./range.jl:956
[8] macro expansion
@ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87
[9] gpu_ode_asolve_kernel
@ ~/.julia/packages/KernelAbstractions/zPAn3/src/macros.jl:95
[10] gpu_ode_asolve_kernel
@ ./none:0
@ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87
Is that to be trusted? Because saveat = 1.0f0 and tspan = (0.0f0, 10.0f0), and the line it points at is if tspan[1] == saveat[1], which seems to involve only Float32 values.
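One likely culprit (an assumption on my part, based on the unsafe_getindex frame in the stacktrace) is that Base stores Float32 ranges with Float64 internals, so indexing the saveat range goes through double-precision arithmetic even though every user-visible value is Float32:

```julia
# Constructing a range from Float32 endpoints yields a StepRangeLen whose
# internal reference and step are stored as Float64 for accuracy;
# `getindex` computes the element in Float64, then rounds back to Float32.
r = 0.0f0:1.0f0:10.0f0
typeof(r)  # StepRangeLen{Float32, Float64, Float64, Int64}
eltype(r)  # Float32 -- but indexing goes through Float64 arithmetic
```

If so, the Float64 use would come from the range machinery itself, not from the user's values.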
This might be related to https://github.com/SciML/DiffEqGPU.jl/pull/317.
Any workaround?
As seen on DiffEqGPU.jl:
It's interesting because, IIUC, the dynamic workgroup-size selection there should have used maxTotalThreadsPerThreadgroup, which (at least for CUDA's equivalent query) takes register usage into account. Maybe there are additional limits we need to respect with Metal?
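For reference, a sketch of how that limit can be queried from Metal.jl (the `vadd` kernel and array sizes are my own; requires an Apple GPU, so untested here): compile the kernel without launching it, then read the pipeline's maxTotalThreadsPerThreadgroup. Whether that value already reflects register pressure on Metal is exactly the open question above.

```julia
using Metal

function vadd(a, b, c)
    i = thread_position_in_grid_1d()
    @inbounds c[i] = a[i] + b[i]
    return
end

a = MtlArray(ones(Float32, 4096))
b = MtlArray(ones(Float32, 4096))
c = similar(a)

# Compile without launching, then size the workgroup to what the
# compiled pipeline actually supports.
kernel = @metal launch=false vadd(a, b, c)
threads = min(length(c), kernel.pipeline.maxTotalThreadsPerThreadgroup)
kernel(a, b, c; threads, groups = cld(length(c), threads))
```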