maleadt opened this issue 3 years ago
I think I came across this while working on https://discourse.julialang.org/t/a-demo-is-1-5x-faster-in-flux-than-tensorflow-both-use-cpu-while-3-0x-slower-during-using-cuda/66703/. Most Flux optimisers (e.g. ADAM) use a set of fused broadcasts, and it seems compiling these can take a significant amount of time.
Do you have a reproducer without all of Flux (or without having to use Flux at run time)? I.e., is there a specific kernel that compiles excessively slow, or is it just that we're compiling a lot of kernels somewhat slowly?
I believe it's the latter, specifically fused broadcast kernels:
using CUDA
function adam_combined!(Δ, mt, vt, β, ϵ, η)
    @. mt = β[1] * mt + (1 - β[1]) * Δ
    @. vt = β[2] * vt + (1 - β[2]) * Δ^2
    @. Δ = mt / (1 - β[1]) / (√(vt / (1 - β[2])) + ϵ) * η
    return mt, vt
end
# Ensure broadcast machinery is already compiled
let a = CUDA.rand(10), b = CUDA.rand(10)
    @. a = a * b / a + b
end
let Δ = CUDA.rand(10, 10), mt = CUDA.zeros(10, 10), vt = CUDA.zeros(10, 10), β = (0.9, 0.999)
    CUDA.GPUCompiler.enable_timings()
    @time CUDA.@sync adam_combined!(Δ, mt, vt, β, Float32(1e-8), 0.001f0)
    CUDA.GPUCompiler.timings()
end
Which outputs:
3.119189 seconds (8.26 M allocations: 472.070 MiB, 4.08% gc time)
─────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 18446042150s / 0.00% 3.81GiB / 0.09%
Section ncalls time %tot avg alloc %tot avg
─────────────────────────────────────────────────────────────────────────────
LLVM middle-end 3 567ms 76.7% 189ms 764KiB 22.4% 255KiB
optimization 3 486ms 65.7% 162ms 19.8KiB 0.58% 6.61KiB
target libraries 3 73.2ms 9.89% 24.4ms 711KiB 20.8% 237KiB
clean-up 3 5.42ms 0.73% 1.81ms 2.91KiB 0.09% 992B
runtime library 3 2.92ms 0.39% 973μs 720B 0.02% 240B
IR generation 3 172ms 23.3% 57.5ms 2.59MiB 77.6% 884KiB
rewrite 3 89.6ms 12.1% 29.9ms 426KiB 12.5% 142KiB
emission 3 14.5ms 1.96% 4.85ms 1.71MiB 51.3% 584KiB
clean-up 3 25.6μs 0.00% 8.54μs 0.00B 0.00% 0.00B
─────────────────────────────────────────────────────────────────────────────
I'm not sure why there's such a large gap between the total runtime and the time tracked by GPUCompiler. The cuModuleLoadDataEx -> cuMemHostAlloc -> cuMemcpyHtoDAsync -> broadcast kernel sequence takes ~100ms, so at best that leaves another 1.5s+ unaccounted for.
I'm working on capturing a flamegraph for the host-side operations, will update with that shortly. Edit: see https://gist.github.com/ToucheSir/43077d99e2318004c175cd5ba1346ea7.
Lots of inference time, it seems?
That, and the third broadcast spends the bulk of its time running the LLVM pass pipeline? I guess I'm confused by why both would be so much slower for the GPU path compared to the CPU one.
CPU compilation uses Julia's native code and inference caches, which are already "hot" from the sysimage and precompile data. GPU compilation through GPUCompiler essentially starts with empty caches, requiring reinferring and recompiling "the world" (every function called or callable during GPU kernel execution).
Is that for every new kernel or just the first one? I tried running a broadcasted operation before profiling in the hopes of having as much as possible "warm", but maybe that wasn't enough?
I also just tried turning each broadcast into its own kernel (with synchronization in between) and that runs in less time (1.5s vs 2.4s) than the broadcasted version. Does compiling fused broadcasts also add significant overhead somewhere?
Does compiling fused broadcasts also add significant overhead somewhere?
Inference of broadcast is significantly harder.
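To illustrate the difference: fusing an expression into one broadcast builds a single, deeply nested Broadcasted type that inference has to work through in one go, while separate broadcasts each produce a shallow type. A CPU-only sketch (the exact type parameters elided with "…" in the comments):

```julia
using Base.Broadcast: broadcasted

a = rand(Float32, 4); b = rand(Float32, 4)

# Unfused: a flat argument tuple of plain arrays.
flat = broadcasted(+, a, b)

# Fused (what @. produces): arguments are themselves Broadcasted objects,
# so the type nests one level deeper per fused operation.
fused = broadcasted(+, broadcasted(*, a, broadcasted(/, b, 2f0)), 1f0)

println(typeof(flat))   # Broadcasted{…, typeof(+), Tuple{Vector{Float32}, Vector{Float32}}}
println(typeof(fused))  # Broadcasted nested three levels deep
```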
Is that for every new kernel or just the first one?
Every initial compilation of a new function (e.g. a broadcast with a new closure) will require some inference.
You can try measuring after clearing some of the caches: empty!(CUDA.cufunction_cache(context())) to drop the compiled CUDA objects (which will require recompilation at the LLVM level), and empty!(CUDA.ci_cache) to additionally drop the inference cache (and thus require inference).
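Putting those together, a sketch of the experiment (these names are CUDA.jl internals and may differ between versions):

```julia
using CUDA

dst, src = CUDA.zeros(10), CUDA.rand(10)
@time CUDA.@sync @. dst = src * 2f0       # first call: full compilation

empty!(CUDA.cufunction_cache(context()))  # drop compiled CUDA objects
@time CUDA.@sync @. dst = src * 2f0       # recompiles at the LLVM level

empty!(CUDA.cufunction_cache(context()))
empty!(CUDA.ci_cache)                     # additionally drop the inference cache
@time CUDA.@sync @. dst = src * 2f0       # reinfers and recompiles
```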
There may be an additional source of latency: initial compilation of the Julia compiler itself. Since we use a custom AbstractInterpreter, it's possible a lot of code there is recompiled for the first kernel, but not for subsequent compilations (i.e. wiping the above caches shouldn't affect that).
Thanks for the explanation. For Flux, I guess that leaves us with either adding dispatches to native kernels/KernelAbstractions kernels on GPU to ameliorate the overhead of multiple broadcasts, or waiting for as-yet-unknown upstream compiler improvements to smooth things out.
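For reference, folding the three fused broadcasts into a single hand-written kernel might look roughly like this. This is only a sketch, not Flux's actual implementation; adam_kernel! is a made-up name:

```julia
using CUDA

function adam_kernel!(Δ, mt, vt, β1, β2, ϵ, η)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(Δ)
        @inbounds begin
            mt[i] = β1 * mt[i] + (1 - β1) * Δ[i]
            vt[i] = β2 * vt[i] + (1 - β2) * Δ[i]^2
            Δ[i]  = mt[i] / (1 - β1) / (sqrt(vt[i] / (1 - β2)) + ϵ) * η
        end
    end
    return
end

Δ, mt, vt = CUDA.rand(100), CUDA.zeros(100), CUDA.zeros(100)
@cuda threads=256 blocks=cld(length(Δ), 256) adam_kernel!(Δ, mt, vt, 0.9f0, 0.999f0, 1f-8, 1f-3)
```

One kernel means one inference/compilation pass instead of three, at the cost of maintaining the kernel by hand.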
FWIW, GPU compilation of simple kernels is pretty OK:
julia> GPUCompiler.enable_timings()
julia> @benchmark ptx_code_execution(identity, Tuple{Nothing})
BenchmarkTools.Trial: 2027 samples with 1 evaluation.
Range (min … max): 2.279 ms … 175.390 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.363 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.462 ms ± 3.856 ms ┊ GC (mean ± σ): 0.13% ± 1.65%
█
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
2.35 ms                  Histogram: frequency by time                  3.38 ms <
Memory estimate: 54.94 KiB, allocs estimate: 760.
julia> GPUCompiler.timings()
────────────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 85.9s / 8.75% 2.11GiB / 6.11%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
LLVM middle-end 4.11k 3.75s 49.9% 912μs 14.3MiB 10.9% 3.56KiB
optimization 4.11k 3.63s 48.3% 884μs 10.9MiB 8.28% 2.72KiB
clean-up 4.11k 87.9ms 1.17% 21.4μs 1.00MiB 0.76% 256B
target libraries 4.11k 2.50ms 0.03% 608ns 0.00B 0.00% 0.00B
LLVM back-end 4.11k 2.55s 33.9% 620μs 7.06MiB 5.36% 1.76KiB
machine-code generation 4.11k 2.45s 32.6% 596μs 5.96MiB 4.52% 1.48KiB
preparation 4.11k 89.3ms 1.19% 21.7μs 1.10MiB 0.83% 280B
IR generation 4.11k 1.08s 14.4% 263μs 99.2MiB 75.3% 24.7KiB
rewrite 4.11k 625ms 8.31% 152μs 74.2MiB 56.3% 18.5KiB
hide unreachable 4.11k 23.4ms 0.31% 5.69μs 5.77MiB 4.38% 1.44KiB
predecessors 4.11k 4.83ms 0.06% 1.18μs 706KiB 0.52% 176B
find 4.11k 1.87ms 0.02% 456ns 642KiB 0.48% 160B
replace 4.11k 375μs 0.00% 91.4ns 0.00B 0.00% 0.00B
lower throw 4.11k 9.25ms 0.12% 2.25μs 15.5MiB 11.8% 3.88KiB
hide trap 4.11k 6.94ms 0.09% 1.69μs 1.63MiB 1.24% 416B
emission 4.11k 192ms 2.56% 46.8μs 11.2MiB 8.52% 2.80KiB
clean-up 4.11k 14.0ms 0.19% 3.41μs 578KiB 0.43% 144B
validation 8.22k 120ms 1.60% 14.7μs 9.15MiB 6.95% 1.14KiB
Julia front-end 4.11k 19.8ms 0.26% 4.81μs 2.07MiB 1.57% 528B
────────────────────────────────────────────────────────────────────────────────────
With some of the more complicated broadcasts:
julia> @eval GPUArrays begin
           function broadcast_kernel(ctx, dest, bc′, nelem)
               for i in 1:nelem
                   I = @cartesianidx(dest, i)
                   @inbounds dest[I] = bc′[I]
               end
               return
           end
       end
# I couldn't figure out how to use the anonymous function defined in GPUArrays; the only function, #broadcast_kernel#17#, didn't have any methods
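Regarding that comment: one way to recover the kernel's Broadcasted argument type from an ordinary expression, instead of writing it out by hand, might be the following sketch. It relies on Base broadcast internals and cudaconvert, so details may vary by version:

```julia
using CUDA
using Base.Broadcast: broadcasted, instantiate, preprocess

Δ, mt = CUDA.rand(10, 10), CUDA.zeros(10, 10)
bc  = instantiate(broadcasted(*, Δ, mt))
bc′ = preprocess(Δ, bc)           # host-side form, with Extruded wrappers
typeof(cudaconvert(bc′))          # device-side type to splice into the code_ptx Tuple
```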
julia> GPUCompiler.enable_timings()
julia> @benchmark CUDA.code_ptx(devnull, GPUArrays.broadcast_kernel, Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}}}, Float32}}}}, Float32}}, Int64})
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 444.470 ms … 474.020 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 449.698 ms ┊ GC (median): 0.00%
Time (mean ± σ): 452.559 ms ± 10.287 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
453 ms                   Histogram: frequency by time                   473 ms <
Memory estimate: 1.21 MiB, allocs estimate: 18680.
julia> GPUCompiler.timings()
────────────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 882s / 1.49% 6.21GiB / 0.69%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
LLVM middle-end 29 12.7s 96.8% 439ms 14.9MiB 34.2% 528KiB
optimization 29 11.5s 87.5% 396ms 401KiB 0.90% 13.8KiB
target libraries 29 1.09s 8.28% 37.5ms 13.1MiB 29.9% 462KiB
clean-up 29 93.8ms 0.71% 3.23ms 34.0KiB 0.08% 1.17KiB
runtime library 29 33.6ms 0.26% 1.16ms 4.53KiB 0.01% 160B
IR generation 29 294ms 2.24% 10.1ms 28.5MiB 65.2% 0.98MiB
rewrite 29 127ms 0.96% 4.37ms 11.5MiB 26.3% 406KiB
lower throw 29 2.67ms 0.02% 92.2μs 1.86MiB 4.26% 65.8KiB
hide trap 29 411μs 0.00% 14.2μs 155KiB 0.35% 5.33KiB
emission 29 108ms 0.82% 3.71ms 16.6MiB 38.1% 587KiB
clean-up 29 614μs 0.00% 21.2μs 86.5KiB 0.19% 2.98KiB
LLVM back-end 29 121ms 0.92% 4.19ms 240KiB 0.54% 8.26KiB
machine-code generation 29 121ms 0.92% 4.16ms 212KiB 0.47% 7.31KiB
preparation 29 813μs 0.01% 28.0μs 26.1KiB 0.06% 920B
Julia front-end 29 362μs 0.00% 12.5μs 17.7KiB 0.04% 624B
validation 29 301μs 0.00% 10.4μs 17.7KiB 0.04% 624B
strip debug info 29 123μs 0.00% 4.24μs 0.00B 0.00% 0.00B
────────────────────────────────────────────────────────────────────────────────────
With 85% spent during LLVM optimization, I'm not sure we can do much better.
Also interesting: dropping the inference cache before every compilation adds about 20%:
julia> @benchmark (empty!(CUDA.ci_cache); CUDA.code_ptx(devnull, GPUArrays.broadcast_kernel, Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}}}, Float32}}}}, Float32}}, Int64}))
BenchmarkTools.Trial: 9 samples with 1 evaluation.
Range (min … max): 557.845 ms … 571.197 ms ┊ GC (min … max): 1.11% … 2.94%
Time (median): 568.254 ms ┊ GC (median): 1.94%
Time (mean ± σ): 565.630 ms ± 4.952 ms ┊ GC (mean ± σ): 1.70% ± 0.87%
█
█▁▁▁▁ ▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
561 ms                   Histogram: frequency by time                   569 ms <
Memory estimate: 81.53 MiB, allocs estimate: 1512413.
Again, all this is different from the time-to-first-kernel, which includes a lot of time (re)compiling the Julia and GPU compilers themselves.
There may be an additional source of latency: initial compilation of the Julia compiler itself. Since we use a custom AbstractInterpreter, it's possible a lot of code there is recompiled for the first kernel, but not for subsequent compilations (i.e. wiping the above caches shouldn't affect that).
That sounds like something we could precompile?
That sounds like something we could precompile?
You can't precompile another module's methods, right? I thought there was a Base issue for this, can't find it immediately.
I thought you can as long as you own one of the types? But yes, the rules are arcane.
IIRC it used to be 200µs or so, and @vtjnash mentioned Base compilation taking 1ms or so for the simplest functions. This is pretty slow then: