maleadt opened this issue 3 years ago
I think I came across this while working on https://discourse.julialang.org/t/a-demo-is-1-5x-faster-in-flux-than-tensorflow-both-use-cpu-while-3-0x-slower-during-using-cuda/66703/. Most Flux optimisers (e.g. ADAM) use a set of fused broadcasts, and it seems compiling these can take a significant amount of time.
Do you have a reproducer without all of Flux (or without having to use Flux at run time)? I.e., is there a specific kernel that compiles excessively slow, or is it just that we're compiling a lot of kernels somewhat slowly?
I believe it's the latter, specifically fused broadcast kernels:
using CUDA
function adam_combined!(Δ, mt, vt, β, ϵ, η)
    @. mt = β[1] * mt + (1 - β[1]) * Δ
    @. vt = β[2] * vt + (1 - β[2]) * Δ^2
    @. Δ = mt / (1 - β[1]) / (√(vt / (1 - β[2])) + ϵ) * η
    return mt, vt
end
# Ensure broadcast machinery is already compiled
let a = CUDA.rand(10), b = CUDA.rand(10)
    @. a = a * b / a + b
end
let Δ = CUDA.rand(10, 10), mt = CUDA.zeros(10, 10), vt = CUDA.zeros(10, 10), β = (0.9, 0.999)
    CUDA.GPUCompiler.enable_timings()
    @time CUDA.@sync adam_combined!(Δ, mt, vt, β, Float32(1e-8), 0.001f0)
    CUDA.GPUCompiler.timings()
end
Which outputs:
3.119189 seconds (8.26 M allocations: 472.070 MiB, 4.08% gc time)
─────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 18446042150s / 0.00% 3.81GiB / 0.09%
Section ncalls time %tot avg alloc %tot avg
─────────────────────────────────────────────────────────────────────────────
LLVM middle-end 3 567ms 76.7% 189ms 764KiB 22.4% 255KiB
optimization 3 486ms 65.7% 162ms 19.8KiB 0.58% 6.61KiB
target libraries 3 73.2ms 9.89% 24.4ms 711KiB 20.8% 237KiB
clean-up 3 5.42ms 0.73% 1.81ms 2.91KiB 0.09% 992B
runtime library 3 2.92ms 0.39% 973μs 720B 0.02% 240B
IR generation 3 172ms 23.3% 57.5ms 2.59MiB 77.6% 884KiB
rewrite 3 89.6ms 12.1% 29.9ms 426KiB 12.5% 142KiB
emission 3 14.5ms 1.96% 4.85ms 1.71MiB 51.3% 584KiB
clean-up 3 25.6μs 0.00% 8.54μs 0.00B 0.00% 0.00B
─────────────────────────────────────────────────────────────────────────────
I'm not sure why there's such a large gap between the total runtime and the time tracked by GPUCompiler. The cuModuleLoadDataEx -> cuMemHostAlloc -> cuMemcpyHtoDAsync -> broadcast kernel sequence takes ~100ms, so at best that leaves another 1.5s+ unaccounted for.
I'm working on capturing a flamegraph for the host-side operations, will update with that shortly. Edit: see https://gist.github.com/ToucheSir/43077d99e2318004c175cd5ba1346ea7.
Lots of inference time, it seems?
That, and the third broadcast spends the bulk of its time running the LLVM pass pipeline? I guess I'm confused by why both would be so much slower for the GPU path compared to the CPU one.
CPU compilation uses Julia's native code and inference caches, which are already "hot" from the sysimage and precompile data. GPU compilation through GPUCompiler essentially starts with empty caches, requiring reinferring and recompiling "the world" (every function called or callable during GPU kernel execution).
Is that for every new kernel or just the first one? I tried running a broadcasted operation before profiling in the hopes of having as much as possible "warm", but maybe that wasn't enough?
I also just tried turning each broadcast into its own kernel (with synchronization in between) and that runs in less time (1.5s vs 2.4s) than the broadcasted version. Does compiling fused broadcasts also add significant overhead somewhere?
Does compiling fused broadcasts also add significant overhead somewhere?
Inference of broadcast is significantly harder.
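To illustrate the difference: fusing an expression into one broadcast builds a single, deeply nested Broadcasted type that inference has to work through in one go, while separate broadcasts each produce a shallow type. A CPU-only sketch (the exact type parameters elided with "…" in the comments):

```julia
using Base.Broadcast: broadcasted

a = rand(Float32, 4); b = rand(Float32, 4)

# Unfused: a flat argument tuple of plain arrays.
flat = broadcasted(+, a, b)

# Fused (what @. produces): arguments are themselves Broadcasted objects,
# so the type nests one level deeper per fused operation.
fused = broadcasted(+, broadcasted(*, a, broadcasted(/, b, 2f0)), 1f0)

println(typeof(flat))   # Broadcasted{…, typeof(+), Tuple{Vector{Float32}, Vector{Float32}}}
println(typeof(fused))  # Broadcasted nested three levels deep
```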
Is that for every new kernel or just the first one?
Every initial compilation of a new function (e.g. a broadcast with a new closure) will require some inference.
You can try measuring after clearing some of the caches: empty!(CUDA.cufunction_cache(context())) to drop the compiled CUDA objects (which will require recompilation at the LLVM level), and empty!(CUDA.ci_cache) to additionally drop the inference cache (and thus require inference).
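Putting those together, a sketch of the experiment (these names are CUDA.jl internals and may differ between versions):

```julia
using CUDA

dst, src = CUDA.zeros(10), CUDA.rand(10)
@time CUDA.@sync @. dst = src * 2f0       # first call: full compilation

empty!(CUDA.cufunction_cache(context()))  # drop compiled CUDA objects
@time CUDA.@sync @. dst = src * 2f0       # recompiles at the LLVM level

empty!(CUDA.cufunction_cache(context()))
empty!(CUDA.ci_cache)                     # additionally drop the inference cache
@time CUDA.@sync @. dst = src * 2f0       # reinfers and recompiles
```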
There may be an additional source of latency: initial compilation of the Julia compiler itself. Since we use a custom AbstractInterpreter, it's possible a lot of code there is recompiled for the first kernel, but not for subsequent compilations (i.e. wiping the above caches shouldn't affect that).
Thanks for the explanation. For Flux, I guess that leaves us with either adding dispatches to native kernels/KernelAbstractions kernels on GPU to ameliorate the overhead of multiple broadcasts, or waiting for as-yet-unknown upstream compiler improvements to smooth things out.
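For reference, folding the three fused broadcasts into a single hand-written kernel might look roughly like this. This is only a sketch, not Flux's actual implementation; adam_kernel! is a made-up name:

```julia
using CUDA

function adam_kernel!(Δ, mt, vt, β1, β2, ϵ, η)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(Δ)
        @inbounds begin
            mt[i] = β1 * mt[i] + (1 - β1) * Δ[i]
            vt[i] = β2 * vt[i] + (1 - β2) * Δ[i]^2
            Δ[i]  = mt[i] / (1 - β1) / (sqrt(vt[i] / (1 - β2)) + ϵ) * η
        end
    end
    return
end

Δ, mt, vt = CUDA.rand(100), CUDA.zeros(100), CUDA.zeros(100)
@cuda threads=256 blocks=cld(length(Δ), 256) adam_kernel!(Δ, mt, vt, 0.9f0, 0.999f0, 1f-8, 1f-3)
```

One kernel means one inference/compilation pass instead of three, at the cost of maintaining the kernel by hand.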
FWIW, GPU compilation of simple kernels is pretty OK:
julia> GPUCompiler.enable_timings()
julia> @benchmark ptx_code_execution(identity, Tuple{Nothing})
BenchmarkTools.Trial: 2027 samples with 1 evaluation.
Range (min … max): 2.279 ms … 175.390 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.363 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.462 ms ± 3.856 ms ┊ GC (mean ± σ): 0.13% ± 1.65%
█
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
2.35 ms                  Histogram: frequency by time                  3.38 ms <
Memory estimate: 54.94 KiB, allocs estimate: 760.
julia> GPUCompiler.timings()
────────────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 85.9s / 8.75% 2.11GiB / 6.11%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
LLVM middle-end 4.11k 3.75s 49.9% 912μs 14.3MiB 10.9% 3.56KiB
optimization 4.11k 3.63s 48.3% 884μs 10.9MiB 8.28% 2.72KiB
clean-up 4.11k 87.9ms 1.17% 21.4μs 1.00MiB 0.76% 256B
target libraries 4.11k 2.50ms 0.03% 608ns 0.00B 0.00% 0.00B
LLVM back-end 4.11k 2.55s 33.9% 620μs 7.06MiB 5.36% 1.76KiB
machine-code generation 4.11k 2.45s 32.6% 596μs 5.96MiB 4.52% 1.48KiB
preparation 4.11k 89.3ms 1.19% 21.7μs 1.10MiB 0.83% 280B
IR generation 4.11k 1.08s 14.4% 263μs 99.2MiB 75.3% 24.7KiB
rewrite 4.11k 625ms 8.31% 152μs 74.2MiB 56.3% 18.5KiB
hide unreachable 4.11k 23.4ms 0.31% 5.69μs 5.77MiB 4.38% 1.44KiB
predecessors 4.11k 4.83ms 0.06% 1.18μs 706KiB 0.52% 176B
find 4.11k 1.87ms 0.02% 456ns 642KiB 0.48% 160B
replace 4.11k 375μs 0.00% 91.4ns 0.00B 0.00% 0.00B
lower throw 4.11k 9.25ms 0.12% 2.25μs 15.5MiB 11.8% 3.88KiB
hide trap 4.11k 6.94ms 0.09% 1.69μs 1.63MiB 1.24% 416B
emission 4.11k 192ms 2.56% 46.8μs 11.2MiB 8.52% 2.80KiB
clean-up 4.11k 14.0ms 0.19% 3.41μs 578KiB 0.43% 144B
validation 8.22k 120ms 1.60% 14.7μs 9.15MiB 6.95% 1.14KiB
Julia front-end 4.11k 19.8ms 0.26% 4.81μs 2.07MiB 1.57% 528B
────────────────────────────────────────────────────────────────────────────────────
With some of the more complicated broadcasts:
julia> @eval GPUArrays begin
           function broadcast_kernel(ctx, dest, bc′, nelem)
               for i in 1:nelem
                   I = @cartesianidx(dest, i)
                   @inbounds dest[I] = bc′[I]
               end
               return
           end
       end
# I couldn't figure out how to use the anonymous function defined in GPUArrays; the only function, #broadcast_kernel#17#, didn't have any methods
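Regarding that comment: one way to recover the kernel's Broadcasted argument type from an ordinary expression, instead of writing it out by hand, might be the following sketch. It relies on Base broadcast internals and cudaconvert, so details may vary by version:

```julia
using CUDA
using Base.Broadcast: broadcasted, instantiate, preprocess

Δ, mt = CUDA.rand(10, 10), CUDA.zeros(10, 10)
bc  = instantiate(broadcasted(*, Δ, mt))
bc′ = preprocess(Δ, bc)           # host-side form, with Extruded wrappers
typeof(cudaconvert(bc′))          # device-side type to splice into the code_ptx Tuple
```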
julia> GPUCompiler.enable_timings()
julia> @benchmark CUDA.code_ptx(devnull, GPUArrays.broadcast_kernel, Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}}}, Float32}}}}, Float32}}, Int64})
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 444.470 ms … 474.020 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 449.698 ms ┊ GC (median): 0.00%
Time (mean ± σ): 452.559 ms ± 10.287 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
453 ms                   Histogram: frequency by time                   473 ms <
Memory estimate: 1.21 MiB, allocs estimate: 18680.
julia> GPUCompiler.timings()
────────────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 882s / 1.49% 6.21GiB / 0.69%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
LLVM middle-end 29 12.7s 96.8% 439ms 14.9MiB 34.2% 528KiB
optimization 29 11.5s 87.5% 396ms 401KiB 0.90% 13.8KiB
target libraries 29 1.09s 8.28% 37.5ms 13.1MiB 29.9% 462KiB
clean-up 29 93.8ms 0.71% 3.23ms 34.0KiB 0.08% 1.17KiB
runtime library 29 33.6ms 0.26% 1.16ms 4.53KiB 0.01% 160B
IR generation 29 294ms 2.24% 10.1ms 28.5MiB 65.2% 0.98MiB
rewrite 29 127ms 0.96% 4.37ms 11.5MiB 26.3% 406KiB
lower throw 29 2.67ms 0.02% 92.2μs 1.86MiB 4.26% 65.8KiB
hide trap 29 411μs 0.00% 14.2μs 155KiB 0.35% 5.33KiB
emission 29 108ms 0.82% 3.71ms 16.6MiB 38.1% 587KiB
clean-up 29 614μs 0.00% 21.2μs 86.5KiB 0.19% 2.98KiB
LLVM back-end 29 121ms 0.92% 4.19ms 240KiB 0.54% 8.26KiB
machine-code generation 29 121ms 0.92% 4.16ms 212KiB 0.47% 7.31KiB
preparation 29 813μs 0.01% 28.0μs 26.1KiB 0.06% 920B
Julia front-end 29 362μs 0.00% 12.5μs 17.7KiB 0.04% 624B
validation 29 301μs 0.00% 10.4μs 17.7KiB 0.04% 624B
strip debug info 29 123μs 0.00% 4.24μs 0.00B 0.00% 0.00B
────────────────────────────────────────────────────────────────────────────────────
With 85% spent during LLVM optimization, I'm not sure we can do much better.
Also interesting: dropping the inference cache before every compilation adds about 20%:
julia> @benchmark (empty!(CUDA.ci_cache); CUDA.code_ptx(devnull, GPUArrays.broadcast_kernel, Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}}}, Float32}}}}, Float32}}, Int64}))
BenchmarkTools.Trial: 9 samples with 1 evaluation.
Range (min … max): 557.845 ms … 571.197 ms ┊ GC (min … max): 1.11% … 2.94%
Time (median): 568.254 ms ┊ GC (median): 1.94%
Time (mean ± σ): 565.630 ms ± 4.952 ms ┊ GC (mean ± σ): 1.70% ± 0.87%
█
█▁▁▁▁ ▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
561 ms                   Histogram: frequency by time                   569 ms <
Memory estimate: 81.53 MiB, allocs estimate: 1512413.
Again, all this is different from the time-to-first-kernel, which includes a lot of time (re)compiling the Julia and GPU compilers themselves.
There may be an additional source of latency: initial compilation of the Julia compiler itself. Since we use a custom AbstractInterpreter, it's possible a lot of code there is recompiled for the first kernel, but not for subsequent compilations (i.e. wiping the above caches shouldn't affect that).
That sounds like something we could precompile?
That sounds like something we could precompile?
You can't precompile another module's methods, right? I thought there was a Base issue for this, can't find it immediately.
I thought you can as long as you own one of the types? But yes, the rules are arcane.
IIRC it used to be 200µs or so, and @vtjnash mentioned Base compilation taking 1ms or so for the simplest functions. This is pretty slow then: