Closed: navidcy closed this issue 1 year ago
I'm also detecting a massive slowdown in model execution and out-of-control allocations on the CPU. Type inference failure? Unfortunate to find this so late, after the PRs are merged.
@navidcy what's the last version before the catastrophic performance loss? I'll run a benchmark to compare with `main`.
Kind of a random case, but here are some timings from a calibration problem I'm working on (I'm running 6 3D simulations, ranging from roughly 33 to 117 time-steps, with size either (6, 10, 32) or (6, 10, 64)). For these tests I just downgraded KernelAbstractions via Project.toml:
KernelAbstractions 0.7.2 (also downgraded CUDAKernels):

```
23.405205 seconds (25.27 M allocations: 2.610 GiB, 1.78% gc time, 99.80% compilation time)
 5.019944 seconds (5.92 M allocations: 475.126 MiB, 0.86% gc time, 98.22% compilation time)
 0.067385 seconds (107.25 k allocations: 72.628 MiB)
 0.090308 seconds (107.25 k allocations: 72.628 MiB)
 0.139109 seconds (217.20 k allocations: 147.487 MiB)
 0.197798 seconds (217.20 k allocations: 147.487 MiB)
```
The first two simulations are affected by compilation, but things go fast after that.
KernelAbstractions 0.8.6:

```
 4.914645 seconds (28.10 M allocations: 6.039 GiB, 15.62% gc time, 51.22% compilation time)
 5.011717 seconds (31.58 M allocations: 10.844 GiB, 19.87% gc time, 16.87% compilation time)
 4.236418 seconds (27.41 M allocations: 11.073 GiB, 21.75% gc time)
 8.501561 seconds (55.04 M allocations: 22.118 GiB, 22.03% gc time)
 8.618707 seconds (56.01 M allocations: 22.627 GiB, 21.83% gc time)
17.081286 seconds (112.47 M allocations: 45.197 GiB, 21.73% gc time)
```
Smells like type inference failure to me. Some informal exploration shows that the tendency calculations dominate this problem (as they do many others) --- so it's a pretty basic issue I suspect.
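For intuition, here's a tiny, self-contained sketch (hypothetical names, no Oceananigans) of how a type-inference failure surfaces as exactly this kind of per-iteration allocation in `@time`-style output:

```julia
# Minimal sketch (hypothetical names, not Oceananigans code): when the compiler
# cannot infer a type, every loop iteration boxes intermediate values, which
# shows up as huge allocation counts like the timings above.
coef = 2.0  # non-const global: its type cannot be inferred inside functions

function stable(x, c)       # c is an argument: fully inferable
    s = 0.0
    for v in x
        s += c * v
    end
    return s
end

function unstable(x)        # reads the untyped global: dynamic dispatch per iteration
    s = 0.0
    for v in x
        s += coef * v
    end
    return s
end

x = rand(10_000)
stable(x, coef); unstable(x)   # warm up to exclude compilation

# The inferable version allocates (almost) nothing; the uninferable one
# allocates on every loop iteration:
@show @allocated(stable(x, coef))
@show @allocated(unstable(x))
```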
Something a little puzzling to me is that we clearly succeed at type inference when running on the GPU. Do we have a better chance of succeeding there because of `adapt`? We do make some critical simplifications via `adapt`, most notably
That's an order of magnitude slowdown... And that's not just compilation times, right?
Compilation actually speeds up; it's the run time that slows down. (The examples are probably running very, very slowly.)
So the runtime is 50-100x slower. But because there's other stuff in the docs (and compilation time matters too), the total slowdown for the docs build is just 2x.
@simone-silvestri this is a big deal, because 50x is essentially standing still --- hardly usable except for the smallest problems.
Do you have a profile to share?
Also, an MWE would be nice for debugging.
I haven't done any profiling --- just simple benchmarks. (Short example coming soon)
Ok here's something simple:
```julia
using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)
```
Here's what I've done on `main`. This returns:

```
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  20.460 s (144483404 allocations: 94.43 GiB)
```

With KernelAbstractions downgraded, it's instead:

```
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  2.202 s (118604 allocations: 52.20 MiB)
```
I'm running on a single core, Mac M1. Here the performance loss is just 10x so I'll change the somewhat dramatic title of this issue.
With `advection = CenteredSecondOrder()`, the differences are more dramatic (maybe this explains the 100x slowdown I saw with a simple setup):

```
17.859 s (144483404 allocations: 94.43 GiB)  # KA 0.8
294.401 ms (118604 allocations: 52.20 MiB)   # KA 0.7
```
If we look just at `calculate_tendencies!` via

```julia
using Oceananigans
using Oceananigans.TimeSteppers: calculate_tendencies!
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        # time_step!(model, Δt)
        calculate_tendencies!(model, [])
    end
end

@btime lots_of_steps!(model, 0.01)
```
the results are (`advection = WENO()`):

```
5.268 s (23061000 allocations: 11.73 GiB)  # KA 0.8
1.989 s (14600 allocations: 13.03 MiB)     # KA 0.7
```
and with `advection = CenteredSecondOrder()`:

```
2.846 s (23061000 allocations: 11.73 GiB)   # KA 0.8
105.867 ms (14600 allocations: 13.03 MiB)   # KA 0.7
```
Simple kernels are safe (in fact, faster):
```julia
using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ
using KernelAbstractions: @kernel, @index
using BenchmarkTools

# Pass grid into the kernel explicitly rather than relying on the global
@kernel function _diffuse!(c, grid, Δt)
    i, j, k = @index(Global, NTuple)
    @inbounds c[i, j, k] += Δt * ∇²ᶜᶜᶜ(i, j, k, grid, c)
end

function diffuse!(c, Δt)
    grid = c.grid
    arch = grid.architecture
    ev = launch!(arch, grid, :xyz, _diffuse!, c, grid, Δt)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)

@btime lots_of_steps!(c, 0.01)
```
yields
```
447.763 ms (9832300 allocations: 975.37 MiB)  # KA 0.8
499.522 ms (9832300 allocations: 1.81 GiB)    # KA 0.7
```
Adding `fill_halo_regions!` doesn't change much either.
I found something. It's hard to provide code since I hacked apart the code base for this, so I'll try to describe it.
I reduced `calculate_tendencies!` to just calculating the `u` tendency. I then wrote two methods:

```julia
u_velocity_tendency(i, j, k, grid) = zero(grid)
```
and
```julia
@inline function u_velocity_tendency(i, j, k, grid,
                                     advection,
                                     coriolis,
                                     stokes_drift,
                                     closure,
                                     u_immersed_bc,
                                     buoyancy,
                                     background_fields,
                                     velocities,
                                     tracers,
                                     auxiliary_fields,
                                     diffusivities,
                                     forcings,
                                     hydrostatic_pressure,
                                     clock)

    return zero(grid)
end
```
Even though these functions have identical output, applying `launch!` to the "long argument list" version is 300x slower. So the problem is in `launch!` somewhere?
Maybe splatting is the issue, since we write

```julia
""" Calculate the right-hand-side of the u-velocity equation. """
@kernel function calculate_Gu!(Gu, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds Gu[i, j, k] = u_velocity_tendency(i, j, k, args...)
end
```

?
Ok... maybe this is a minimal example:
```julia
using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.BoundaryConditions: fill_halo_regions!
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ
using KernelAbstractions: @kernel, @index
using BenchmarkTools

@inline laplacian(i, j, k, grid, c, args...) = ∇²ᶜᶜᶜ(i, j, k, grid, c)

# Pass grid into the kernel explicitly rather than relying on the global
@kernel function _diffuse!(c, grid, Δt, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds c[i, j, k] += Δt * laplacian(i, j, k, grid, c, args...)
end

function diffuse!(c, Δt)
    grid = c.grid
    Nargs = 2 # this is the key
    args = Tuple(grid for i = 1:Nargs)
    arch = grid.architecture
    fill_halo_regions!(c)
    ev = launch!(arch, grid, :xyz, _diffuse!, c, grid, Δt, args...)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)

@btime lots_of_steps!(c, 0.01)
```
Ok, so with nothing different except the KA version, this is what we find:
| Nargs | KA 0.8 | KA 0.72 |
|---|---|---|
| 1 | 479.273 ms (9847200 allocations: 977.66 MiB) | 686.707 ms (11485600 allocations: 2.57 GiB) |
| 2 | 1.281 s (14801200 allocations: 4.75 GiB) | 840.756 ms (13124400 allocations: 3.32 GiB) |
| 3 | 1.399 s (16447200 allocations: 5.52 GiB) | 981.717 ms (14764000 allocations: 4.08 GiB) |
So the likely answer here is that CPU execution is no longer running through my custom Cassette infrastructure, which inlined everything even on the CPU. So you are hitting the normal case of Julia deciding that it isn't worth inlining complicated functions.
The benefits of not using Cassette are that backtraces are useful (for errors and profiling) and that compile time is much improved.
There is a long-term fix that one could explore... I don't currently have time for this, but it might make an interesting UROP project: use a custom host compiler via GPUCompiler/LLVM.jl/AbstractInterpreter, and maybe OpaqueClosure, to mimic the compilation of the GPU code.
In some ways this comes back to the fundamental question: what is the point of KernelAbstractions CPU support? I originally intended it only for making debugging easier...
But folks seem to be depending on it as a performance solution... I think it is feasible to get there, but it would require quite a bit of time and effort.
Ok, so if we shouldn't be using KernelAbstractions for performance on the CPU, then I think this means we should write our own CPU infrastructure and rely on KA only for the GPU? Is that what you recommend?
> Is that what you recommend?
That wouldn't solve your problem here. KA gives you reasonable performance on the CPU, but since KA 0.8 its execution story on the CPU is much closer to the rest of Julia, and it doesn't play special tricks.
Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?
Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.
> in developing our own CPU infrastructure
My point is that your own CPU infrastructure will encounter these same issues, unless you are willing to do large amounts of work, in which case I would invite you to improve KA instead.
Re-posting from #3026... that PR solved performance problems with `NonhydrostaticModel`, but `HydrostaticFreeSurfaceModel` is still roughly 2x slower than when using KA 0.7.2. Here's a simple benchmark:
```julia
using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1),
                       topology=(Periodic, Periodic, Bounded))

model = HydrostaticFreeSurfaceModel(; grid, momentum_advection=WENO(), tracer_advection=WENO())

ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)
```
Results:

```
10.220 s (85845109 allocations: 37.94 GiB)  # main
 6.284 s (66184308 allocations: 16.31 GiB)  # main with KA downgraded to 0.7.2
```
cc @simone-silvestri
> Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?
>
> Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.
To follow up with @vchuravy: it seems that rewriting just some of the code was sufficient, so we are (probably) in the clear! The lesson learned is that we cannot slurp / splat `@kernel` function arguments, because it prevents the kernel code from being inlined.
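To make the lesson concrete, here is a generic sketch of the refactor (hypothetical names, not the actual Oceananigans kernels; in plain Julia both versions compile fine, but inside a KA `@kernel` the slurp/splat was the pattern that defeated inlining):

```julia
# Hypothetical sketch of the refactor; not the actual Oceananigans kernels.
tendency(i, advection, coriolis) = advection[i] + coriolis[i]

# Before: the kernel slurps `args...` and re-splats into the RHS function.
function kernel_splat!(G, args...)
    for i in eachindex(G)
        @inbounds G[i] = tendency(i, args...)
    end
    return G
end

# After: every argument is listed explicitly, so the call is concrete and
# the RHS function can be inlined into the kernel body.
function kernel_explicit!(G, advection, coriolis)
    for i in eachindex(G)
        @inbounds G[i] = tendency(i, advection, coriolis)
    end
    return G
end

advection, coriolis = rand(8), rand(8)
G1, G2 = zeros(8), zeros(8)
kernel_splat!(G1, advection, coriolis)
kernel_explicit!(G2, advection, coriolis)
@assert G1 == G2  # identical results; only the calling convention differs
```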
I get:

```
12.372 s (119521270 allocations: 70.48 GiB)  # main
 8.320 s (66686606 allocations: 16.40 GiB)   # main w/ downgraded KA 0.7.3 & CUDAKernels 0.3.3
```

That's a 50% slowdown and many more allocations (if that matters).
I think maybe we caught most of the problem kernels but not all...
Does it change with different solvers? I'll do some testing today to try to snoop out the issue
The issues with the `HydrostaticFreeSurfaceModel` are the tendency kernels. The difference with the non-hydrostatic model is that we do not know a priori which RHS function to call (for example, CATKE has an `:e` tracer that requires a different RHS function, and the same goes for a one-equation parameterization of mesoscales that evolves an additional tracer equation for the mesoscale energy `:K`).
Our solution now is to infer the RHS function and pass it as an argument to the kernel. Apparently, this prevents specialization. I'll come up with a solution today.
Also, the implicit vertical solver seems to affect performance. I would guess that it is because we are passing functions as arguments to the kernel.
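One standard Julia mitigation for function-valued arguments (a generic sketch with hypothetical names, not necessarily what was done here): force specialization with a `::F where F` signature, so a separate, fully specialized method instance is compiled per concrete RHS function rather than treating it as an opaque `Function`:

```julia
# Hypothetical sketch: forcing specialization on a function-valued argument.
rhs_default(i, c) = 2c[i]
rhs_tke(i, c)     = 3c[i]   # e.g. a different RHS for a TKE-like tracer

# The `rhs::F` type parameter makes Julia compile a specialized method
# instance per RHS function (Julia's heuristics may otherwise avoid
# specializing on ::Function arguments that are only passed through).
function apply_rhs!(G, rhs::F, c) where F
    for i in eachindex(G)
        @inbounds G[i] = rhs(i, c)
    end
    return G
end

c = [1.0, 2.0, 3.0]
@assert apply_rhs!(similar(c), rhs_default, c) == [2.0, 4.0, 6.0]
@assert apply_rhs!(similar(c), rhs_tke, c)     == [3.0, 6.0, 9.0]
```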
I observe a discontinuous doubling of docs building time, from 3 hours at
https://buildkite.com/clima/oceananigans/builds/9920#01862a69-61ee-4e59-82b4-5c4b4684e7b6
to 6 hours after "Merge pull request #2899 from CliMA/vc/ka_upgrade":
https://buildkite.com/clima/oceananigans/builds/9935#01862d0c-e009-4a05-a8bb-023d1159a2ae
cc @simone-silvestri, @vchuravy