Closed: navidcy closed this issue 1 year ago
I'm also detecting a massive slowdown in model execution and out-of-control allocations on the CPU. Type inference failure? Unfortunate to find this so late, after the PRs are merged.
@navidcy what's the last version before the catastrophic performance loss? I'll run a benchmark to compare with `main`.
Kind of a random case, but here are some timings from a calibration problem I'm working on (I'm running 6 3D simulations, ranging from roughly 33 to 117 time-steps, with size either (6, 10, 32) or (6, 10, 64)). For these tests I just downgraded KernelAbstractions via Project.toml:
KernelAbstractions 0.7.2 (also downgraded CUDAKernels):

```
23.405205 seconds (25.27 M allocations: 2.610 GiB, 1.78% gc time, 99.80% compilation time)
 5.019944 seconds (5.92 M allocations: 475.126 MiB, 0.86% gc time, 98.22% compilation time)
 0.067385 seconds (107.25 k allocations: 72.628 MiB)
 0.090308 seconds (107.25 k allocations: 72.628 MiB)
 0.139109 seconds (217.20 k allocations: 147.487 MiB)
 0.197798 seconds (217.20 k allocations: 147.487 MiB)
```
The first two simulations are affected by compilation, but things go fast after that.
KernelAbstractions 0.8.6:

```
 4.914645 seconds (28.10 M allocations: 6.039 GiB, 15.62% gc time, 51.22% compilation time)
 5.011717 seconds (31.58 M allocations: 10.844 GiB, 19.87% gc time, 16.87% compilation time)
 4.236418 seconds (27.41 M allocations: 11.073 GiB, 21.75% gc time)
 8.501561 seconds (55.04 M allocations: 22.118 GiB, 22.03% gc time)
 8.618707 seconds (56.01 M allocations: 22.627 GiB, 21.83% gc time)
17.081286 seconds (112.47 M allocations: 45.197 GiB, 21.73% gc time)
```
Smells like type inference failure to me. Some informal exploration shows that the tendency calculations dominate this problem (as they do many others) --- so it's a pretty basic issue I suspect.
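For intuition, here's a tiny, self-contained sketch (hypothetical names, no Oceananigans) of how a type-inference failure surfaces as exactly this kind of per-iteration allocation in `@time`-style output:

```julia
# Minimal sketch (hypothetical names, not Oceananigans code): when the compiler
# cannot infer a type, every loop iteration boxes intermediate values, which
# shows up as huge allocation counts like the timings above.
coef = 2.0  # non-const global: its type cannot be inferred inside functions

function stable(x, c)       # c is an argument: fully inferable
    s = 0.0
    for v in x
        s += c * v
    end
    return s
end

function unstable(x)        # reads the untyped global: dynamic dispatch per iteration
    s = 0.0
    for v in x
        s += coef * v
    end
    return s
end

x = rand(10_000)
stable(x, coef); unstable(x)   # warm up to exclude compilation

# The inferable version allocates (almost) nothing; the uninferable one
# allocates on every loop iteration:
@show @allocated(stable(x, coef))
@show @allocated(unstable(x))
```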
Something a little puzzling to me is that we clearly succeed at type inference when running on the GPU. Do we have a better chance of succeeding there because of `adapt`? We do make some critical simplifications via `adapt`, most notably
That's an order of magnitude slowdown... And that's not just compilation times, right?
Compilation actually speeds up; it's the run time that slows down. (The examples are probably running very, very slowly.)
So the runtime is 50-100x slower. But because there's other stuff in the docs (and compilation time matters too), the total slowdown for the docs build is just 2x.
@simone-silvestri this is a big deal, because 50x is essentially standing still --- hardly usable except for the smallest problems.
Do you have a profile to share?
Also, an MWE would be nice for debugging.
I haven't done any profiling --- just simple benchmarks. (Short example coming soon)
Ok here's something simple:
```julia
using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)
```
Here's what I've done on `main`. This returns:

```
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  20.460 s (144483404 allocations: 94.43 GiB)
```

With KernelAbstractions downgraded, it's instead:

```
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  2.202 s (118604 allocations: 52.20 MiB)
```
I'm running on a single core, Mac M1. Here the performance loss is just 10x so I'll change the somewhat dramatic title of this issue.
With `advection = CenteredSecondOrder()`, the differences are more dramatic (maybe this explains the 100x slowdown I saw with a simple setup):

```
17.859 s (144483404 allocations: 94.43 GiB)  # KA 0.8
294.401 ms (118604 allocations: 52.20 MiB)   # KA 0.7
```
If we look just at `calculate_tendencies!` via

```julia
using Oceananigans
using Oceananigans.TimeSteppers: calculate_tendencies!
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        # time_step!(model, Δt)
        calculate_tendencies!(model, [])
    end
end

@btime lots_of_steps!(model, 0.01)
```
the results are (`advection = WENO()`):

```
5.268 s (23061000 allocations: 11.73 GiB)  # KA 0.8
1.989 s (14600 allocations: 13.03 MiB)     # KA 0.7
```
and with `advection = CenteredSecondOrder()`:

```
2.846 s (23061000 allocations: 11.73 GiB)   # KA 0.8
105.867 ms (14600 allocations: 13.03 MiB)   # KA 0.7
```
Simple kernels are safe (in fact, faster):
```julia
using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ
using KernelAbstractions: @kernel, @index
using BenchmarkTools

# Pass grid into the kernel explicitly rather than relying on the global
@kernel function _diffuse!(c, grid, Δt)
    i, j, k = @index(Global, NTuple)
    @inbounds c[i, j, k] += Δt * ∇²ᶜᶜᶜ(i, j, k, grid, c)
end

function diffuse!(c, Δt)
    grid = c.grid
    arch = grid.architecture
    ev = launch!(arch, grid, :xyz, _diffuse!, c, grid, Δt)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)

@btime lots_of_steps!(c, 0.01)
```
yields
```
447.763 ms (9832300 allocations: 975.37 MiB)  # KA 0.8
499.522 ms (9832300 allocations: 1.81 GiB)    # KA 0.7
```
Adding `fill_halo_regions!` doesn't change much either.
I found something. It's hard to provide code since I hacked apart the code base for this, so I'll try to describe it.
I reduced `calculate_tendencies!` to just calculating the `u` tendency. I then wrote two methods:

```julia
u_velocity_tendency(i, j, k, grid) = zero(grid)
```
and
```julia
@inline function u_velocity_tendency(i, j, k, grid,
                                     advection,
                                     coriolis,
                                     stokes_drift,
                                     closure,
                                     u_immersed_bc,
                                     buoyancy,
                                     background_fields,
                                     velocities,
                                     tracers,
                                     auxiliary_fields,
                                     diffusivities,
                                     forcings,
                                     hydrostatic_pressure,
                                     clock)

    return zero(grid)
end
```
Even though these functions have identical output, applying `launch!` to the "long argument list" version is 300x slower. So the problem is in `launch!` somewhere?
Maybe splatting is the issue, since we write

```julia
""" Calculate the right-hand-side of the u-velocity equation. """
@kernel function calculate_Gu!(Gu, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds Gu[i, j, k] = u_velocity_tendency(i, j, k, args...)
end
```

?
Ok... maybe this is a minimal example:
```julia
using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.BoundaryConditions: fill_halo_regions!
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ
using KernelAbstractions: @kernel, @index
using BenchmarkTools

@inline laplacian(i, j, k, grid, c, args...) = ∇²ᶜᶜᶜ(i, j, k, grid, c)

# Pass grid into the kernel explicitly rather than relying on the global
@kernel function _diffuse!(c, grid, Δt, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds c[i, j, k] += Δt * laplacian(i, j, k, grid, c, args...)
end

function diffuse!(c, Δt)
    grid = c.grid
    Nargs = 2 # this is the key
    args = Tuple(grid for i = 1:Nargs)
    arch = grid.architecture
    fill_halo_regions!(c)
    ev = launch!(arch, grid, :xyz, _diffuse!, c, grid, Δt, args...)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)

@btime lots_of_steps!(c, 0.01)
```
Ok, so with nothing different except the KA version, this is what we find:
| Nargs | KA 0.8 | KA 0.72 |
|---|---|---|
| 1 | 479.273 ms (9847200 allocations: 977.66 MiB) | 686.707 ms (11485600 allocations: 2.57 GiB) |
| 2 | 1.281 s (14801200 allocations: 4.75 GiB) | 840.756 ms (13124400 allocations: 3.32 GiB) |
| 3 | 1.399 s (16447200 allocations: 5.52 GiB) | 981.717 ms (14764000 allocations: 4.08 GiB) |
So the likely answer here is that CPU execution is no longer running through my custom Cassette infrastructure, which inlined everything even on the CPU. So you are hitting the normal case of Julia deciding that it isn't worth inlining complicated functions.
The benefits of not using Cassette are that backtraces are useful (for errors and profiling) and that compile time is much improved.
There is a long-term fix that one could explore... I don't currently have time for this, but it might make an interesting UROP project: use a custom host compiler via GPUCompiler/LLVM.jl/AbstractInterpreter, and maybe OpaqueClosure, to mimic the compilation of the GPU code.
In some ways this comes back to the fundamental question: what is the point of KernelAbstractions CPU support? I originally intended it only for making debugging easier...
But folks seem to be depending on it as a performance solution... I think it is feasible to get there, but it would require quite a bit of time and effort.
Ok, so if we shouldn't be using KernelAbstractions for performance on the CPU, then I think this means we should write our own CPU infrastructure and rely on KA only for the GPU? Is that what you recommend?
> Is that what you recommend?
That wouldn't solve your problem here. KA gives you reasonable performance on the CPU, but since KA 0.8 its execution story on the CPU is much closer to the rest of Julia, and it doesn't play special tricks.
Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?
Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.
> in developing our own CPU infrastructure
My point is that your own CPU infrastructure will encounter these same issues, unless you are willing to do large amounts of work, in which case I would invite you to improve KA instead.
Re-posting from #3026... that PR solved performance problems with `NonhydrostaticModel`, but `HydrostaticFreeSurfaceModel` is still roughly 2x slower than when using KA 0.7.2. Here's a simple benchmark:
```julia
using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1),
                       topology=(Periodic, Periodic, Bounded))

model = HydrostaticFreeSurfaceModel(; grid, momentum_advection=WENO(), tracer_advection=WENO())

ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)
```
Results:

```
10.220 s (85845109 allocations: 37.94 GiB)  # main
 6.284 s (66184308 allocations: 16.31 GiB)  # main with KA downgraded to 0.7.2
```
cc @simone-silvestri
> Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?
>
> Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.
To follow up with @vchuravy: it seems that rewriting just some of the code was sufficient, so we are (probably) in the clear! The lesson learned is that we cannot slurp / splat `@kernel` function arguments, because it prevents the kernel code from being inlined.
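To make the lesson concrete, here is a generic sketch of the refactor (hypothetical names, not the actual Oceananigans kernels; in plain Julia both versions compile fine, but inside a KA `@kernel` the slurp/splat was the pattern that defeated inlining):

```julia
# Hypothetical sketch of the refactor; not the actual Oceananigans kernels.
tendency(i, advection, coriolis) = advection[i] + coriolis[i]

# Before: the kernel slurps `args...` and re-splats into the RHS function.
function kernel_splat!(G, args...)
    for i in eachindex(G)
        @inbounds G[i] = tendency(i, args...)
    end
    return G
end

# After: every argument is listed explicitly, so the call is concrete and
# the RHS function can be inlined into the kernel body.
function kernel_explicit!(G, advection, coriolis)
    for i in eachindex(G)
        @inbounds G[i] = tendency(i, advection, coriolis)
    end
    return G
end

advection, coriolis = rand(8), rand(8)
G1, G2 = zeros(8), zeros(8)
kernel_splat!(G1, advection, coriolis)
kernel_explicit!(G2, advection, coriolis)
@assert G1 == G2  # identical results; only the calling convention differs
```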
I get:

```
12.372 s (119521270 allocations: 70.48 GiB)  # main
 8.320 s (66686606 allocations: 16.40 GiB)   # main w/ downgraded KA 0.7.3 & CUDAKernels 0.3.3
```

That's a 50% slowdown and many more allocations (if that matters).
I think maybe we caught most of the problem kernels but not all...
Does it change with different solvers? I'll do some testing today to try to snoop out the issue
The issues with the `HydrostaticFreeSurfaceModel` are the tendency kernels. The difference with the non-hydrostatic model is that we do not know a priori which RHS function to call (for example, CATKE has an `:e` tracer that requires a different RHS function, and the same goes for a one-equation parameterization of mesoscales that evolves an additional tracer equation for the mesoscale energy `:K`).
Our solution now is to infer the RHS function and pass it as an argument to the kernel. Apparently, this prevents specialization. I'll come up with a solution today.
Also, the implicit vertical solver seems to affect performance. I would guess that it is because we are passing functions as arguments to the kernel.
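One standard Julia mitigation for function-valued arguments (a generic sketch with hypothetical names, not necessarily what was done here): force specialization with a `::F where F` signature, so a separate, fully specialized method instance is compiled per concrete RHS function rather than treating it as an opaque `Function`:

```julia
# Hypothetical sketch: forcing specialization on a function-valued argument.
rhs_default(i, c) = 2c[i]
rhs_tke(i, c)     = 3c[i]   # e.g. a different RHS for a TKE-like tracer

# The `rhs::F` type parameter makes Julia compile a specialized method
# instance per RHS function (Julia's heuristics may otherwise avoid
# specializing on ::Function arguments that are only passed through).
function apply_rhs!(G, rhs::F, c) where F
    for i in eachindex(G)
        @inbounds G[i] = rhs(i, c)
    end
    return G
end

c = [1.0, 2.0, 3.0]
@assert apply_rhs!(similar(c), rhs_default, c) == [2.0, 4.0, 6.0]
@assert apply_rhs!(similar(c), rhs_tke, c)     == [3.0, 6.0, 9.0]
```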
I observe a discontinuous doubling of docs building time, from 3 hours at
https://buildkite.com/clima/oceananigans/builds/9920#01862a69-61ee-4e59-82b4-5c4b4684e7b6
to 6 hours after "Merge pull request #2899 from CliMA/vc/ka_upgrade":
https://buildkite.com/clima/oceananigans/builds/9935#01862d0c-e009-4a05-a8bb-023d1159a2ae
cc @simone-silvestri, @vchuravy