CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License

WENO5 is much slower on GPUs after Julia 1.6 upgrade #1764

Closed tomchor closed 3 years ago

tomchor commented 3 years ago

A few weeks ago I noticed that my scripts were much slower after the Julia 1.6 upgrade (which is preventing me from upgrading). I thought it was due to my Julia 1.6 installation, but after some tests I now think it's an Oceananigans issue, specifically with the WENO5 scheme.

I ran the MWE below in both Julia 1.5 (with Oceananigans 0.57.1) and Julia 1.6 (I tried several Oceananigans versions, but for this example I'm using 0.58.5) using GPUs, and the speed difference is huge. The interesting part is that the difference only appears when I use WENO5 with a GPU. With the 2nd-order centered scheme there is no significant time difference (I haven't tried other schemes), and when I run the script on CPUs the difference also appears to be small.

Here's the script:

using Oceananigans
using Oceananigans.Units
using CUDA: has_cuda
Nx, Ny, Nz = 128, 1600, 64

if has_cuda()
    arch = GPU()
else
    arch = CPU()
    Nx = Nx ÷ 4
    Ny = Ny ÷ 4
    Nz = Nz ÷ 4
end

topology = (Periodic, Bounded, Bounded)
grid = RegularRectilinearGrid(size=(Nx, Ny, Nz),
                              x=(0, 200),
                              y=(0, 2000),
                              z=(-100, 0),
                              topology=topology)
println("\n", grid, "\n")

model = IncompressibleModel(architecture = arch,
                            grid = grid,
                            advection = WENO5(),
                            timestepper = :RungeKutta3,
                            tracers = nothing,
                            buoyancy = nothing,
                            closure = nothing)
println("\n", model, "\n")

start_time = 1e-9*time_ns()
using Oceanostics: SingleLineProgressMessenger
simulation = Simulation(model, Δt=10seconds,
                        stop_time=10hours,
                        wall_time_limit=23.5hours,
                        iteration_interval=5,
                        progress=SingleLineProgressMessenger(LES=false, initial_wall_time_seconds=start_time),
                        stop_iteration=Inf,)

println("\n", simulation, "\n")
@info "---> Starting run!\n"
run!(simulation, pickup=false)

The output for Julia 1.5:

[ Info: ---> Starting run!
[ Info: [000.14%] i:      5,     time: 50.000 seconds,     Δt: 10 seconds,     wall time: 1.447 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.28%] i:     10,     time: 1.667 minutes,     Δt: 10 seconds,     wall time: 1.612 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.42%] i:     15,     time: 2.500 minutes,     Δt: 10 seconds,     wall time: 1.751 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.56%] i:     20,     time: 3.333 minutes,     Δt: 10 seconds,     wall time: 1.890 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.69%] i:     25,     time: 4.167 minutes,     Δt: 10 seconds,     wall time: 2.028 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.83%] i:     30,     time:  5 minutes,     Δt: 10 seconds,     wall time: 2.167 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.97%] i:     35,     time: 5.833 minutes,     Δt: 10 seconds,     wall time: 2.307 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [001.11%] i:     40,     time: 6.667 minutes,     Δt: 10 seconds,     wall time: 2.446 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [001.25%] i:     45,     time: 7.500 minutes,     Δt: 10 seconds,     wall time: 2.585 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [001.39%] i:     50,     time: 8.333 minutes,     Δt: 10 seconds,     wall time: 2.724 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [001.53%] i:     55,     time: 9.167 minutes,     Δt: 10 seconds,     wall time: 2.863 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [001.67%] i:     60,     time: 10.000 minutes,     Δt: 10 seconds,     wall time: 3.002 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [001.81%] i:     65,     time: 10.833 minutes,     Δt: 10 seconds,     wall time: 3.141 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [001.94%] i:     70,     time: 11.667 minutes,     Δt: 10 seconds,     wall time: 3.280 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [002.08%] i:     75,     time: 12.500 minutes,     Δt: 10 seconds,     wall time: 3.419 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [002.22%] i:     80,     time: 13.333 minutes,     Δt: 10 seconds,     wall time: 3.558 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [002.36%] i:     85,     time: 14.167 minutes,     Δt: 10 seconds,     wall time: 3.697 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [002.50%] i:     90,     time: 15.000 minutes,     Δt: 10 seconds,     wall time: 3.836 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [002.64%] i:     95,     time: 15.833 minutes,     Δt: 10 seconds,     wall time: 3.975 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [002.78%] i:    100,     time: 16.667 minutes,     Δt: 10 seconds,     wall time: 4.114 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [002.92%] i:    105,     time: 17.500 minutes,     Δt: 10 seconds,     wall time: 4.253 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [003.06%] i:    110,     time: 18.333 minutes,     Δt: 10 seconds,     wall time: 4.392 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [003.19%] i:    115,     time: 19.167 minutes,     Δt: 10 seconds,     wall time: 4.531 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [003.33%] i:    120,     time: 20.000 minutes,     Δt: 10 seconds,     wall time: 4.670 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [003.47%] i:    125,     time: 20.833 minutes,     Δt: 10 seconds,     wall time: 4.809 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [003.61%] i:    130,     time: 21.667 minutes,     Δt: 10 seconds,     wall time: 4.948 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [003.75%] i:    135,     time: 22.500 minutes,     Δt: 10 seconds,     wall time: 5.087 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [003.89%] i:    140,     time: 23.333 minutes,     Δt: 10 seconds,     wall time: 5.226 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [004.03%] i:    145,     time: 24.167 minutes,     Δt: 10 seconds,     wall time: 5.365 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [004.17%] i:    150,     time: 25.000 minutes,     Δt: 10 seconds,     wall time: 5.504 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [004.31%] i:    155,     time: 25.833 minutes,     Δt: 10 seconds,     wall time: 5.643 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [004.44%] i:    160,     time: 26.667 minutes,     Δt: 10 seconds,     wall time: 5.782 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [004.58%] i:    165,     time: 27.500 minutes,     Δt: 10 seconds,     wall time: 5.921 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [004.72%] i:    170,     time: 28.333 minutes,     Δt: 10 seconds,     wall time: 6.060 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [004.86%] i:    175,     time: 29.167 minutes,     Δt: 10 seconds,     wall time: 6.199 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00

For Julia 1.6, this is the output after the same amount of wall time:

[ Info: ---> Starting run!
[ Info: [000.14%] i:      5,     time: 50.000 seconds,     Δt: 10 seconds,     wall time: 3.453 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.28%] i:     10,     time: 1.667 minutes,     Δt: 10 seconds,     wall time: 4.388 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.42%] i:     15,     time: 2.500 minutes,     Δt: 10 seconds,     wall time: 5.269 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00
[ Info: [000.56%] i:     20,     time: 3.333 minutes,     Δt: 10 seconds,     wall time: 6.150 minutes,     adv CFL: 0.00e+00,     diff CFL: 0.00e+00

Has anyone else experienced this?

Any ideas as to what might be causing it? I really would like to upgrade my production-ready scripts, but unfortunately there's no way I can do so until this issue is resolved :/

ali-ramadhan commented 3 years ago

Thanks for reporting this! Definitely weird... I'll see if I can reproduce on a different machine.

cc @xkykai this could be the reason your Supercloud simulations seemed slower?

glwagner commented 3 years ago

It might be useful to know the GPU and CUDA version being used (not sure).

ali-ramadhan commented 3 years ago

I ran the advection scheme benchmarks and, comparing with some older Julia 1.5 results, it definitely is slower on the GPU: WENO5 used to be only ~3x slower than CenteredSecondOrder, but now it's 26x slower. All other advection schemes are as fast as they used to be.

It's not slow enough to be CUDA scalar operations, so maybe the GPU compiler changed in some way such that kernels calling/using WENO5 now compile to suboptimal machine code?

@maleadt might have some ideas/suggestions but maybe we just have to profile and find the new bottleneck?
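One low-effort way to check the "suboptimal machine code" hypothesis without a GPU is to dump the generated IR for a small stencil function and diff it between Julia 1.5 and 1.6. A minimal sketch using Julia's standard InteractiveUtils (the toy stencil below is a stand-in, not Oceananigans' actual WENO kernels):

```julia
using InteractiveUtils: code_llvm

# Toy 5-point reconstruction standing in for a WENO-style stencil;
# the real kernels are far larger but would regress the same way.
@inline stencil(a, b, c, d, e) = (2a + 5b + 10c + 5d + 2e) / 24

# Capture the LLVM IR as a string so it can be saved to a file
# and diffed between Julia versions.
ir = sprint(io -> code_llvm(io, stencil, NTuple{5, Float64}))

# Boxing in the IR is a red flag for a codegen/inference regression.
println(occursin("jl_box", ir) ? "boxing detected!" : "IR looks clean")
```

The same pattern works with code_native to compare the final machine code, or with CUDA.jl's @device_code_llvm for the actual GPU kernels.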


              Advection schemes relative performance (GPU)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Architectures β”‚                Schemes β”‚ slowdown β”‚  memory β”‚  allocs β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚           GPU β”‚    CenteredFourthOrder β”‚  1.38356 β”‚ 1.05911 β”‚ 1.60067 β”‚
β”‚           GPU β”‚    CenteredSecondOrder β”‚      1.0 β”‚     1.0 β”‚     1.0 β”‚
β”‚           GPU β”‚ UpwindBiasedFifthOrder β”‚  1.53145 β”‚  1.0868 β”‚ 1.88203 β”‚
β”‚           GPU β”‚ UpwindBiasedThirdOrder β”‚  1.30611 β”‚ 1.04135 β”‚ 1.42012 β”‚
β”‚           GPU β”‚                  WENO5 β”‚  26.1429 β”‚ 4.68526 β”‚ 38.4468 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Compare with: https://github.com/CliMA/Oceananigans.jl/pull/1169#issuecomment-725471594

glwagner commented 3 years ago

Is the slowdown different for different topologies? If so, that might be a clue.

CenteredSecondOrder is special because it directly defines the function advective_momentum_flux_Uu:

https://github.com/CliMA/Oceananigans.jl/blob/383173d11a0c96182a4349fc1e33755207bf0886/src/Advection/centered_second_order.jl#L11

The other schemes define symmetric_interpolate_* and left_biased_interpolate_*, etc. For example, CenteredFourthOrder:

https://github.com/CliMA/Oceananigans.jl/blob/383173d11a0c96182a4349fc1e33755207bf0886/src/Advection/centered_fourth_order.jl#L21

These functions are filtered through an if statement when the dimension is Bounded; see:

https://github.com/CliMA/Oceananigans.jl/blob/master/src/Advection/topologically_conditional_interpolation.jl

glwagner commented 3 years ago

https://github.com/CliMA/Oceananigans.jl/pull/1733 changes the advection schemes a bit and adds the "boundary buffer" as an integer that's known at compile time.

Might be worth testing that PR since it changes the topological condition to use that type information. On master we have

https://github.com/CliMA/Oceananigans.jl/blob/383173d11a0c96182a4349fc1e33755207bf0886/src/Advection/topologically_conditional_interpolation.jl#L17-L22

whereas on https://github.com/CliMA/Oceananigans.jl/pull/1733 it's

https://github.com/CliMA/Oceananigans.jl/blob/17c01a9f3eca4e8576458f6c6f444f9cd2278cc3/src/Advection/topologically_conditional_interpolation.jl#L17-L22
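The gist of the change can be sketched on the CPU with hypothetical minimal types (not Oceananigans' actual ones): on master the buffer width reaches the branch as a run-time value, whereas in #1733 it is a type parameter the compiler can constant-fold:

```julia
# Hypothetical sketch of the two patterns (not Oceananigans' real types).

# Run-time value: the near-boundary check is an ordinary branch the
# compiler must keep in every interpolation call.
near_boundary(i, N, buffer) = (i < buffer + 1) || (i > N - buffer)

# Compile-time value: the buffer is baked into the scheme's type, so
# boundary_buffer is a constant that can be propagated and the branch
# simplified (or elided) once the call is inlined.
struct FifthOrderScheme{Buffer} end
boundary_buffer(::FifthOrderScheme{B}) where {B} = B

scheme = FifthOrderScheme{2}()
near_boundary(50, 100, boundary_buffer(scheme))
```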

francispoulin commented 3 years ago

Yes, I have seen the same thing, as discussed in #1722.

Previously, when looking at speedup on fine grids, say 8000^2, I found that the speedup on GPUs was almost 400x compared to the 5th-order upwinding. When @hennyg888 ran the same benchmark on Julia 1.6, there was still a speedup, but only ~200x. CPU speeds on 1.5 and 1.6 were very similar. That means that WENO5, in this case, is about half as fast as it used to be.

Sadly, the old data is lost in Slack oblivion, but I suppose we could run the old code and obtain those results again if there were a desire to do so. But I think this discussion is more fruitful if we look at the nuts and bolts of the method.

Thanks @tomchor for bringing this up as well!

ali-ramadhan commented 3 years ago

I ran the benchmark again with a triply periodic topology, but it's still much slower, so the issue might be deeper than the logic in topologically_conditional_interpolation.jl.

              Advection schemes relative performance (GPU)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Architectures β”‚                Schemes β”‚ slowdown β”‚  memory β”‚  allocs β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚           GPU β”‚    CenteredFourthOrder β”‚  1.50326 β”‚ 1.06836 β”‚ 1.69674 β”‚
β”‚           GPU β”‚    CenteredSecondOrder β”‚      1.0 β”‚     1.0 β”‚     1.0 β”‚
β”‚           GPU β”‚ UpwindBiasedFifthOrder β”‚  1.69787 β”‚ 1.09472 β”‚ 1.96539 β”‚
β”‚           GPU β”‚ UpwindBiasedThirdOrder β”‚  1.39899 β”‚ 1.05598 β”‚ 1.57057 β”‚
β”‚           GPU β”‚                  WENO5 β”‚  33.2728 β”‚ 5.21273 β”‚ 43.9286 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
diff --git a/benchmark/benchmark_advection_schemes.jl b/benchmark/benchmark_advection_schemes.jl
index 81b083e1..e6ba8cd6 100644
--- a/benchmark/benchmark_advection_schemes.jl
+++ b/benchmark/benchmark_advection_schemes.jl
@@ -7,7 +7,8 @@ using Benchmarks
 # Benchmark function

 function benchmark_advection_scheme(Arch, Scheme)
-    grid = RegularRectilinearGrid(size=(192, 192, 192), extent=(1, 1, 1))
+    topo = (Periodic, Periodic, Periodic)
+    grid = RegularRectilinearGrid(topology=topo, size=(192, 192, 192), extent=(1, 1, 1))
     model = IncompressibleModel(architecture=Arch(), grid=grid, advection=Scheme())

     time_step!(model, 1) # warmup
tomchor commented 3 years ago

Yeah, I've run quite a few tests at this point, and the issue seems persistent and (as far as I can tell) independent of topology (although I haven't tried every single topology option).

Thanks for looking into this, btw. Let's hope it's something simple. Let me know how I can help.

glwagner commented 3 years ago

I guess another thing to check is whether there is a similar slowdown with biharmonic diffusivity. In that case we might be able to pin the problem on the larger stencil, perhaps.

francispoulin commented 3 years ago

In #1722 we found, using 5th-order upwinding on a 128^3 grid, that the speedup was about 80x. Is the above really saying it's 1.69? If so, things have certainly changed from 28 days ago.

ali-ramadhan commented 3 years ago

@francispoulin Ah, 1.69 is how much slower UpwindBiasedFifthOrder is on the GPU than CenteredSecondOrder (also on the GPU). Below are the raw benchmarks and the CPU -> GPU speedups, which show a speedup of ~114x for UpwindBiasedFifthOrder on 192^3; that should agree better with your figure of ~80x.

Actually, looking at the advection scheme benchmarks more closely, it looks like WENO5 is incurring lots of CPU allocations. According to https://github.com/CliMA/Oceananigans.jl/pull/1169#issuecomment-725471594 the choice of advection scheme did not use to change the number of allocations, but now it does, and WENO5 allocates much more memory than the other schemes.
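Allocation counts like these are a classic symptom of a type instability somewhere in the kernel launch path. A CPU-side sketch (illustrative only, not Oceananigans code) of how an abstractly-typed value turns into per-iteration allocations that @allocated can measure:

```julia
# Type-stable accumulator: the loop runs essentially allocation-free.
function sum_stable(xs)
    s = 0.0
    for x in xs
        s += x
    end
    return s
end

# Abstractly-typed accumulator: every update allocates a fresh box --
# the same symptom the WENO5 benchmarks show at much larger scale.
function sum_boxed(xs)
    s = Ref{Any}(0.0)
    for x in xs
        s[] = s[] + x
    end
    return s[]
end

const data = rand(10_000)
sum_stable(data); sum_boxed(data)        # warm up (compile first)
stable_bytes = @allocated sum_stable(data)
boxed_bytes  = @allocated sum_boxed(data)
@show stable_bytes boxed_bytes           # boxed version allocates far more
```

If something similar is happening here, @code_warntype on the flux kernels should show red Any annotations where inference gave up.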

@glwagner I posted the turbulence closure benchmarks below and they seem fine/unchanged.


Advection scheme benchmarks

                                                 Advection scheme benchmarks                                                                                                                                                                                
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                                                                                               
β”‚ Architectures β”‚                Schemes β”‚        min β”‚     median β”‚       mean β”‚        max β”‚    memory β”‚ allocs β”‚ samples β”‚                                                                                                                              
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€                                                                                                                              
β”‚           CPU β”‚    CenteredFourthOrder β”‚    1.541 s β”‚    1.545 s β”‚    1.545 s β”‚    1.548 s β”‚  1.61 MiB β”‚   2096 β”‚       4 β”‚                                                                                                                              
β”‚           CPU β”‚    CenteredSecondOrder β”‚    1.029 s β”‚    1.035 s β”‚    1.036 s β”‚    1.048 s β”‚  1.61 MiB β”‚   2096 β”‚       5 β”‚                                                                                                                              
β”‚           CPU β”‚ UpwindBiasedFifthOrder β”‚    2.250 s β”‚    2.251 s β”‚    2.251 s β”‚    2.252 s β”‚  1.61 MiB β”‚   2096 β”‚       3 β”‚                                                                                                                              
β”‚           CPU β”‚ UpwindBiasedThirdOrder β”‚    1.589 s β”‚    1.594 s β”‚    1.594 s β”‚    1.599 s β”‚  1.61 MiB β”‚   2096 β”‚       4 β”‚                                                                                                                              
β”‚           CPU β”‚                  WENO5 β”‚    6.339 s β”‚    6.339 s β”‚    6.339 s β”‚    6.339 s β”‚  1.61 MiB β”‚   2096 β”‚       1 β”‚                                                                                                                              
β”‚           GPU β”‚    CenteredFourthOrder β”‚  17.309 ms β”‚  17.419 ms β”‚  18.107 ms β”‚  24.384 ms β”‚  2.71 MiB β”‚  27650 β”‚      10 β”‚                                                                                                                              
β”‚           GPU β”‚    CenteredSecondOrder β”‚  10.369 ms β”‚  11.588 ms β”‚  11.472 ms β”‚  11.642 ms β”‚  2.53 MiB β”‚  16296 β”‚      10 β”‚
β”‚           GPU β”‚ UpwindBiasedFifthOrder β”‚  19.561 ms β”‚  19.675 ms β”‚  20.975 ms β”‚  32.694 ms β”‚  2.77 MiB β”‚  32028 β”‚      10 β”‚                 
β”‚           GPU β”‚ UpwindBiasedThirdOrder β”‚  16.131 ms β”‚  16.211 ms β”‚  16.806 ms β”‚  22.239 ms β”‚  2.68 MiB β”‚  25594 β”‚      10 β”‚
β”‚           GPU β”‚                  WENO5 β”‚ 382.916 ms β”‚ 385.558 ms β”‚ 385.368 ms β”‚ 386.709 ms β”‚ 13.21 MiB β”‚ 715860 β”‚      10 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          Advection schemes CPU to GPU speedup                                    
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          
β”‚                Schemes β”‚ speedup β”‚  memory β”‚  allocs β”‚          
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€          
β”‚    CenteredFourthOrder β”‚ 88.7159 β”‚  1.6849 β”‚ 13.1918 β”‚          
β”‚    CenteredSecondOrder β”‚ 89.3514 β”‚ 1.57709 β”‚ 7.77481 β”‚          
β”‚ UpwindBiasedFifthOrder β”‚   114.4 β”‚ 1.72647 β”‚ 15.2805 β”‚            
β”‚ UpwindBiasedThirdOrder β”‚ 98.3274 β”‚ 1.66538 β”‚ 12.2109 β”‚                             
β”‚                  WENO5 β”‚ 16.4404 β”‚ 8.22094 β”‚ 341.536 β”‚                         
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Turbulence closure benchmarks

                  Turbulence closures relative performance (GPU)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Architectures β”‚                         Closures β”‚ slowdown β”‚  memory β”‚  allocs β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚           GPU β”‚ AnisotropicBiharmonicDiffusivity β”‚   1.5313 β”‚ 1.03189 β”‚ 1.54697 β”‚
β”‚           GPU β”‚           AnisotropicDiffusivity β”‚  1.05623 β”‚ 1.00582 β”‚ 1.01779 β”‚
β”‚           GPU β”‚    AnisotropicMinimumDissipation β”‚  1.46265 β”‚ 1.19908 β”‚ 1.26817 β”‚
β”‚           GPU β”‚             IsotropicDiffusivity β”‚  1.13134 β”‚ 1.00607 β”‚ 1.07995 β”‚
β”‚           GPU β”‚                          Nothing β”‚      1.0 β”‚     1.0 β”‚     1.0 β”‚
β”‚           GPU β”‚                 SmagorinskyLilly β”‚  1.41905 β”‚ 1.30373 β”‚ 1.18683 β”‚
β”‚           GPU β”‚              TwoDimensionalLeith β”‚  1.11312 β”‚ 1.06941 β”‚ 1.06147 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
francispoulin commented 3 years ago

Thanks @ali-ramadhan for clarifying; that makes much more sense. The speedup for WENO5 is even worse than what we saw last month, when it was pretty much the same as UpwindBiasedFifthOrder's. Hmm....

glwagner commented 3 years ago

I'm sorry, I misinterpreted the results @ali-ramadhan posted. I thought that CenteredSecondOrder was 1.0x slower with julia 1.6 than with 1.5 (and that small slowdowns were observed for the other schemes, which is why I recommended testing the biharmonic scheme.) Now I understand that these results are all for julia 1.6; we are comparing the results with previously obtained benchmarks (not posted) for julia 1.5.

Looking at @tomchor's and @ali-ramadhan's results, then, it looks like simulations with WENO5 are running approximately 6-8 times slower on julia 1.6 than they were on julia 1.5, while other advection schemes (and closures) are unchanged --- correct?

Is the CPU performance of WENO5 roughly equivalent between julia 1.5 and julia 1.6?

tomchor commented 3 years ago

Yes, from what I've tested so far the CPU performance seems roughly equivalent between versions, although it would be good if someone else validated that as well.


tomchor commented 3 years ago

Has anyone tried, or does anyone know how, to profile these functions? I feel it would be much easier to find out what's wrong if we profiled WENO5 in Julia 1.5 and 1.6 separately.

glwagner commented 3 years ago

Profiling is a very good idea. It probably makes sense to use an integrated / application profiler (rather than simply timing functions), because WENO5 is itself composed of many small functions and we don't know which one is the bottleneck.

I have never tried profiling on the GPU, but there's some info here: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/

Specifically I think we need to install NSight: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/#NVIDIA-Nsight-Systems
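For reference, a typical invocation looks something like this (a sketch based on the Nsight Systems CLI; script.jl is a placeholder for the MWE above):

```shell
# Collect a CUDA trace of one short run and write an Nsight Systems
# report that can be opened in the GUI (nsys-ui) afterwards.
nsys profile --trace=cuda,nvtx --output=weno5_report julia --project script.jl
```

Kernel-level timings in the report should make it obvious whether one WENO5 kernel got slower or whether many extra small launches appeared.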

glwagner commented 3 years ago

Here's something: https://github.com/CliMA/Oceananigans.jl/pull/1770

I'm trying to run the benchmarks but they take a while, so that's in progress.

glwagner commented 3 years ago

I believe #1770 does the trick:

[2021/06/25 18:04:55.066] INFO  Writing Advection_schemes_relative_performance_(CPU).html...
              Advection schemes relative performance (GPU)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Architectures β”‚                Schemes β”‚ slowdown β”‚  memory β”‚  allocs β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚           GPU β”‚    CenteredFourthOrder β”‚  1.36629 β”‚ 1.07711 β”‚ 1.66944 β”‚
β”‚           GPU β”‚    CenteredSecondOrder β”‚      1.0 β”‚     1.0 β”‚     1.0 β”‚
β”‚           GPU β”‚ UpwindBiasedFifthOrder β”‚  1.53522 β”‚ 1.11266 β”‚  1.9781 β”‚
β”‚           GPU β”‚ UpwindBiasedThirdOrder β”‚  1.31322 β”‚ 1.03505 β”‚ 1.30432 β”‚
β”‚           GPU β”‚                  WENO5 β”‚  1.84272 β”‚  1.1889 β”‚ 2.64008 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

It would be good to get confirmation from someone.

tomchor commented 3 years ago

Confirmed and approved. It would be good to release a new version with this ASAP.

hennyg888 commented 3 years ago

Just ran benchmark_shallow_water_model.jl with the fixes introduced in #1770. There is indeed a notable increase in speedup compared to the results with no specified advection scheme shown in #1722: the CPU-to-GPU speedup went up from ~180x to ~400x.

Oceananigans v0.58.1
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  EBVERSIONJULIA = 1.6.0
  JULIA_DEPOT_PATH = :
  EBROOTJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0
  EBDEVELJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0/easybuild/avx2-Core-julia-1.6.0-easybuild-devel
  JULIA_LOAD_PATH = :
  GPU: Tesla V100-SXM2-32GB

                                              Shallow water model benchmarks
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Architectures β”‚ Float_types β”‚    Ns β”‚        min β”‚     median β”‚       mean β”‚        max β”‚    memory β”‚ allocs β”‚ samples β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚           CPU β”‚     Float64 β”‚    32 β”‚   2.677 ms β”‚   2.876 ms β”‚   3.047 ms β”‚   4.806 ms β”‚  1.36 MiB β”‚   2253 β”‚      10 β”‚
β”‚           CPU β”‚     Float64 β”‚    64 β”‚   5.795 ms β”‚   5.890 ms β”‚   6.073 ms β”‚   7.770 ms β”‚  1.36 MiB β”‚   2255 β”‚      10 β”‚
β”‚           CPU β”‚     Float64 β”‚   128 β”‚  16.979 ms β”‚  17.350 ms β”‚  17.578 ms β”‚  19.993 ms β”‚  1.36 MiB β”‚   2255 β”‚      10 β”‚
β”‚           CPU β”‚     Float64 β”‚   256 β”‚  62.543 ms β”‚  63.222 ms β”‚  63.544 ms β”‚  67.347 ms β”‚  1.36 MiB β”‚   2255 β”‚      10 β”‚
β”‚           CPU β”‚     Float64 β”‚   512 β”‚ 250.149 ms β”‚ 251.023 ms β”‚ 251.092 ms β”‚ 252.389 ms β”‚  1.36 MiB β”‚   2315 β”‚      10 β”‚
β”‚           CPU β”‚     Float64 β”‚  1024 β”‚ 990.901 ms β”‚ 993.115 ms β”‚ 993.360 ms β”‚ 996.091 ms β”‚  1.36 MiB β”‚   2315 β”‚       6 β”‚
β”‚           CPU β”‚     Float64 β”‚  2048 β”‚    4.002 s β”‚    4.004 s β”‚    4.004 s β”‚    4.007 s β”‚  1.36 MiB β”‚   2315 β”‚       2 β”‚
β”‚           CPU β”‚     Float64 β”‚  4096 β”‚   16.371 s β”‚   16.371 s β”‚   16.371 s β”‚   16.371 s β”‚  1.36 MiB β”‚   2315 β”‚       1 β”‚
β”‚           CPU β”‚     Float64 β”‚  8192 β”‚   64.657 s β”‚   64.657 s β”‚   64.657 s β”‚   64.657 s β”‚  1.36 MiB β”‚   2315 β”‚       1 β”‚
β”‚           CPU β”‚     Float64 β”‚ 16384 β”‚  290.423 s β”‚  290.423 s β”‚  290.423 s β”‚  290.423 s β”‚  1.36 MiB β”‚   2315 β”‚       1 β”‚
β”‚           GPU β”‚     Float64 β”‚    32 β”‚   3.468 ms β”‚   3.656 ms β”‚   3.745 ms β”‚   4.695 ms β”‚  1.82 MiB β”‚   5687 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚    64 β”‚   3.722 ms β”‚   3.903 ms β”‚   4.050 ms β”‚   5.671 ms β”‚  1.82 MiB β”‚   5687 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚   128 β”‚   3.519 ms β”‚   3.808 ms β”‚   4.042 ms β”‚   6.372 ms β”‚  1.82 MiB β”‚   5687 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚   256 β”‚   3.822 ms β”‚   4.153 ms β”‚   4.288 ms β”‚   5.810 ms β”‚  1.82 MiB β”‚   5687 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚   512 β”‚   4.637 ms β”‚   4.932 ms β”‚   4.961 ms β”‚   5.728 ms β”‚  1.82 MiB β”‚   5765 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚  1024 β”‚   3.240 ms β”‚   3.424 ms β”‚   3.527 ms β”‚   4.553 ms β”‚  1.82 MiB β”‚   5799 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚  2048 β”‚  10.783 ms β”‚  10.800 ms β”‚  11.498 ms β”‚  17.824 ms β”‚  1.98 MiB β”‚  16305 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚  4096 β”‚  41.880 ms β”‚  41.911 ms β”‚  42.485 ms β”‚  47.627 ms β”‚  2.67 MiB β”‚  61033 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚  8192 β”‚ 166.751 ms β”‚ 166.800 ms β”‚ 166.847 ms β”‚ 167.129 ms β”‚  5.21 MiB β”‚ 227593 β”‚      10 β”‚
β”‚           GPU β”‚     Float64 β”‚ 16384 β”‚ 681.129 ms β”‚ 681.249 ms β”‚ 681.301 ms β”‚ 681.583 ms β”‚ 16.59 MiB β”‚ 973627 β”‚       8 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

        Shallow water model CPU to GPU speedup
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Float_types β”‚    Ns β”‚  speedup β”‚  memory β”‚  allocs β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚     Float64 β”‚    32 β”‚ 0.786715 β”‚ 1.33777 β”‚ 2.52419 β”‚
β”‚     Float64 β”‚    64 β”‚  1.50931 β”‚ 1.33774 β”‚ 2.52195 β”‚
β”‚     Float64 β”‚   128 β”‚  4.55587 β”‚ 1.33774 β”‚ 2.52195 β”‚
β”‚     Float64 β”‚   256 β”‚  15.2238 β”‚ 1.33774 β”‚ 2.52195 β”‚
β”‚     Float64 β”‚   512 β”‚  50.8995 β”‚ 1.33771 β”‚ 2.49028 β”‚
β”‚     Float64 β”‚  1024 β”‚  290.085 β”‚ 1.33809 β”‚ 2.50497 β”‚
β”‚     Float64 β”‚  2048 β”‚  370.777 β”‚ 1.45575 β”‚  7.0432 β”‚
β”‚     Float64 β”‚  4096 β”‚  390.617 β”‚ 1.95667 β”‚ 26.3641 β”‚
β”‚     Float64 β”‚  8192 β”‚  387.632 β”‚ 3.82201 β”‚ 98.3123 β”‚
β”‚     Float64 β”‚ 16384 β”‚   426.31 β”‚  12.177 β”‚ 420.573 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
glwagner commented 3 years ago

Thanks, @hennyg888.