Closed tomchor closed 3 years ago
Thanks for reporting this! Definitely weird... I'll see if I can reproduce on a different machine.
cc @xkykai this could be the reason your Supercloud simulations seemed slower?
It might be useful to know the GPU and CUDA version being used (not sure).
I ran the advection scheme benchmarks and, comparing with some older Julia 1.5 results, WENO5 is definitely slower on the GPU. It used to be only ~3x slower than CenteredSecondOrder, but now it's 26x slower. All other advection schemes are just as fast as they used to be.
Not slow enough to be CUDA scalar operations, so maybe the GPU compiler changed in some way such that kernels calling/using WENO5 now compile to suboptimal machine code?
@maleadt might have some ideas/suggestions but maybe we just have to profile and find the new bottleneck?
Advection schemes relative performance (GPU)
┌───────────────┬────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │ Schemes                │ slowdown │ memory  │ allocs  │
├───────────────┼────────────────────────┼──────────┼─────────┼─────────┤
│ GPU           │ CenteredFourthOrder    │  1.38356 │ 1.05911 │ 1.60067 │
│ GPU           │ CenteredSecondOrder    │  1.0     │ 1.0     │ 1.0     │
│ GPU           │ UpwindBiasedFifthOrder │  1.53145 │ 1.0868  │ 1.88203 │
│ GPU           │ UpwindBiasedThirdOrder │  1.30611 │ 1.04135 │ 1.42012 │
│ GPU           │ WENO5                  │ 26.1429  │ 4.68526 │ 38.4468 │
└───────────────┴────────────────────────┴──────────┴─────────┴─────────┘
Compare with: https://github.com/CliMA/Oceananigans.jl/pull/1169#issuecomment-725471594
Is the slow down different for different topologies? If so that might be a clue.
CenteredSecondOrder is special because it directly defines the function advective_momentum_flux_Uu:

The other schemes define symmetric_interpolate_* and left_biased_interpolate_*, etc. For example, CenteredFourthOrder:

These functions are filtered through an if statement when the dimension is Bounded, see:
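A schematic of this dispatch pattern, with simplified stand-ins for the functions named above (these are not the actual Oceananigans definitions, just an illustration of the structure):

```julia
# Sketch only: simplified stand-ins for the Oceananigans functions named above.
abstract type AbstractAdvectionScheme end
struct CenteredSecondOrder <: AbstractAdvectionScheme end
struct CenteredFourthOrder <: AbstractAdvectionScheme end

# CenteredSecondOrder defines the momentum flux directly...
advective_momentum_flux_Uu(::CenteredSecondOrder, U, u) = U * u

# ...whereas higher-order schemes define interpolation operators
# (symmetric_interpolate_*, left_biased_interpolate_*, ...) that a generic
# flux kernel then composes. Here, a fourth-order symmetric interpolation
# to a cell face from a four-point stencil:
symmetric_interpolate(::CenteredFourthOrder, a, b, c, d) = (7 * (b + c) - (a + d)) / 12

advective_momentum_flux_Uu(scheme::CenteredFourthOrder, U, stencil) =
    U * symmetric_interpolate(scheme, stencil...)
```

The extra layer of composition is what makes the higher-order schemes sensitive to how well the compiler inlines and specializes the interpolation calls.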
https://github.com/CliMA/Oceananigans.jl/pull/1733 changes the advection schemes a bit and adds the "boundary buffer" as an integer that's known at compile time.
Might be worth testing that PR since it changes the topological condition to use that type information. On master we have
whereas on https://github.com/CliMA/Oceananigans.jl/pull/1733 it's
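The difference between the two approaches can be sketched like this (hypothetical names and signatures; the real code in #1733 differs):

```julia
# Master-style: the boundary proximity test is an ordinary runtime branch,
# so the compiled kernel must keep both the interior and boundary paths alive.
near_boundary_runtime(i, N, buffer::Int) = i <= buffer || i > N - buffer

# #1733-style: the boundary buffer lives in the type domain, so
# boundary_buffer(scheme) is a compile-time constant and the compiler can
# specialize (or prune) the branch when generating the kernel.
struct WENOBuffered{B} end                       # hypothetical; B is the buffer width
boundary_buffer(::WENOBuffered{B}) where B = B
near_boundary(i, N, scheme) =
    i <= boundary_buffer(scheme) || i > N - boundary_buffer(scheme)
```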
Yes, I have seen the same thing, as discussed in #1722.
Previously, when looking at speedup on fine grids, say 8000^2, I found a CPU-to-GPU speedup of almost 400x using 5th-order upwinding. When @hennyg888 ran the same benchmark on Julia 1.6, there was still a speedup, but only around 200x. The CPU speeds on 1.5 and 1.6 were very similar. That means that WENO5, in this case, is about half as fast as it used to be.
Sadly, the old data is lost in Slack oblivion, but I suppose we could run the old code and obtain these results again, if there were desire to do so. But I think this discussion is more fruitful by looking at the nuts and bolts of the method.
Thanks @tomchor for bringing this up as well!
I ran the benchmark again with triply periodic but it's still much slower, so the issue might be deeper than the logic in topologically_conditional_interpolation.jl.
Advection schemes relative performance (GPU)
┌───────────────┬────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │ Schemes                │ slowdown │ memory  │ allocs  │
├───────────────┼────────────────────────┼──────────┼─────────┼─────────┤
│ GPU           │ CenteredFourthOrder    │  1.50326 │ 1.06836 │ 1.69674 │
│ GPU           │ CenteredSecondOrder    │  1.0     │ 1.0     │ 1.0     │
│ GPU           │ UpwindBiasedFifthOrder │  1.69787 │ 1.09472 │ 1.96539 │
│ GPU           │ UpwindBiasedThirdOrder │  1.39899 │ 1.05598 │ 1.57057 │
│ GPU           │ WENO5                  │ 33.2728  │ 5.21273 │ 43.9286 │
└───────────────┴────────────────────────┴──────────┴─────────┴─────────┘
diff --git a/benchmark/benchmark_advection_schemes.jl b/benchmark/benchmark_advection_schemes.jl
index 81b083e1..e6ba8cd6 100644
--- a/benchmark/benchmark_advection_schemes.jl
+++ b/benchmark/benchmark_advection_schemes.jl
@@ -7,7 +7,8 @@ using Benchmarks
# Benchmark function
function benchmark_advection_scheme(Arch, Scheme)
- grid = RegularRectilinearGrid(size=(192, 192, 192), extent=(1, 1, 1))
+ topo = (Periodic, Periodic, Periodic)
+ grid = RegularRectilinearGrid(topology=topo, size=(192, 192, 192), extent=(1, 1, 1))
model = IncompressibleModel(architecture=Arch(), grid=grid, advection=Scheme())
time_step!(model, 1) # warmup
Yeah, I've run quite a bit of tests at this point, and the issue seems persistent and (as far as I could tell) independent of topology (although I haven't tried every single topology option).
Thanks for looking into this, btw. Let's hope it's something simple. Let me know how I can help.
I guess another thing to check is if there is a similar slowdown with biharmonic diffusivity. In that case we might be able to pin the problem on the larger stencil, perhaps.
In #1722 we found, using 5th-order upwinding on a 128^3 grid, that the speedup was about 80. Is the above really saying it's 1.69? If so then things have certainly changed from 28 days ago.
@francispoulin Ah, 1.69 is how much slower UpwindBiasedFifthOrder is on the GPU compared to CenteredSecondOrder (also on the GPU). Below are the raw benchmarks and the CPU -> GPU speedups, which show a speedup of ~114x for UpwindBiasedFifthOrder on 192^3; that should agree better with your figure of ~80x.
Actually looking at the advection scheme benchmarks more closely it looks like WENO5 is incurring lots of CPU allocations. According to https://github.com/CliMA/Oceananigans.jl/pull/1169#issuecomment-725471594 changing the advection did not change the number of allocations, but now it does and WENO5 allocates much more memory than the other schemes.
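A jump in CPU-side allocations like this often points to a type instability or dynamic dispatch sneaking into the kernel launch path. A toy illustration (not Oceananigans code) of how a type-unstable return value makes every call allocate, measurable with `@allocated`:

```julia
# Toy illustration: a type-unstable stencil-weight function (returns either a
# Tuple or a Vector) forces allocation and dynamic dispatch on every call,
# while the type-stable version is allocation-free.
unstable_weights(k) = k == 1 ? (0.1, 0.6, 0.3) : [0.3, 0.6, 0.1]  # Tuple or Vector
stable_weights(k)   = k == 1 ? (0.1, 0.6, 0.3) : (0.3, 0.6, 0.1)  # always a Tuple

apply(w, s) = w[1] * s[1] + w[2] * s[2] + w[3] * s[3]

function measure()
    s = (1.0, 2.0, 3.0)
    apply(stable_weights(2), s); apply(unstable_weights(2), s)  # warm up / compile
    a_stable   = @allocated apply(stable_weights(2), s)
    a_unstable = @allocated apply(unstable_weights(2), s)
    return a_stable, a_unstable
end
```

Running the same kind of `@allocated` comparison on `time_step!` under Julia 1.5 vs 1.6 might localize where the new allocations come from.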
@glwagner I posted the turbulence closure benchmarks below and they seem fine/unchanged.
Advection scheme benchmarks
┌───────────────┬────────────────────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┬─────────┐
│ Architectures │ Schemes                │ min        │ median     │ mean       │ max        │ memory    │ allocs │ samples │
├───────────────┼────────────────────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┼─────────┤
│ CPU           │ CenteredFourthOrder    │ 1.541 s    │ 1.545 s    │ 1.545 s    │ 1.548 s    │ 1.61 MiB  │ 2096   │ 4       │
│ CPU           │ CenteredSecondOrder    │ 1.029 s    │ 1.035 s    │ 1.036 s    │ 1.048 s    │ 1.61 MiB  │ 2096   │ 5       │
│ CPU           │ UpwindBiasedFifthOrder │ 2.250 s    │ 2.251 s    │ 2.251 s    │ 2.252 s    │ 1.61 MiB  │ 2096   │ 3       │
│ CPU           │ UpwindBiasedThirdOrder │ 1.589 s    │ 1.594 s    │ 1.594 s    │ 1.599 s    │ 1.61 MiB  │ 2096   │ 4       │
│ CPU           │ WENO5                  │ 6.339 s    │ 6.339 s    │ 6.339 s    │ 6.339 s    │ 1.61 MiB  │ 2096   │ 1       │
│ GPU           │ CenteredFourthOrder    │ 17.309 ms  │ 17.419 ms  │ 18.107 ms  │ 24.384 ms  │ 2.71 MiB  │ 27650  │ 10      │
│ GPU           │ CenteredSecondOrder    │ 10.369 ms  │ 11.588 ms  │ 11.472 ms  │ 11.642 ms  │ 2.53 MiB  │ 16296  │ 10      │
│ GPU           │ UpwindBiasedFifthOrder │ 19.561 ms  │ 19.675 ms  │ 20.975 ms  │ 32.694 ms  │ 2.77 MiB  │ 32028  │ 10      │
│ GPU           │ UpwindBiasedThirdOrder │ 16.131 ms  │ 16.211 ms  │ 16.806 ms  │ 22.239 ms  │ 2.68 MiB  │ 25594  │ 10      │
│ GPU           │ WENO5                  │ 382.916 ms │ 385.558 ms │ 385.368 ms │ 386.709 ms │ 13.21 MiB │ 715860 │ 10      │
└───────────────┴────────────────────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┴─────────┘
Advection schemes CPU to GPU speedup
┌────────────────────────┬─────────┬─────────┬─────────┐
│ Schemes                │ speedup │ memory  │ allocs  │
├────────────────────────┼─────────┼─────────┼─────────┤
│ CenteredFourthOrder    │ 88.7159 │ 1.6849  │ 13.1918 │
│ CenteredSecondOrder    │ 89.3514 │ 1.57709 │ 7.77481 │
│ UpwindBiasedFifthOrder │ 114.4   │ 1.72647 │ 15.2805 │
│ UpwindBiasedThirdOrder │ 98.3274 │ 1.66538 │ 12.2109 │
│ WENO5                  │ 16.4404 │ 8.22094 │ 341.536 │
└────────────────────────┴─────────┴─────────┴─────────┘
Turbulence closures relative performance (GPU)
┌───────────────┬──────────────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │ Closures                         │ slowdown │ memory  │ allocs  │
├───────────────┼──────────────────────────────────┼──────────┼─────────┼─────────┤
│ GPU           │ AnisotropicBiharmonicDiffusivity │ 1.5313   │ 1.03189 │ 1.54697 │
│ GPU           │ AnisotropicDiffusivity           │ 1.05623  │ 1.00582 │ 1.01779 │
│ GPU           │ AnisotropicMinimumDissipation    │ 1.46265  │ 1.19908 │ 1.26817 │
│ GPU           │ IsotropicDiffusivity             │ 1.13134  │ 1.00607 │ 1.07995 │
│ GPU           │ Nothing                          │ 1.0      │ 1.0     │ 1.0     │
│ GPU           │ SmagorinskyLilly                 │ 1.41905  │ 1.30373 │ 1.18683 │
│ GPU           │ TwoDimensionalLeith              │ 1.11312  │ 1.06941 │ 1.06147 │
└───────────────┴──────────────────────────────────┴──────────┴─────────┴─────────┘
Thanks @ali-ramadhan for clarifying, that makes much more sense. The slowdown for WENO5 is even worse than what we saw last month, when it was pretty much the same as UpwindBiasedFifthOrder. Hmm....
I'm sorry, I misinterpreted the results @ali-ramadhan posted. I thought that CenteredSecondOrder was 1.0x slower with julia 1.6 than with 1.5 (and that small slowdowns were observed for the other schemes, which is why I recommended testing the biharmonic scheme.) Now I understand that these results are all for julia 1.6; we are comparing the results with previously obtained benchmarks (not posted) for julia 1.5.
Looking at @tomchor and @ali-ramadhan's results, it looks like simulations with WENO5 are running approximately 6-8 times slower on julia 1.6 than they were on julia 1.5, while other advection schemes (and closures) are unchanged --- correct?
Is the CPU performance of WENO5 roughly equivalent between julia 1.5 and julia 1.6?
Yes, from what I could test so far the CPU performance seems roughly equivalent between versions. Although it would be good if someone else tried to validate that as well.
Has anyone tried or knows how to profile these functions? I feel that it would be much easier to find out what's wrong if we profile WENO5 in Julia 1.5 and 1.6 separately.
Profiling is a very good idea. It probably makes sense to use an integrated / application profiler (rather than simply timing functions), because WENO5 is itself composed of many small functions and we don't know which one is the bottleneck.
I have never tried profiling on the GPU, but there's some info here: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/
Specifically I think we need to install NSight: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/#NVIDIA-Nsight-Systems
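For reference, a typical Nsight Systems invocation looks something like the following (the script path and output name are illustrative; see the CUDA.jl profiling docs linked above for details):

```shell
# Run the benchmark under Nsight Systems, tracing CUDA API calls and kernels.
nsys profile --trace=cuda,nvtx --output=weno5_report \
    julia --project benchmark/benchmark_advection_schemes.jl

# Summarize the generated report (the report file extension varies by
# nsys version: .qdrep for older releases, .nsys-rep for newer ones).
nsys stats weno5_report.qdrep
```

Comparing the per-kernel timings from a Julia 1.5 run against a 1.6 run should show whether the regression is in one kernel or spread across the launch overhead.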
Here's something: https://github.com/CliMA/Oceananigans.jl/pull/1770
I'm trying to run the benchmarks but they take a while, so that's in progress.
I believe #1770 does the trick:
[2021/06/25 18:04:55.066] INFO Writing Advection_schemes_relative_performance_(CPU).html...
Advection schemes relative performance (GPU)
┌───────────────┬────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │ Schemes                │ slowdown │ memory  │ allocs  │
├───────────────┼────────────────────────┼──────────┼─────────┼─────────┤
│ GPU           │ CenteredFourthOrder    │ 1.36629  │ 1.07711 │ 1.66944 │
│ GPU           │ CenteredSecondOrder    │ 1.0      │ 1.0     │ 1.0     │
│ GPU           │ UpwindBiasedFifthOrder │ 1.53522  │ 1.11266 │ 1.9781  │
│ GPU           │ UpwindBiasedThirdOrder │ 1.31322  │ 1.03505 │ 1.30432 │
│ GPU           │ WENO5                  │ 1.84272  │ 1.1889  │ 2.64008 │
└───────────────┴────────────────────────┴──────────┴─────────┴─────────┘
Would be good to get confirmation from someone.
Confirmed and approved. Would be good to release a new version with this asap.
Just ran benchmark_shallow_water_model.jl with the fixes introduced in #1770. There is indeed a notable increase in speedup compared to no specified advection scheme as shown in #1722. The CPU to GPU speedup went up from ~180 times to ~400 times.
Oceananigans v0.58.1
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
EBVERSIONJULIA = 1.6.0
JULIA_DEPOT_PATH = :
EBROOTJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0
EBDEVELJULIA = /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/julia/1.6.0/easybuild/avx2-Core-julia-1.6.0-easybuild-devel
JULIA_LOAD_PATH = :
GPU: Tesla V100-SXM2-32GB
Shallow water model benchmarks
┌───────────────┬─────────────┬───────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┬─────────┐
│ Architectures │ Float_types │ Ns    │ min        │ median     │ mean       │ max        │ memory    │ allocs │ samples │
├───────────────┼─────────────┼───────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┼─────────┤
│ CPU           │ Float64     │ 32    │ 2.677 ms   │ 2.876 ms   │ 3.047 ms   │ 4.806 ms   │ 1.36 MiB  │ 2253   │ 10      │
│ CPU           │ Float64     │ 64    │ 5.795 ms   │ 5.890 ms   │ 6.073 ms   │ 7.770 ms   │ 1.36 MiB  │ 2255   │ 10      │
│ CPU           │ Float64     │ 128   │ 16.979 ms  │ 17.350 ms  │ 17.578 ms  │ 19.993 ms  │ 1.36 MiB  │ 2255   │ 10      │
│ CPU           │ Float64     │ 256   │ 62.543 ms  │ 63.222 ms  │ 63.544 ms  │ 67.347 ms  │ 1.36 MiB  │ 2255   │ 10      │
│ CPU           │ Float64     │ 512   │ 250.149 ms │ 251.023 ms │ 251.092 ms │ 252.389 ms │ 1.36 MiB  │ 2315   │ 10      │
│ CPU           │ Float64     │ 1024  │ 990.901 ms │ 993.115 ms │ 993.360 ms │ 996.091 ms │ 1.36 MiB  │ 2315   │ 6       │
│ CPU           │ Float64     │ 2048  │ 4.002 s    │ 4.004 s    │ 4.004 s    │ 4.007 s    │ 1.36 MiB  │ 2315   │ 2       │
│ CPU           │ Float64     │ 4096  │ 16.371 s   │ 16.371 s   │ 16.371 s   │ 16.371 s   │ 1.36 MiB  │ 2315   │ 1       │
│ CPU           │ Float64     │ 8192  │ 64.657 s   │ 64.657 s   │ 64.657 s   │ 64.657 s   │ 1.36 MiB  │ 2315   │ 1       │
│ CPU           │ Float64     │ 16384 │ 290.423 s  │ 290.423 s  │ 290.423 s  │ 290.423 s  │ 1.36 MiB  │ 2315   │ 1       │
│ GPU           │ Float64     │ 32    │ 3.468 ms   │ 3.656 ms   │ 3.745 ms   │ 4.695 ms   │ 1.82 MiB  │ 5687   │ 10      │
│ GPU           │ Float64     │ 64    │ 3.722 ms   │ 3.903 ms   │ 4.050 ms   │ 5.671 ms   │ 1.82 MiB  │ 5687   │ 10      │
│ GPU           │ Float64     │ 128   │ 3.519 ms   │ 3.808 ms   │ 4.042 ms   │ 6.372 ms   │ 1.82 MiB  │ 5687   │ 10      │
│ GPU           │ Float64     │ 256   │ 3.822 ms   │ 4.153 ms   │ 4.288 ms   │ 5.810 ms   │ 1.82 MiB  │ 5687   │ 10      │
│ GPU           │ Float64     │ 512   │ 4.637 ms   │ 4.932 ms   │ 4.961 ms   │ 5.728 ms   │ 1.82 MiB  │ 5765   │ 10      │
│ GPU           │ Float64     │ 1024  │ 3.240 ms   │ 3.424 ms   │ 3.527 ms   │ 4.553 ms   │ 1.82 MiB  │ 5799   │ 10      │
│ GPU           │ Float64     │ 2048  │ 10.783 ms  │ 10.800 ms  │ 11.498 ms  │ 17.824 ms  │ 1.98 MiB  │ 16305  │ 10      │
│ GPU           │ Float64     │ 4096  │ 41.880 ms  │ 41.911 ms  │ 42.485 ms  │ 47.627 ms  │ 2.67 MiB  │ 61033  │ 10      │
│ GPU           │ Float64     │ 8192  │ 166.751 ms │ 166.800 ms │ 166.847 ms │ 167.129 ms │ 5.21 MiB  │ 227593 │ 10      │
│ GPU           │ Float64     │ 16384 │ 681.129 ms │ 681.249 ms │ 681.301 ms │ 681.583 ms │ 16.59 MiB │ 973627 │ 8       │
└───────────────┴─────────────┴───────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┴─────────┘
Shallow water model CPU to GPU speedup
┌─────────────┬───────┬──────────┬─────────┬─────────┐
│ Float_types │ Ns    │ speedup  │ memory  │ allocs  │
├─────────────┼───────┼──────────┼─────────┼─────────┤
│ Float64     │ 32    │ 0.786715 │ 1.33777 │ 2.52419 │
│ Float64     │ 64    │ 1.50931  │ 1.33774 │ 2.52195 │
│ Float64     │ 128   │ 4.55587  │ 1.33774 │ 2.52195 │
│ Float64     │ 256   │ 15.2238  │ 1.33774 │ 2.52195 │
│ Float64     │ 512   │ 50.8995  │ 1.33771 │ 2.49028 │
│ Float64     │ 1024  │ 290.085  │ 1.33809 │ 2.50497 │
│ Float64     │ 2048  │ 370.777  │ 1.45575 │ 7.0432  │
│ Float64     │ 4096  │ 390.617  │ 1.95667 │ 26.3641 │
│ Float64     │ 8192  │ 387.632  │ 3.82201 │ 98.3123 │
│ Float64     │ 16384 │ 426.31   │ 12.177  │ 420.573 │
└─────────────┴───────┴──────────┴─────────┴─────────┘
Thanks, @hennyg888.
I noticed a few weeks ago that my scripts were much slower after the Julia 1.6 upgrade (which is preventing me from upgrading). I thought it was due to my Julia 1.6 installation, but after some tests I now think it's an Oceananigans issue, specifically with the WENO5 scheme.
I ran the MWE below in both Julia 1.5 (with Oceananigans v0.57.1) and Julia 1.6 (I tried several Oceananigans versions, but for this example I'm using v0.58.5) on GPUs, and the speed difference is pretty huge. The interesting part is that this difference only appears if I use WENO5 on a GPU. With the 2nd-order centered scheme there is no significant difference in time (I haven't tried other schemes), and if I run the script on CPUs the time difference also appears to be small.
Here's the script:
The output for Julia 1.5:
While for Julia 1.6 this is the output after the same amount of wall time:
Has someone else experienced this?
Any ideas as to what might be causing it? I'd really like to upgrade my production-ready scripts, but there's no way I can do that until this issue is resolved, unfortunately :/