SpeedyWeather / SpeedyWeather.jl

Play atmospheric modelling like it's LEGO.
https://speedyweather.github.io/SpeedyWeather.jl/dev

Parallelism in the vertical #20

Closed. milankl closed this issue 1 year ago.

milankl commented 2 years ago

One easy way to parallelise speedy might be to distribute the calculation of the spectral transform across n workers in the vertical, using SharedArrays (documentation here) from Julia's standard library. This caps the parallel speedup at n×, which might be fine as SpeedyWeather.jl will probably run on small clusters only anyway. For T30 and n=8 levels we could (hopefully) run efficiently on 8 cores, and for n=48 or 64 levels we could get significant speedups for higher-resolution versions of SpeedyWeather.jl (T100-T500). Given the shared memory of this approach we'd be limited to the 48 cores of an A64FX, but that might be absolutely sufficient for now.
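
Roughly what that could look like (a minimal sketch only; the grid sizes, worker count and transform_level! are placeholders, not SpeedyWeather functions):

    using Distributed
    addprocs(8)                                   # e.g. one worker per vertical level
    @everywhere using SharedArrays

    nlon, nlat, nlev = 96, 48, 8                  # placeholder grid dimensions
    grid = SharedArray{Float64}(nlon, nlat, nlev) # shared between processes on one node

    # transform_level! stands in for the per-level spectral transform
    @everywhere transform_level!(grid, k) = (grid[:, :, k] .*= 2)

    @sync @distributed for k in 1:nlev            # each level handled by one worker
        transform_level!(grid, k)
    end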

milankl commented 2 years ago

As outlined here, in the @eval metaprogramming style one could similarly collect all functions that need a vertical loop and think about adding an @distributed there once SharedArrays are set up https://github.com/milankl/SpeedyWeather.jl/blob/f21b69eae24f9be0b0667ebe381671d994b9b14f/src/distributed_vertical.jl#L1-L16
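
A hedged sketch of that idea (the function list and signatures are illustrative, not the actual contents of distributed_vertical.jl):

    # Generate a vertical-loop method for each listed per-layer function;
    # the inner loop is the candidate spot for @distributed once SharedArrays exist.
    for fname in (:vordiv_tendencies!, :temperature_tendency!)   # illustrative list
        @eval function $fname(layers::AbstractVector, args...)
            for layer in layers
                $fname(layer, args...)
            end
        end
    end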

milankl commented 2 years ago

With #117 the @eval-based looping on the level of individual functions has been removed. We now have, e.g. for the barotropic model, a single place in timestep! where the looping over the vertical happens, i.e. we moved that loop as far up as possible: https://github.com/milankl/SpeedyWeather.jl/blob/4a3ad4a7659bd1ae8437b15617781420b396678e/src/time_integration.jl#L246-L252 This is where any @distributed-like parallelism should be applied, as the evaluation within the loop is conceptually independent across layers.
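
Schematically (a sketch of the idea, not the linked timestep! code; tendencies! stands in for the per-layer right-hand side):

    # With the vertical loop hoisted to a single place, layers can be
    # processed independently, e.g. with threads instead of @distributed:
    Threads.@threads for k in eachindex(diagn.layers)
        tendencies!(diagn.layers[k], model)   # per-layer work, independent across layers
    end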

milankl commented 1 year ago

One thread

    julia> run_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=26,n_days=1,orography=ZonalRidge)
    475.47 days/day

vs 8 threads (4.1x)

    julia> run_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=26,n_days=1,orography=ZonalRidge)
    5.50 years/day

vs 16 threads (6.1x)

    julia> run_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=26,n_days=1,orography=ZonalRidge)
    8.21 years/day

white-alistair commented 1 year ago

@milankl what's the versioninfo for these tests? Where do you suspect the bottlenecks are? What kind of scaling would you be satisfied with?

One thing I noticed is that the speedups here are quite consistent with the parallel mergesort showcased in the original Julia multi-threading blogpost.

milankl commented 1 year ago

That's a good resource, thanks for sharing. In contrast to mergesort, the tasks here are quite a bit more expensive; e.g. we can multi-thread this part of the right-hand side completely

    @floop for layer in diagn.layers
        vertical_velocity!(layer,surface,model)     # calculate σ̇ for the vertical mass flux M = pₛσ̇
                                                    # add the RTₖlnpₛ term to geopotential
        linear_pressure_gradient!(layer,progn,model,lf_implicit)
        vertical_advection!(layer,diagn,model)      # use σ̇ for the vertical advection of u,v,T,q

        vordiv_tendencies!(layer,surface,model)     # vorticity advection, pressure gradient term
        temperature_tendency!(layer,surface,model)  # hor. advection + adiabatic term
        humidity_tendency!(layer,model)             # horizontal advection of humidity (nothing for wetcore)
        bernoulli_potential!(layer,S)               # add -∇²(E+ϕ+RTₖlnpₛ) term to div tendency
    end

across layers, which includes several spectral transforms and other expensive operations. So on this I'd expect scaling to be nearly perfect. On the other hand, there are things like the vertical integrations and the geopotential which are currently single-threaded (maybe they can be spawned off though). So in short: I haven't done enough profiling to know what parallelism potential there is and at what resolution/number of vertical levels.
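
Very roughly, "spawning off" could look like the following (a sketch only; the call signatures and the assumption that the spawned work is independent of the threaded loop are illustrative):

    # Run the currently serial column-wise work as a task while independent
    # per-layer work is threaded; wait before anything that depends on it.
    column_task = Threads.@spawn geopotential!(diagn, progn, model)
    Threads.@threads for k in eachindex(diagn.layers)
        vordiv_tendencies!(diagn.layers[k], surface, model)
    end
    wait(column_task)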