As outlined here, written in an `@eval` metaprogramming style, one could similarly collect all functions that need a vertical loop and think about adding an `@distributed` here once SharedArrays are set up
https://github.com/milankl/SpeedyWeather.jl/blob/f21b69eae24f9be0b0667ebe381671d994b9b14f/src/distributed_vertical.jl#L1-L16
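A hedged sketch of what such `@eval`-generated vertical loops with `@distributed` could look like; the function names, signatures and the `model.nlev` field are illustrative, not SpeedyWeather's actual API:

```julia
using Distributed

# Generate a whole-column method for every per-layer function: the serial
# loop over levels becomes an @distributed loop once the underlying arrays
# are SharedArrays. All names/signatures here are hypothetical.
for func in (:vordiv_tendencies!, :temperature_tendency!, :humidity_tendency!)
    @eval function $func(diagn, model)              # whole-column method
        @sync @distributed for k in 1:model.nlev    # one layer per worker
            $func(diagn.layers[k], k, model)        # per-layer method
        end
    end
end
```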
With #117 the `@eval`-based looping on the level of individual functions has been removed. We now have, e.g. for the barotropic model, a single place in `timestep!` where the looping over the vertical happens; hence we moved that loop as far up as possible
https://github.com/milankl/SpeedyWeather.jl/blob/4a3ad4a7659bd1ae8437b15617781420b396678e/src/time_integration.jl#L246-L252
This is where any `@distributed`-like parallelism should be applied, as the evaluation within the loop is conceptually independent across layers.
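A minimal sketch of what that could look like with shared-memory threads instead of `@distributed`, assuming a per-layer `timestep!` method and that `progn.layers`/`diagn.layers` hold the vertical levels (illustrative names):

```julia
# Layers are conceptually independent here, so the loop in timestep!
# can be annotated directly; the thread count is fixed at Julia startup.
Threads.@threads for k in eachindex(diagn.layers)
    timestep!(progn.layers[k], diagn.layers[k], dt, model)  # hypothetical per-layer method
end
```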
One thread:

```julia
julia> run_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=26,n_days=1,orography=ZonalRidge)
475.47 days/day
```

vs 8 threads (4.1x):

```julia
julia> run_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=26,n_days=1,orography=ZonalRidge)
5.50 years/day
```

vs 16 threads (6.1x):

```julia
julia> run_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=26,n_days=1,orography=ZonalRidge)
8.21 years/day
```
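For reference, Julia's thread count is fixed at startup (e.g. `julia --threads 16`) and can be checked in-session; output shown for the 16-thread run above:

```julia
julia> Threads.nthreads()   # number of threads Julia was started with
16
```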
@milankl what's the `versioninfo` for these tests? Where do you suspect the bottlenecks are? What kind of scaling would you be satisfied with?
One thing I noticed is that the speedups here are quite consistent with the parallel mergesort showcased in the original Julia multithreading blogpost.
That's a good resource, thanks for sharing. In contrast to mergesort, the tasks here are quite a bit more expensive; e.g. we can multithread this part of the right-hand side completely
```julia
using FLoops # for the multithreaded @floop

@floop for layer in diagn.layers
    vertical_velocity!(layer,surface,model)                   # calculate σ̇ for the vertical mass flux M = pₛσ̇
    linear_pressure_gradient!(layer,progn,model,lf_implicit)  # add the RTₖlnpₛ term to geopotential
    vertical_advection!(layer,diagn,model)                    # use σ̇ for the vertical advection of u,v,T,q
    vordiv_tendencies!(layer,surface,model)                   # vorticity advection, pressure gradient term
    temperature_tendency!(layer,surface,model)                # hor. advection + adiabatic term
    humidity_tendency!(layer,model)                           # horizontal advection of humidity (nothing for drycore)
    bernoulli_potential!(layer,S)                             # add -∇²(E+ϕ+RTₖlnpₛ) term to div tendency
end
```
across layers, which includes several spectral transforms and other expensive operations, so on this I'd expect scaling to be nearly perfect. On the other hand, there are things like the vertical integrations and the geopotential which are currently single-threaded (maybe they can be spawned off though). In short: I haven't done enough profiling to know what parallelism potential there is, and at what resolution/number of vertical levels.
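To illustrate the spawning idea: a hedged sketch (illustrative names and signatures, `geopotential!`'s call here is assumed) that overlaps the serial vertical integration with per-layer work that doesn't depend on it:

```julia
using FLoops

# Run the (vertically serial) geopotential calculation as a task so it
# overlaps with layer-wise work that is independent of it.
geo_task = Threads.@spawn geopotential!(diagn, model)

@floop for layer in diagn.layers
    humidity_tendency!(layer, model)    # does not need the geopotential
end

wait(geo_task)   # synchronise before terms that use ϕ, e.g. bernoulli_potential!
```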
One easy way to parallelise speedy might be to distribute the calculation of the spectral transform across `n` workers in the vertical, using SharedArrays (documentation here) from Julia's standard library. This limits us to `n`x speedups from parallelisation, which might be fine as SpeedyWeather.jl will probably run on small clusters only anyway. So for T30 and `n=8` levels we can (hopefully) run efficiently on 8 cores, but for `n=48,64` levels we could get significant speedups for higher-resolution versions of SpeedyWeather.jl (T100-T500). Given the shared memory of this approach, we'll be limited by the 48 cores of the A64FX, but that might be absolutely sufficient for now.
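A minimal sketch of that idea, assuming one worker per level and a hypothetical per-layer transform `transform_layer!` (SpeedyWeather's actual transform API will differ):

```julia
using Distributed
addprocs(8)                             # n workers, e.g. one per vertical level
@everywhere using SharedArrays

nlev, nlon, nlat = 8, 96, 48            # e.g. a T30-sized grid with n=8 levels
grids = SharedArray{Float32}(nlon, nlat, nlev)   # one shared array, visible to all workers

# placeholder for a per-layer spectral transform (hypothetical)
@everywhere transform_layer!(grids, k) = (grids[:, :, k] .= Float32(k))

@sync @distributed for k in 1:nlev      # one layer per worker, no communication needed
    transform_layer!(grids, k)
end
```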