Open milankl opened 2 years ago
This publication might be relevant too https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/ggge.20071
In CliMA we are using https://github.com/jmert/AssociatedLegendrePolynomials.jl for some spherical harmonic calcs.
Amazing, didn't see this package! I guess that replaces https://github.com/milankl/SpeedyWeather.jl/blob/00faf716f3a1023b5e472efe6f8c989c52729eb1/src/legendre.jl#L4-L47 ?
It is really nice, I'm glad we got @jmert to register it!
Check out his blog posts if you have time:
https://jmert.github.io/AssociatedLegendrePolynomials.jl/stable/man/references/#Blog-series
I haven't seen the SHTns library before, we could get a JLL built of the library and a small package developed. We mostly use these for diagnostic purposes in CliMA.
Indeed, for SpeedyWeather.jl they'll only be computed once in the model setup so performance isn't essential. But having a (hopefully ;) bug-free implementation with some nice documentation what's actually done would be amazing. My mathematical understanding of the legendre polynomials isn't ... polished :)
Using inspirations from @jmert's excellent blog articles the backtransform/synthesis (i.e. inverse FFT + inverse Legendre transform) could be written as
import FFTW, BenchmarkTools
function synthesize(alms::Matrix{C}, nθ, nϕ,
brfft_plan::FFTW.rFFTWPlan{C}
) where {C<:Complex}
lmax, mmax = size(alms) .- 1
R = real(C)
θ = R(π)/nθ/2:R(π)/nθ:π
ϕ₀ = π / nϕ # == first(crange(0.0, 2.0π, nϕ))
nϕr = nϕ ÷ 2 + 1 # length real-only half of FFT axis; see `rfft()`
nθh = (nθ + 1) ÷ 2 # number of rings in northern hemisphere
# preallocate
Λ = Matrix{R}(undef, size(alms)...)
gn = zeros(C, nϕr) # phase factors for northern ring
gs = zeros(C, nϕr) # phase factors for southern ring
map = zeros(R, nθ, nϕ)
for y in 1:nθh
ys = nθ - y + 1 # southern ring index
λlm!(Λ, lmax, mmax, cos(θ[y])) # λ_ℓ^m(cos θ) factors
for m in 0:mmax
accn, accs = zero(C), zero(C)
for ℓ in m:lmax # Σ_{ℓ = m}^{lmax}
term = alms[ℓ+1,m+1] * Λ[ℓ+1,m+1]
accn += term
accs += isodd(ℓ + m) ? -term : term
end
accn, accs = (accn, accs) .* cis(m * ϕ₀) # ring offset rotation
gn[m+1] += accn
gs[m+1] += accs
end
map[y, :] = brfft_plan*gn
fill!(gn, zero(C))
map[ys, :] = brfft_plan*gs
fill!(gs, zero(C))
end
return map
end
alms = rand(ComplexF64,31,31)
nθ, nϕ = 48, 96
brfft_plan = FFTW.plan_brfft(rand(ComplexF64,nϕ÷2+1),nϕ)
timings are at T31 about 30x faster than what we currently have, using symmetries (zonal & meridional) and an fft plan
julia> @btime M = synthesize($alms,nθ,nϕ,$brfft_plan);
118.203 μs (173 allocations: 89.36 KiB)
At T31 the recalculation of the associated Legendre polynomials with λlm!
takes up about 50% of the timings, such that we could reach 60x faster with precalculating all polynomials at such low resolution. However, at higher resolution it makes sense to not precalculate to avoid the storage of an lmax x mmax x nlat/2 array and also performance-wise precalculation is not essential.
implementation started #38
The new spectral transform in #38 is now 60x faster than in the first post of this issue.
julia> @btime spectral!($geopot_surf_spectral,$geopot_surf_grid,$S);
49.894 μs (50 allocations: 4.00 KiB)
julia> @btime gridded!($geopot_surf_grid,$geopot_surf_spectral,$S);
65.839 μs (50 allocations: 4.00 KiB)
So we are at about 115 μs for the back and forth transformation now, compared to 6.5 ms we started with. This is based on
There's still some spectral leakage that I'll try to adress.
Current spectral leakage in geopotential (gravity*m) via
julia> geopot_surf_spectral = B.geopot_surf
julia> geopot_surf_grid = gridded(geopot_surf_spectral)
julia> geopot_surf_spectral2 = spectral(geopot_surf_grid)
julia> geopot_surf_grid2 = gridded(geopot_surf_spectral2)
julia> imshow((geopot_surf_grid-geopot_surf_grid2)')
and in spectral space
via
julia> imshow(abs.(geopot_surf_spectral-geopot_surf_spectral2))
Spectral leakage was solved now with #38 using Gaussian latitudes and legendre weights. For the record @samhatfield, SpeedyWeather.jl's new spectral transform will retain an optional flag recompute_legendre
(false by default) in case that's interesting in the future. Timings with recomputation at T31, 96x48, Float64
julia> @btime spectral!($geopot_surf_spectral,$geopot_surf_grid,$S);
98.499 μs (170 allocations: 5.88 KiB)
julia> @btime gridded!($geopot_surf_grid,$geopot_surf_spectral,$S);
107.367 μs (170 allocations: 5.88 KiB)
which is about 1.8x slower than precomputed (see above). However, the precomputed Legendre polynomials take up quite some space (Float64), compared to the speedup from precomputation | resolution | size | speedup |
---|---|---|---|
T31 | 0.2 MB | 1.75x | |
T42 | 0.5 MB | 2x | |
T85 | 3.8 MB | 2.2x | |
T170 | 29.9 MB | 2.35x | |
T341 | 239.5 MB | 2.35x | |
T682 | 1.9 GB | 1.8x | |
T1365 | 15.3 GB | untested |
I reckon that this trend continues for higher resolutions, such that eventually recomputing makes absolute sense. @jmert that's probably what you were hinting at in your blog articles. Having said that, at the resolutions we'll probably run SpeedyWeather.jl at, recomputation does not make sense.
I reckon that this trend continues for higher resolutions, such that eventually recomputing makes absolute sense. @jmert that's probably what you were hinting at in your blog articles.
Yes, the memory increase is one reason I had in mind for computing the Legendre polynomials on demand. The other ends up being that on some very high-performance computers, I've seen that recomputing is faster than fetching from memory when the entire system is very busy (i.e. spare memory bandwidth is low) and you're doing something like summing over ℓ on a fixed-m slice. (So not directly applicable to spectral transforms in this context, but worth alluding to in my article.)
Avoiding some unnecessary allocations in get_legendre_polynomials
that somehow got introduced since, performance with #66 is now at
julia> @btime SpeedyWeather.spectral!($vor1,$vor_grid1,$S);
36.981 μs (74 allocations: 5.50 KiB)
julia> @btime SpeedyWeather.gridded!($vor_grid1,$vor1,$S);
49.693 μs (74 allocations: 5.50 KiB)
So about 85μs for the return transform at T31 & Float64 (but Float32 isn't faster atm), means we are now at 76x compared to where we started from.
LowerTriangularMatrix
in the spectral transforms, shaving off about 30% (see here), now 479531cmuladd
for the Legendre transformlon_offset
, avoiding costly cis((m-1)*lon_offset)
operationsNew timings are (Float64 only, Float32 still much slower which I can't explain yet)
T31
julia> alms64 = LowerTriangularMatrix(randn(ComplexF64,33,32));
julia> map64 = gridded(alms64);
julia> p,d,m64 = initialize_speedy(Float64,trunc=31,grid=FullGaussianGrid);
julia> @btime SpeedyWeather.spectral!($alms64,$map64,$m64.spectral_transform);
24.885 μs (58 allocations: 4.12 KiB)
julia> @btime SpeedyWeather.gridded!($map64,$alms64,$m64.spectral_transform);
32.739 μs (58 allocations: 4.12 KiB)
Same for T85
julia> @btime SpeedyWeather.spectral!($alms64,$map64,$m64.spectral_transform);
260.338 μs (138 allocations: 10.50 KiB)
julia> @btime SpeedyWeather.gridded!($map64,$alms64,$m64.spectral_transform);
231.473 μs (138 allocations: 10.50 KiB)
Same for T682
julia> @btime SpeedyWeather.spectral!($alms64,$map64,$m64.spectral_transform);
119.033 ms (1036 allocations: 80.34 KiB)
julia> @btime SpeedyWeather.gridded!($map64,$alms64,$m64.spectral_transform);
106.877 ms (1036 allocations: 80.34 KiB)
So about -35% at T31, but -50% at T85 and T682 🥳
This is an issue to track to progress in speeding up the spectral transform. We are starting with
Fourier transform
Legendre transform
@inbounds
dot
(probably OpenBLAS)With the following benchmark (back&forth transform)
https://github.com/milankl/SpeedyWeather.jl/blob/5b6e8102a506946508f5e8188eb75f7b5df48a4e/test/benchmark.jl#L1-L16
Timings are