SpeedyWeather / SpeedyWeather.jl

Play atmospheric modelling like it's LEGO.
https://speedyweather.github.io/SpeedyWeather.jl/dev
MIT License
426 stars 26 forks source link

Performance of the spectral transform #32

Open milankl opened 2 years ago

milankl commented 2 years ago

This is an issue to track to progress in speeding up the spectral transform. We are starting with

Fourier transform

Legendre transform

With the following benchmark (back&forth transform)

https://github.com/milankl/SpeedyWeather.jl/blob/5b6e8102a506946508f5e8188eb75f7b5df48a4e/test/benchmark.jl#L1-L16

Timings are

  6.020 ms (4084 allocations: 580.59 KiB)      # Fourier
  131.585 μs (1345 allocations: 684.67 KiB)    # Legendre
  6.434 ms (5429 allocations: 1.24 MiB)        # Both
milankl commented 2 years ago

This publication might be relevant too https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/ggge.20071

jakebolewski commented 2 years ago

In CliMA we are using https://github.com/jmert/AssociatedLegendrePolynomials.jl for some spherical harmonic calcs.

milankl commented 2 years ago

Amazing, didn't see this package! I guess that replaces https://github.com/milankl/SpeedyWeather.jl/blob/00faf716f3a1023b5e472efe6f8c989c52729eb1/src/legendre.jl#L4-L47 ?

jakebolewski commented 2 years ago

It is really nice, I'm glad we got @jmert to register it!

Check out his blog posts if you have time:
https://jmert.github.io/AssociatedLegendrePolynomials.jl/stable/man/references/#Blog-series

jakebolewski commented 2 years ago

I haven't seen the SHTns library before, we could get a JLL built of the library and a small package developed. We mostly use these for diagnostic purposes in CliMA.

milankl commented 2 years ago

Indeed, for SpeedyWeather.jl they'll only be computed once in the model setup so performance isn't essential. But having a (hopefully ;) bug-free implementation with some nice documentation what's actually done would be amazing. My mathematical understanding of the legendre polynomials isn't ... polished :)

milankl commented 2 years ago

Using inspirations from @jmert's excellent blog articles the backtransform/synthesis (i.e. inverse FFT + inverse Legendre transform) could be written as

import FFTW, BenchmarkTools

function synthesize(alms::Matrix{C}, nθ, nϕ,
                    brfft_plan::FFTW.rFFTWPlan{C}
                    ) where {C<:Complex}
    lmax, mmax = size(alms) .- 1
    R = real(C)

    θ = R(π)/nθ/2:R(π)/nθ:π
    ϕ₀ = π / nϕ         # == first(crange(0.0, 2.0π, nϕ))
    nϕr = nϕ ÷ 2 + 1    # length real-only half of FFT axis; see `rfft()`
    nθh = (nθ + 1) ÷ 2  # number of rings in northern hemisphere

    # preallocate
    Λ = Matrix{R}(undef, size(alms)...)
    gn = zeros(C, nϕr)  # phase factors for northern ring
    gs = zeros(C, nϕr)  # phase factors for southern ring
    map = zeros(R, nθ, nϕ)

    for y in 1:nθh
        ys = nθ - y + 1 # southern ring index
        λlm!(Λ, lmax, mmax, cos(θ[y]))  # λ_ℓ^m(cos θ) factors

        for m in 0:mmax
            accn, accs = zero(C), zero(C)

            for ℓ in m:lmax             # Σ_{ℓ = m}^{lmax}
                term = alms[ℓ+1,m+1] * Λ[ℓ+1,m+1]
                accn += term
                accs += isodd(ℓ + m) ? -term : term
            end
            accn, accs = (accn, accs) .* cis(m * ϕ₀) # ring offset rotation
            gn[m+1] += accn
            gs[m+1] += accs
        end
        map[y, :] = brfft_plan*gn
        fill!(gn, zero(C))
        map[ys, :] = brfft_plan*gs
        fill!(gs, zero(C))
    end
    return map
end

alms = rand(ComplexF64,31,31)
nθ, nϕ = 48, 96
brfft_plan = FFTW.plan_brfft(rand(ComplexF64,nϕ÷2+1),nϕ)

timings are at T31 about 30x faster than what we currently have, using symmetries (zonal & meridional) and an fft plan

julia> @btime M = synthesize($alms,nθ,nϕ,$brfft_plan);
  118.203 μs (173 allocations: 89.36 KiB)

At T31 the recalculation of the associated Legendre polynomials with λlm! takes up about 50% of the timings, such that we could reach 60x faster with precalculating all polynomials at such low resolution. However, at higher resolution it makes sense to not precalculate to avoid the storage of an lmax x mmax x nlat/2 array and also performance-wise precalculation is not essential.

milankl commented 2 years ago

implementation started #38

milankl commented 2 years ago

The new spectral transform in #38 is now 60x faster than in the first post of this issue.

julia> @btime spectral!($geopot_surf_spectral,$geopot_surf_grid,$S);
49.894 μs (50 allocations: 4.00 KiB)

julia> @btime gridded!($geopot_surf_grid,$geopot_surf_spectral,$S);
65.839 μs (50 allocations: 4.00 KiB)

So we are at about 115 μs for the back and forth transformation now, compared to 6.5 ms we started with. This is based on

There's still some spectral leakage that I'll try to adress.

milankl commented 2 years ago

image

Current spectral leakage in geopotential (gravity*m) via

julia> geopot_surf_spectral = B.geopot_surf
julia> geopot_surf_grid = gridded(geopot_surf_spectral)
julia> geopot_surf_spectral2 = spectral(geopot_surf_grid)
julia> geopot_surf_grid2 = gridded(geopot_surf_spectral2)
julia> imshow((geopot_surf_grid-geopot_surf_grid2)')

and in spectral space

image

via

julia> imshow(abs.(geopot_surf_spectral-geopot_surf_spectral2))
milankl commented 2 years ago

Spectral leakage was solved now with #38 using Gaussian latitudes and legendre weights. For the record @samhatfield, SpeedyWeather.jl's new spectral transform will retain an optional flag recompute_legendre (false by default) in case that's interesting in the future. Timings with recomputation at T31, 96x48, Float64

julia> @btime spectral!($geopot_surf_spectral,$geopot_surf_grid,$S);
98.499 μs (170 allocations: 5.88 KiB)
julia> @btime gridded!($geopot_surf_grid,$geopot_surf_spectral,$S);
107.367 μs (170 allocations: 5.88 KiB)
which is about 1.8x slower than precomputed (see above). However, the precomputed Legendre polynomials take up quite some space (Float64), compared to the speedup from precomputation resolution size speedup
T31 0.2 MB 1.75x
T42 0.5 MB 2x
T85 3.8 MB 2.2x
T170 29.9 MB 2.35x
T341 239.5 MB 2.35x
T682 1.9 GB 1.8x
T1365 15.3 GB untested

I reckon that this trend continues for higher resolutions, such that eventually recomputing makes absolute sense. @jmert that's probably what you were hinting at in your blog articles. Having said that, at the resolutions we'll probably run SpeedyWeather.jl at, recomputation does not make sense.

jmert commented 2 years ago

I reckon that this trend continues for higher resolutions, such that eventually recomputing makes absolute sense. @jmert that's probably what you were hinting at in your blog articles.

Yes, the memory increase is one reason I had in mind for computing the Legendre polynomials on demand. The other ends up being that on some very high-performance computers, I've seen that recomputing is faster than fetching from memory when the entire system is very busy (i.e. spare memory bandwidth is low) and you're doing something like summing over ℓ on a fixed-m slice. (So not directly applicable to spectral transforms in this context, but worth alluding to in my article.)

milankl commented 2 years ago

Avoiding some unnecessary allocations in get_legendre_polynomials that somehow got introduced since, performance with #66 is now at

julia> @btime SpeedyWeather.spectral!($vor1,$vor_grid1,$S);
  36.981 μs (74 allocations: 5.50 KiB)

julia> @btime SpeedyWeather.gridded!($vor_grid1,$vor1,$S);
  49.693 μs (74 allocations: 5.50 KiB)

So about 85μs for the return transform at T31 & Float64 (but Float32 isn't faster atm), means we are now at 76x compared to where we started from.

milankl commented 2 years ago

117 introduced LowerTriangularMatrix in the spectral transforms, shaving off about 30% (see here), now 479531c

New timings are (Float64 only, Float32 still much slower which I can't explain yet)

T31

julia> alms64 = LowerTriangularMatrix(randn(ComplexF64,33,32));
julia> map64 = gridded(alms64);
julia> p,d,m64 = initialize_speedy(Float64,trunc=31,grid=FullGaussianGrid);
julia> @btime SpeedyWeather.spectral!($alms64,$map64,$m64.spectral_transform);
  24.885 μs (58 allocations: 4.12 KiB)

julia> @btime SpeedyWeather.gridded!($map64,$alms64,$m64.spectral_transform);
  32.739 μs (58 allocations: 4.12 KiB)

Same for T85

julia> @btime SpeedyWeather.spectral!($alms64,$map64,$m64.spectral_transform);
  260.338 μs (138 allocations: 10.50 KiB)

julia> @btime SpeedyWeather.gridded!($map64,$alms64,$m64.spectral_transform);
  231.473 μs (138 allocations: 10.50 KiB)

Same for T682

julia> @btime SpeedyWeather.spectral!($alms64,$map64,$m64.spectral_transform);
  119.033 ms (1036 allocations: 80.34 KiB)

julia> @btime SpeedyWeather.gridded!($map64,$alms64,$m64.spectral_transform);
  106.877 ms (1036 allocations: 80.34 KiB)

So about -35% at T31, but -50% at T85 and T682 🥳