SpeedyWeather / SpeedyWeather.jl


Shortening the Legendre Transform over m #131

Closed milankl closed 2 years ago

milankl commented 2 years ago

This is an issue to collect ideas around the linear vs quadratic vs cubic truncation when looping over the zonal wavenumbers $m$. With #127 we'll do

for m in 1:min(nfreq, mmax+1)

with nfreq = nlon÷2 + 1 and nlon the number of longitude points on a given latitude ring. So for full grids the loop goes over all m, but for reduced grids it can be shortened. IFS seems to do the following (columns: ring index j, nlon, nfreq) at Tco79

 (   1  20   8) (   2  24  10) (   3  28  12) (   4  32  14) (   5  36  16) (   6  40  18) (   7  44  20) (   8  48  22)
 (   9  52  24) (  10  56  26) (  11  60  27) (  12  64  29) (  13  68  31) (  14  72  33) (  15  76  35) (  16  80  36)
 (  17  84  38) (  18  88  40) (  19  92  41) (  20  96  43) (  21 100  44) (  22 104  46) (  23 108  47) (  24 112  49)
 (  25 116  50) (  26 120  52) (  27 124  53) (  28 128  55) (  29 132  56) (  30 136  57) (  31 140  58) (  32 144  60)
 (  33 148  61) (  34 152  62) (  35 156  63) (  36 160  64) (  37 164  65) (  38 168  67) (  39 172  68) (  40 176  69)
 (  41 180  70) (  42 184  71) (  43 188  72) (  44 192  73) (  45 196  74) (  46 200  75) (  47 204  76) (  48 208  77)
 (  49 212  78) (  50 216  79) (  51 220  79) (  52 224  79) (  53 228  79) (  54 232  79) (  55 236  79) (  56 240  79)
 (  57 244  79) (  58 248  79) (  59 252  79) (  60 256  79) (  61 260  79) (  62 264  79) (  63 268  79) (  64 272  79)
 (  65 276  79) (  66 280  79) (  67 284  79) (  68 288  79) (  69 292  79) (  70 296  79) (  71 300  79) (  72 304  79)
 (  73 308  79) (  74 312  79) (  75 316  79) (  76 320  79) (  77 324  79) (  78 328  79) (  79 332  79) (  80 336  79)
 (  81 336  79) (  82 332  79) (  83 328  79) (  84 324  79) (  85 320  79) (  86 316  79) (  87 312  79) (  88 308  79)
 (  89 304  79) (  90 300  79) (  91 296  79) (  92 292  79) (  93 288  79) (  94 284  79) (  95 280  79) (  96 276  79)
 (  97 272  79) (  98 268  79) (  99 264  79) ( 100 260  79) ( 101 256  79) ( 102 252  79) ( 103 248  79) ( 104 244  79)
 ( 105 240  79) ( 106 236  79) ( 107 232  79) ( 108 228  79) ( 109 224  79) ( 110 220  79) ( 111 216  79) ( 112 212  78)
 ( 113 208  77) ( 114 204  76) ( 115 200  75) ( 116 196  74) ( 117 192  73) ( 118 188  72) ( 119 184  71) ( 120 180  70)
 ( 121 176  69) ( 122 172  68) ( 123 168  67) ( 124 164  65) ( 125 160  64) ( 126 156  63) ( 127 152  62) ( 128 148  61)
 ( 129 144  60) ( 130 140  58) ( 131 136  57) ( 132 132  56) ( 133 128  55) ( 134 124  53) ( 135 120  52) ( 136 116  50)
 ( 137 112  49) ( 138 108  47) ( 139 104  46) ( 140 100  44) ( 141  96  43) ( 142  92  41) ( 143  88  40) ( 144  84  38)
 ( 145  80  36) ( 146  76  35) ( 147  72  33) ( 148  68  31) ( 149  64  29) ( 150  60  27) ( 151  56  26) ( 152  52  24)
 ( 153  48  22) ( 154  44  20) ( 155  40  18) ( 156  36  16) ( 157  32  14) ( 158  28  12) ( 159  24  10) ( 160  20   8)

So that's about 3 fewer m near the poles and fewer than half at the Equator:

julia> cat(ifs[1:80],spd,spd-ifs[1:80],dims=2)
80×3 Matrix{Int64}:
  8  10  2
 10  12  2
 12  14  2
 14  16  2
 16  18  2
 18  20  2
 20  22  2
 22  24  2
 24  26  2
 26  28  2
 27  30  3
 29  32  3
 31  34  3
 33  36  3
 35  38  3
 36  40  4
 38  42  4
 40  44  4
 41  46  5
 43  48  5
 44  50  6
 46  52  6
 47  54  7
 49  56  7
 50  58  8
 52  60  8
 ⋮

and apparently follows the formula nfreq = floor((nlon - 1)/(2 + coslat^2)) - 1. At that resolution this amounts to 4784 loop iterations over m per hemisphere compared to 6400 in total (80 rings with 80 orders each), so there's a good amount of performance gain possible if we follow something similar. What needs to be checked, though, is how well this carries over to other grids and whether we can formulate it directly as a function of the truncation order.
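The formula can be sanity-checked against the table above. A sketch, assuming octahedral O80 ring lengths nlon = 16 + 4j and approximating the Gaussian colatitudes by the asymptotic π(j - 1/4)/(nlat + 1/2) (both assumptions, not SpeedyWeather code):

```julia
nlat_half = 80                            # O80: 80 rings per hemisphere
nlat = 2nlat_half
mmax = 79                                 # Tco79 truncation

nlon(j) = 16 + 4j                         # octahedral ring lengths, j = 1:nlat_half
colat(j) = π * (j - 0.25) / (nlat + 0.5)  # asymptotic Gaussian colatitudes

# IFS formula: largest zonal wavenumber kept on ring j (coslat = sin(colat))
nfreq(j) = min(mmax, floor(Int, (nlon(j) - 1) / (2 + sin(colat(j))^2)) - 1)

nfreq.(1:4)                               # [8, 10, 12, 14], as in the table
total = sum(nfreq, 1:nlat_half)           # 4784 iterations instead of 80·80 = 6400
```

This reproduces the (j, nlon, nfreq) triples above and the 4784 vs 6400 count per hemisphere.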

More information on this apparently in Courtier and Naughton 1994

@hottad @samhatfield

samhatfield commented 2 years ago

Because paywalls are 💩, here is the paper:

Quart J Royal Meteoro Soc - July 1994 Part B - Courtier - A pole problem in the reduced Gaussian grid.pdf

milankl commented 2 years ago

@hottad maybe you can clarify: if IFS/ectrans loops until nfreq = floor((nlon - 1)/(2 + coslat^2)) - 1 then towards the Equator this is effectively nlon/3, hence quadratic and not cubic. I currently don't understand why you wouldn't loop only until nlon/4 at the Equator for a cubic truncation. Visualised this looks like

[figure: m-loop limit per ring, IFS versus the proposed "new" scaling]

with IFS being quadratic at the Equator but with a coslat^2 scaling towards linear near the poles. Could we do something like what I labelled "new", which is also linear near the poles but uses a coslat scaling towards cubic at the Equator? That would save another 10% in the Legendre transform compared to IFS. The first few rings in comparison (columns: ring, nlon, ifs, new)

 1  20   8  11
 2  24  10  13
 3  28  12  15
 4  32  14  16
 5  36  16  18
 6  40  18  19
 7  44  20  21
 8  48  22  22

which for "new" is truly linear for rings 1, 2, 3 and then drops off only slowly

hottad commented 2 years ago

Hi Milan @milankl, thank you for this plot! I honestly do not understand much about this, but I found something interesting about the IFS formula.

In addition to zonal aliasing, there is another factor to consider when we determine the size of the loop over m. At latitudes outside the Tropics, the associated Legendre functions decay quickly as m gets larger. Thus, if we define some threshold ϵ below which we can ignore $P_n^m(\sin\mathrm{lat})$, we can define an m above which we do not need to loop when performing the direct or inverse Legendre transform. Calling this m m_to_retain(ϵ), I found that m_to_retain(ϵ) is rather insensitive to the exact choice of ϵ, and, interestingly, the IFS formula gives a good approximation to m_to_retain(ϵ).
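The decay argument can be sketched self-containedly (a hypothetical reimplementation, not the notebook's code: fully-normalized polynomials via the standard three-term recurrence, with the arbitrary normalization P̄_0^0 = 1):

```julia
# coefficient of the three-term recurrence for fully normalized P̄_n^m
a(n, m) = sqrt((4n^2 - 1) / (n^2 - m^2))

# maximum of |P̄_n^m(sinlat)| over n = m:nmax at one latitude
function legendre_max(m, nmax, sinlat)
    coslat = sqrt(1 - sinlat^2)
    Pmm = 1.0                                       # P̄_0^0 = 1 (arbitrary norm)
    for k in 1:m
        Pmm *= sqrt((2k + 1) / (2k)) * coslat       # sectoral P̄_m^m
    end
    m == nmax && return abs(Pmm)
    Pn1, Pn2 = a(m + 1, m) * sinlat * Pmm, Pmm      # P̄_{m+1}^m, P̄_m^m
    Pmax = max(abs(Pmm), abs(Pn1))
    for n in m+2:nmax
        Pn1, Pn2 = a(n, m) * (sinlat * Pn1 - Pn2 / a(n - 1, m)), Pn1
        Pmax = max(Pmax, abs(Pn1))
    end
    return Pmax
end

# number of orders to retain: largest m with any |P̄_n^m| above ϵ, plus one
function m_to_retain(sinlat, mmax; ϵ=1e-6)
    keep = 0
    for m in 0:mmax
        legendre_max(m, mmax, sinlat) > ϵ && (keep = m + 1)
    end
    return keep
end
```

Near the poles this retains far fewer orders than mmax+1, while at the Equator everything is kept.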

This is a plot for the O80 grid (nlat_half=80) with Nmax=79 truncation (cubic): the IFS formula and m_to_retain(ϵ=1e-6) agree quite well. Please have a look at this Jupyter notebook https://gist.github.com/hottad/5536882443cb5d018bbd57a88f10eb0e for precise definitions and the derivation (and code).

I do not know if this is related to how the IFS formula was devised, but the agreement seems too good to be just a coincidence.

hottad commented 2 years ago

In the above post I forgot to mention that this m_to_retain plot is basically a reproduction of Figure 3 (and other similar plots) of Courtier and Naughton 1994, except that Courtier and Naughton used quadratic truncation whereas I plotted cubic truncation.

I computed up to Tc2559 with the same code. The results below [figure: m truncation comparison tiles]: the IFS formula approximates m_to_retain(ϵ=1e-6) quite well at lower resolutions, but as the resolution (truncation wavenumber) increases it starts to deviate from m_to_retain(ϵ=1e-6) outside the polar regions.

Extrapolating the trend, I guess your "new" formula will be a better alternative to the IFS formula at very high resolutions beyond, say, Tc3999, but perhaps that's outside the scope of SpeedyWeather? If we used the "new" formula at the resolutions tested here, we would ignore a lot of $P_n^m(\sin\mathrm{lat})$ that are not small, so we might end up with an inexact associated Legendre transform.

Alternatively, do you think it makes sense to compute m_to_retain(ϵ) for some reasonable ϵ within the constructor of SpectralTransform type? This may slow down the setup process, but will save the cost of transforms in comparison to IFS formula, especially at a relatively high resolution.

In any case, we should decide which way to go by doing experiments.

@samhatfield @milankl Any thoughts?

milankl commented 2 years ago

Quick comment: I believe the dip in m_to_retain between 500 and 1700 at T2559 is due to issues with the computation of the Legendre polynomials at very high resolution similar to https://github.com/jmert/AssociatedLegendrePolynomials.jl/issues/27 (which is the package we are currently using to do that).

milankl commented 2 years ago

General comment: I agree that we would need to test these ideas while actually running the model. Daisuke, do you know of a good test case to do that?

In general, I like the m_to_retain idea, but I'm not sold that it's actually what we want: a cubic truncation is a form of filtering, so we actually want a certain error if that means we can scale-selectively filter out some high-frequency waves. I believe all of these methods are idempotent: you may start with a field that contains waves that shouldn't be representable at a cubic truncation, but once they are filtered out the transform should approach exactness (up to rounding errors). In that sense, I see the m_to_retain idea as an upper bound on m beyond which we shouldn't loop, but maybe we do want to shortcut the loop over m even further.

Going forward I suggest we create a parameter like shortcut_legendre::Symbol with options :linear, :quadratic, :cubic, :ifs, :lincub_coslat, which implements exactly these formulas and precomputes them in SpectralTransform. In spectral! we then just read those precomputed limits, so that we don't touch the spectral transforms themselves, only the precomputation when initialising SpectralTransform.
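A hedged sketch of what that precomputation could look like (all names hypothetical, just to illustrate the proposal; :lincub_coslat is omitted since its exact formula is still being discussed above). The limits are 1-based upper bounds for `for m in 1:limit`:

```julia
# precompute the m-loop limit per latitude ring from a Symbol option
function m_limits(shortcut_legendre::Symbol, nlons, colats, mmax)
    map(zip(nlons, colats)) do (nlon, colat)
        coslat2 = sin(colat)^2              # cos²(lat) = sin²(colat)
        m = if shortcut_legendre == :linear
            nlon ÷ 2 + 1
        elseif shortcut_legendre == :quadratic
            nlon ÷ 3 + 1
        elseif shortcut_legendre == :cubic
            nlon ÷ 4 + 1
        elseif shortcut_legendre == :ifs
            # IFS formula shifted by +1 for the 1-based loop over m
            floor(Int, (nlon - 1) / (2 + coslat2))
        else
            error("unknown shortcut_legendre = $shortcut_legendre")
        end
        min(m, mmax + 1)                    # never loop beyond mmax+1
    end
end
```

spectral! would then only read one integer per ring, leaving the transform kernels untouched.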

milankl commented 2 years ago

I've just created a plot that compares m_to_retain, ifs and lincub_coslat, the latter being the linear-to-cubic transition via coslat scaling. I've also added the savings in % (total loop iterations over m across all latitude rings) relative to 1:mmax+1 for every ring

[figure: m-loop limits for m_to_retain, ifs and lincub_coslat per ring, with % savings relative to 1:mmax+1]

hottad commented 2 years ago

Thank you Milan. I agree. m_to_retain should be interpreted as the upper bound of the size of m-loop.

What is not entirely clear to me is whether we can ensure orthonormality of $Y_n^m(\mathrm{lon},\mathrm{lat})$ if we reduce m beyond m_to_retain. Perhaps, before trying model runs, we should start by checking the orthonormality (I mean: if we initialize alm with the (l,m)-element set to one and all others zero, and do a round trip of transforms by calling gridded!() and then spectral!(), does that perfectly (within the desired tolerance) restore the original alm, and does this hold for every pair (l,m)?). I guess this is equivalent to what you call idempotency?
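That check can be written generically; a sketch with `fwd`/`inv` standing in for spectral! and gridded!() (just the shape of the test, not SpeedyWeather's API, whose functions mutate in place):

```julia
# round-trip every unit coefficient and record the worst restoration error
function roundtrip_error(fwd, inv, lmax, mmax)
    maxerr = 0.0
    for m in 0:mmax, l in m:lmax
        alm = zeros(ComplexF64, lmax + 1, mmax + 1)
        alm[l+1, m+1] = 1                   # single (l,m) mode
        alm2 = fwd(inv(alm))                # to grid and back
        alm2[l+1, m+1] -= 1                 # subtract the input
        maxerr = max(maxerr, maximum(abs, alm2))
    end
    return maxerr
end
```

With exact transforms this returns something at rounding-error level; with an m-loop shortened beyond m_to_retain it would reveal which (l,m) pairs lose orthonormality.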

Going forward I suggest we create a parameter like shortcut_legendre::Symbol with options like :linear, :quadratic, :cubic, :ifs, :lincub_coslat which implements exactly these formulas and precomputes them in SpectralTransform.

Thanks. That's a good idea. Will you be able to implement and test this?

hottad commented 2 years ago

Quick comment: I believe the dip in m_to_retain between 500 and 1700 at T2559 is due to issues with the computation of the Legendre polynomials at very high resolution similar to jmert/AssociatedLegendrePolynomials.jl#27 (which is the package we are currently using to do that).

This underflow/overflow problem is a well-known issue. Just in case you are not aware, there are two known remedies.

One way to resolve this is an exponent-extension method: represent Pnm with a struct (called an X-number) pairing an integer (to store the exponent) with a floating-point number (either Float32 or Float64), and override operators like mul(Float, X-number) and add(X-number, X-number) with bespoke methods. A detailed explanation and sample Fortran code are documented in Fukushima (2012, Journal of Geodesy) (it's behind a paywall but the author's copy on researchgate.net is available from here). It should be easy to implement an X-number in Julia as something like Xnumber{T} <: AbstractFloat where T. Experience from a Fortran 90 implementation is that the performance penalty of the X-number representation is quite small.
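For illustration, a minimal X-number sketch in Julia (a heavily simplified assumption of Fukushima's scheme, not a full implementation: BIG = 2^256 is an arbitrary power of two, only the two operations the recurrence needs are defined, and renormalization assumes order-one multipliers as in the Legendre recurrence):

```julia
# X-number: value = f * BIG^e, with f kept in a safe range so the
# recurrence never under/overflows Float64
struct XNumber
    f::Float64   # significand
    e::Int       # extended exponent
end

const BIG  = 2.0^256     # power of two, so rescaling is exact
const BIGI = 2.0^-256

# renormalize after an operation (at most one step needed for order-one factors)
function norm_x(f, e)
    abs(f) >= BIG     && return XNumber(f * BIGI, e + 1)
    0 < abs(f) < BIGI && return XNumber(f * BIG,  e - 1)
    return XNumber(f, e)
end

Base.:*(a::Real, x::XNumber) = norm_x(a * x.f, x.e)

function Base.:+(x::XNumber, y::XNumber)
    x.e == y.e && return norm_x(x.f + y.f, x.e)
    # the term with the smaller exponent is down-scaled (or negligible)
    x.e > y.e  && return norm_x(x.f + y.f * BIGI^(x.e - y.e), x.e)
    return norm_x(x.f * BIGI^(y.e - x.e) + y.f, y.e)
end

Base.Float64(x::XNumber) = x.f * BIG^x.e   # may under/overflow, by design
```

Multiplying an XNumber by a tiny factor thousands of times keeps full precision in f while a plain Float64 underflows to zero.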

Alternatively, when you compute Pnm by a recurrence formula like Belousov's, you can check whether Pnm goes beyond/below some thresholds (like 1e+16 and 1e-16 for Float64) and, if that occurs, apply scaling by some factors (e.g. 2^20 and 2^(-20); these can be arbitrary but should be powers of 2 to avoid rounding error), keeping track of how many times the scaling was applied for each Pnm. This is what is done in IFS and JMA's global model. Wedi et al. (2013), Section 2, very briefly mentions this strategy in IFS:

Equation (9) is also unstable but can be made stable by tracking the values of the numbers and keeping them within an acceptable range that can be represented with double precision
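This rescaling strategy can be sketched for just the sectoral part of the recurrence (a toy illustration, not IFS/JMA code; the thresholds 2^±60 are arbitrary powers of two as described above):

```julia
# compute P̄_m^m(sinlat) ∝ ∏ₖ √((2k+1)/(2k)) · coslat, rescaling whenever the
# running value drops below 2^-60 and counting how often that happened
function sectoral_scaled(m, coslat)
    p, nrescale = 1.0, 0
    for k in 1:m
        p *= sqrt((2k + 1) / (2k)) * coslat
        if abs(p) < 2.0^-60
            p *= 2.0^60              # exact rescaling by a power of two
            nrescale += 1
        end
    end
    return p, nrescale               # P̄_m^m = p * 2^(-60nrescale)
end
```

At m = 3000 and coslat = 0.5 the unscaled product underflows to zero, while the scaled version keeps full precision in p and the true magnitude in the rescaling count.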

milankl commented 2 years ago

The first approach Daisuke suggests can already be done easily with AssociatedLegendrePolynomials.jl as it's type-flexible, but quadmath arithmetic like Float128 currently isn't fast.

julia> using AssociatedLegendrePolynomials, Quadmath, BenchmarkTools
julia> Λ = zeros(Float32,3001,3001)    # always store in a Float32 array
julia> @btime λlm!(Λ, 3000, 3000, Float32(cos(π/4)));    # the type of cos(colat) determines the format used for calculation
  18.604 ms (5 allocations: 80 bytes)
julia> @btime λlm!(Λ, 3000, 3000, Float64(cos(π/4)));
  22.545 ms (5 allocations: 80 bytes)
julia> @btime λlm!(Λ, 3000, 3000, Float128(cos(π/4)));
  6.238 s (6 allocations: 192 bytes)

Although I have quite some expertise in writing new number formats, I'm not particularly keen to implement an X-number myself. On the other hand, how hard can it be? 😆 The good thing in Julia is that you don't have to overload the arithmetic everywhere.

The second approach would require some tweaks to AssociatedLegendrePolynomials.jl; @jmert, you may find this interesting.