dfm / nufft-ls

Benchmarking LS periodogram implementations
Apache License 2.0

Comparison of astropy, NUFFT, and phase winding #11

Open lgarrison opened 6 months ago

lgarrison commented 6 months ago

I've been looking into the performance of astropy's FFT-based Lomb-Scargle implementation and whether it could be improved with a finufft-based solution, or possibly a brute-force solution that uses phase-winding to reduce the sin/cos cost. In parallel, I've been looking into GPU implementations (cufinufft and CUDA phase-winding), but I'll stick to CPU implementations in this issue!

CC: @dfm (and @ahbarnett in case you're interested!)

The Methods

Here are the methods I'm comparing in the plots below:

  * winding: brute force, using phase winding to avoid most of the sin/cos calls
  * astropy_fft: astropy's FFT-based "fast" Lomb-Scargle method
  * astropy_fft (optimized): the same algorithm with some simple optimizations
  * finufft: a NUFFT-based version using finufft, single-threaded
  * finufft (nthread=16): the same, with 16 threads

For now I'm using a single object with 3554 observations (the size that Gowanlock et al. used) and a varying number of frequency bins.

The Results

Single precision

[plot: runtime vs. number of frequency bins, single precision]

Double precision

[plot: runtime vs. number of frequency bins, double precision]

These plots were generated with the command: `python bench/bench.py bench -logNfdelta 0.5 -m winding -m astropy_fft -m astropy_fft_opt -m finufft -m finufft_par -dtype f8` (and `-dtype f4` for single precision).

Discussion

In both single and double precision, there is certainly some room to improve on astropy_fft's performance! Even the simple astropy_fft (optimized) is almost a factor of two faster in single precision, with a smaller gain in double precision, and in both cases it is comparable to the finufft performance.

It's tempting to consider whether a quick PR into astropy to implement astropy_fft (optimized) would get most of the performance of finufft without the additional dependency, but I think there are two reasons that the finufft version is still superior:

  1. finufft parallelizes, and while finufft (nthread=16) is only a factor of 2 faster instead of 16, it's the clear winner in performance at large Nf;
  2. the finufft version is arbitrarily accurate, while the astropy versions can have noticeable errors (see below). And finufft can be made even faster at lower accuracy.

In just about all cases, the performance of the winding implementation is worse than that of the (NU)FFT-based versions, which is not surprising but good to confirm. I think there's a good chance that it parallelizes best out of all the methods, though, and could even be the fastest overall. However, I'm not positive that getting the fastest single-object method is the priority, and I suspect all these methods parallelize well over multiple objects.

The brute-force sin/cos version is even slower than the winding version, even using highly optimized libraries like Intel SVML (see #6), so I didn't include it in the comparison.
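For anyone unfamiliar with the winding approach, here's a minimal sketch of the idea (my own illustrative code, not the repo's implementation, and with the full LS normalization omitted): evaluate sin/cos once per observation at the starting frequency and once at the frequency spacing, then advance each frequency bin with the angle-addition recurrence instead of calling sin/cos again.

import numpy as np

def winding_power(t, y, f0, df, Nf):
    # phases at the starting frequency, and the per-bin rotation
    c = np.cos(2*np.pi*f0*t); s = np.sin(2*np.pi*f0*t)
    dc = np.cos(2*np.pi*df*t); ds = np.sin(2*np.pi*df*t)
    power = np.empty(Nf)
    for k in range(Nf):
        Cy = y @ c; Sy = y @ s
        power[k] = Cy*Cy + Sy*Sy         # stand-in for the full LS power
        c, s = c*dc - s*ds, s*dc + c*ds  # "wind" the phases forward by df
    return power

The recurrence accumulates rounding error over many bins, so a real implementation would presumably re-sync with exact sin/cos evaluations every so often.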

Accuracy

First of all, here's what the spectrum of this fake object looks like:

[plot: periodogram of the fake object, comparing methods]

Visually, we can see some difference between green and orange (finufft and astropy_fft). Nothing that would affect peak finding, but still noticeable.

Doing a more quantitative comparison, we see that winding and finufft agree basically perfectly, and astropy_fft is the outlier:

winding vs finufft
        isclose 100%
        max err 2.657e-09%
        rms err / mean 4.516e-10%
finufft vs astropy_fft
        isclose 14.22%
        max err 7.016%
        rms err / mean 0.7898%

A max error of a few percent and an RMS error of nearly 1%, even in double precision, seems large, and gives finufft a clear advantage here. Users could either enjoy the increased accuracy along with the increased performance, or relax the accuracy and speed finufft up even further.

The plot and stats were produced by `python bench/bench.py compare -m winding -m finufft -m astropy_fft -dtype f8`.
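(Schematically, the comparison metrics above are along these lines; this is a rough reconstruction rather than the exact bench.py code, and the rtol is illustrative:)

import numpy as np

def compare_stats(power, ref):
    # fraction of bins agreeing to a tolerance, plus relative error summaries
    print(f"isclose {100*np.mean(np.isclose(power, ref, rtol=1e-6)):.4g}%")
    print(f"max err {100*np.max(np.abs(power - ref)/np.abs(ref)):.4g}%")
    print(f"rms err / mean {100*np.sqrt(np.mean((power - ref)**2))/np.mean(ref):.4g}%")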

Next Steps

I think it's worth starting to consider what the right approach would be to package this up. I suspect that Astropy won't want a finufft dependency, and their API currently doesn't allow multiple objects (for parallelizing/batching computations), so that probably means this should live in a standalone package. We could add hooks/monkey patch Astropy to add this as a known method, but that might be fragile. Might be easier to just mimic their API and be standalone. Happy to discuss!

Regardless, I have a few small technical TODOs, like adding support for f0 != 0, center_data, and fit_mean.
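For anyone following along, those TODOs map onto existing astropy options that a standalone package would want to match; e.g. (data and grid here are made up for illustration):

import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(0)
t = np.sort(100*rng.random(3554))
y = np.sin(2*np.pi*0.7*t) + 0.1*rng.standard_normal(t.size)

# the TODOs correspond to these astropy features: a floating-mean model
# (fit_mean), pre-centering the data (center_data), and a frequency grid
# starting at f0 != 0
ls = LombScargle(t, y, fit_mean=True, center_data=True)
freq = 0.05 + 1e-3*np.arange(10_000)    # regular grid with f0 = 0.05
power = ls.power(freq, method='fast')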

ahbarnett commented 6 months ago

Re accuracy, suggest you plot difference of outputs on log10 yscale. 7% max error! Should be emphasized to the astropy community they are getting the wrong answer.

Re finufft use: you will benefit from playing with parameters here. test/finufft1d_test is v useful as a testbed. Your # NU pts is tiny, so the spreading is insignificant (so eg requiring 15 [overkill] vs 9 digits no difference). Thus: it's all about the FFT.

A) Thus upsampfac has a big benefit. Looks like you're not setting it to 1.25. You should (a 2.5x speedup is available):

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 test/finufft1d_test 1e7 3554 1e-9 0 2
test 1d type 1:
    3554 NU pts to 10000000 modes in 1.39 s     2.55e+03 NU pts/s
    one mode: rel err in F[3700000] is 4.95e-10

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 test/finufft1d_test 1e7 3554 1e-9 0 2 1.25
test 1d type 1:
    3554 NU pts to 10000000 modes in 0.622 s    5.71e+03 NU pts/s
    one mode: rel err in F[3700000] is 1.28e-09

PS you can't request 15 digits for upsampfac=1.25. 9 is enough!
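(From Python, upsampfac is just a plan option; a rough sketch at Lehman's sizes:)

import numpy as np
import finufft

N, Nf = 3554, 10_000_000
t = 2*np.pi*np.random.rand(N)
c = np.random.randn(N) + 1j*np.random.randn(N)

# upsampfac=1.25 shrinks the internal FFT grid (1.25*Nf vs 2*Nf modes);
# with so few NU pts the spreading cost is negligible, so this is ~2x
plan = finufft.Plan(nufft_type=1, n_modes_or_dim=(Nf,), eps=1e-9,
                    dtype=np.complex128, upsampfac=1.25)
plan.setpts(t)
f = plan.execute(c)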

B) 16 threads (full HT? what's your machine?) may be too many for such small problems - try less. 1D FFTs don't parallelize much anyway, and plan is more expensive for full HT. Watch this on 8-core ryzen:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 test/finufft1d_test 1e7 3554 1e-9 0 2 1.25
test 1d type 1:
    3554 NU pts to 10000000 modes in 0.625 s    5.69e+03 NU pts/s
    one mode: rel err in F[3700000] is 1.28e-09
(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=4 test/finufft1d_test 1e7 3554 1e-9 0 2 1.25
test 1d type 1:
    3554 NU pts to 10000000 modes in 0.331 s    1.07e+04 NU pts/s
    one mode: rel err in F[3700000] is 3.58e-09
(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 test/finufft1d_test 1e7 3554 1e-9 0 2 1.25
test 1d type 1:
    3554 NU pts to 10000000 modes in 0.305 s    1.16e+04 NU pts/s
    one mode: rel err in F[3700000] is 3.58e-09
(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=16 test/finufft1d_test 1e7 3554 1e-9 0 2 1.25
test 1d type 1:
    3554 NU pts to 10000000 modes in 0.656 s    5.41e+03 NU pts/s
    one mode: rel err in F[3700000] is 4.82e-09

16 threads is nearly twice as slow as 8, mostly because the fftw plan is way slower. To figure that out, the timing breakdown is useful (debug=1):

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 test/finufft1d_test 1e7 3554 1e-9 1 2 1.25
test 1d type 1:
[finufft_makeplan] new plan: FINUFFT version 2.2.0 .................
[finufft_makeplan] 1d1: (ms,mt,mu)=(10000000,1,1) (nf1,nf2,nf3)=(12500000,1,1)
               ntrans=1 nthr=8 batchSize=1 
[finufft_makeplan] kernel fser (ns=15):     0.0197 s
[finufft_makeplan] fwBatch 0.20GB alloc:    3.1e-05 s
[finufft_makeplan] FFTW plan (mode 64, nthr=8): 0.00304 s
[finufft_setpts] sort (didSort=1):      0.00389 s
[finufft_execute] start ntrans=1 (1 batches, bsize=1)...
[finufft_execute] done. tot spread:     0.0853 s
               tot FFT:             0.0833 s
               tot deconvolve:          0.0797 s
    3554 NU pts to 10000000 modes in 0.301 s    1.18e+04 NU pts/s
    one mode: rel err in F[3700000] is 3.58e-09

Playing around with this you see the FFT only ever gets 2x faster. But the "kernel fser" step parallelizes well, to 8x faster. Also: the FFTW plan is slow, esp at full HT. Both it and kernel fser would be amortized if you kept the finufft plan around for repeated calls.

C) Thus you may benefit from the guru (plan) interface to finufft, planning once per size, esp if you use fftw_plan = MEASURE, which is v slow but can give a faster execute. I assume the application would want this (multiple LS calls at the same M), so you'd want your LS code to have a plan interface too, so users can benefit. Batching gets you 2x, eg:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=8 test/finufft1dmany_test 100 1e6 3554 1e-9 0 0 0 2 1.25
test 1d1 many vs repeated single: ------------------------------------
ntr=100: 3554 NU pts to 1000000 modes in 2.53 s     1.41e+05 NU pts/s
    one mode: rel err in F[370000] of trans#99 is 5.88e-10
100 of: 3554 NU pts to 1000000 modes in 4.94 s  7.19e+04 NU pts/s
            speedup      T_FINUFFT1D1 / T_finufft1d1many = 1.96
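(In the Python interface the same plan-reuse + batching pattern looks roughly like this; a sketch with made-up sizes, where n_trans transforms share the NU points and the FFTW plan:)

import numpy as np
import finufft

N, Nf, ntr = 3554, 1_000_000, 100
t = 2*np.pi*np.random.rand(N)
C = np.random.randn(ntr, N) + 1j*np.random.randn(ntr, N)

# pay the planning cost (incl. any expensive FFTW planning mode) once...
plan = finufft.Plan(nufft_type=1, n_modes_or_dim=(Nf,), n_trans=ntr,
                    eps=1e-9, dtype=np.complex128, upsampfac=1.25)
plan.setpts(t)        # shared NU pts across the batch
F = plan.execute(C)   # shape (ntr, Nf); ...then amortize it over executes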

CPU conclusions: by avoiding bad fftw planning and using upsampfac=1.25, on the 8-core machine I got 1d1 down from 1.0 sec to 0.17 s per call, a 6x speedup, even without batching:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=16 test/finufft1d_test 1e7 3554 1e-15 0 2 2.0
test 1d type 1:
    3554 NU pts to 10000000 modes in 0.966 s    3.68e+03 NU pts/s
    one mode: rel err in F[3700000] is 4.98e-10

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=16 perftest/guru_timing_test 10 1 1 1e7 0 0 3554 1e-9 0 0 0 2 1.25
FINUFFT 1d1 use guru interface to do 10 calls together:-------------------
    plan, for 10000000 modes:       0.123 s
    setpts for 3554 NU pts:         0.0033 s
    exec                    1.56 s
ntr=10: 3554 NU pts to 10000000 modes in 1.68 s     2.11e+04 NU pts/s

Please make sure you're using the above tricks!

Good luck w/ GPU - that's what most of the recent papers seemed to target. Basically it's just the speed of cufft you'll see. Getting H2D/D2H transfers may be the bottleneck instead... unless the workflow keeps data on the GPU?

Best, Alex

lgarrison commented 6 months ago

Thanks Alex, that's extremely helpful! I didn't put much effort into tuning finufft because it was already the winner, but using upsampfac=1.25 and FFTW_MEASURE gives a nice additional speedup, especially for large frequency grids:

image

image

Relatedly, I was wondering if you might have an insight here: each periodogram requires two NUFFTs, each of length 2N_uniform (the negative frequency half just gets thrown out). I know that one can pack two R2C FFTs into a single C2C FFT, but is there an equivalent trick for two NUFFTs if the NU points are different? One has NU points t, the other `2t`. Here's where I'm calling finufft: link.

In terms of the benchmarking, I was using 16 cores on an Icelake cluster node. No hyper-threading, and the CPU scaling governor was set to performance. I was also pinning OpenMP threads to cores.

I'm not sure how often batching with the same NU points will come up in real applications, maybe @dfm would know. Gowanlock's "batch mode" assumed different NU points for each object, for example. But hopefully even non-batched applications that use the same uniform grid size will be able to amortize the cost of FFTW_MEASURE—it can be pretty slow!

Here are the accuracy plots, now showing the fractional error on a log scale:

[plots: fractional error vs. frequency on a log scale, float64 and float32]

finufft looks great in float64; not sure if there's a way to tease out more precision in float32? I'm using eps=1e-6 there. It's also possible the loss of precision is occurring in the post-processing of the NUFFTs into the periodogram; I haven't checked yet.
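(For the record, the single-precision setup is just the complex64 version of the plan, along these lines; sizes are illustrative:)

import numpy as np
import finufft

N, Nf = 3554, 1_000_000
t = (2*np.pi*np.random.rand(N)).astype(np.float32)
c = (np.random.randn(N) + 1j*np.random.randn(N)).astype(np.complex64)

# eps=1e-6 is already close to the float32 limit for a transform this size
plan = finufft.Plan(nufft_type=1, n_modes_or_dim=(Nf,), eps=1e-6,
                    dtype=np.complex64, upsampfac=1.25)
plan.setpts(t)
f = plan.execute(c)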

ahbarnett commented 6 months ago

Great. MEASURE is v slow, but it looks like it gives you a speedup for float32.

Exploiting the half-length transform is easy: just rephase the inputs (and get the tricky mod-2 correct; there are 4 cases of Nf%4, which I checked). Watch this!:

# demo for Lehman Garrison & DFM re phased half-length 1d1 to avoid computing
# then discarding negative output frequency coeffs. Barnett 2/2/24.

import numpy as np
import time
import finufft

Nf=1_000_000  # modes wanted
# dummy data:
N = 3554       # NU
t=2*np.pi*np.random.rand(N,)                        # locs
yw = np.random.randn(N,) + 1j*np.random.randn(N,)   # strengths

# Lehman's original use
# edited from https://github.com/dfm/nufft-ls/blob/617ee63fa28b392e8389982a8c7d9f433ea79782/bench/bench.py#L69
plan = finufft.Plan(nufft_type=1, n_modes_or_dim=(Nf,), n_trans=1, eps=1e-9, dtype=np.complex128, upsampfac=1.25)
plan.setpts(t)
timer=time.time()
f1 = plan.execute(yw)
print("time (sec): ",time.time()-timer)
f1 = f1[Nf//2:]          # Lehman keeps freqs 0,1,...,(Nf-1)//2
del plan

# show how to do it with half-length transform (-> twice as fast!)
Nh = (Nf+1)//2
plan = finufft.Plan(nufft_type=1, n_modes_or_dim=(Nh,), n_trans=1, eps=1e-9, dtype=np.complex128, upsampfac=1.25)
plan.setpts(t)
Nshift = Nh//2   # integer change in mode freqs vs original transform
# the sign of 1j here depends on isign in the transform...
timer=time.time()
f1h = plan.execute(yw * np.exp(1j*Nshift*t))   # phase the strengths
print("time (sec): ",time.time()-timer)
del plan

print("max err in half-length for nonneg freqs = ",max(np.abs(f1-f1h)))
# loss of digits rel to eps is unavoidable in big 1d transform, I think

Gives:

In [125]: %run avoid_discard_neg_freqs_1d1.py
time (sec):  0.019863367080688477
time (sec):  0.009358644485473633
max err in half-length for nonneg freqs =  5.395172720397828e-06

Make sure you work through the math so you grok the trick.

I think the loss for large Nf is unavoidable here with a NUFFT (see the error discussion in our 2018 paper, where a factor of Nf enters; we call it N1). Now you have another speedup though.

I won't PR it since I don't know.

Your log10 plots are much more informative. Since L-S divides by CC it could be ill-conditioned (amplifying the nufft error) - I don't know.

Not batching, but just saving and reusing the finufft plan, is what I was suggesting in the end.

Cheers, Alex

lgarrison commented 6 months ago

Thanks, that's perfect! I knew I had to be missing a trick.

With those optimizations, finufft is now > 10x faster than astropy_fft in float32, and 5x faster in float64!

The math of the trick makes perfect sense too: the C2C FFT does $\sum_j c_j e^{ikx_j}$ for $k \in \{-N/2, ..., N/2-1 \}$, so multiplying $c_j$ by $e^{ix_jN/2}$ effectively shifts the frequency range to $k \in \{0, ..., N-1 \}$, which is exactly what we want.

ahbarnett commented 6 months ago

I am now convinced that errors of size O(N·eps_mach) are inevitable. They are a property of the problem specification, due to the inevitable O(eps_mach) rounding error on the input NU point locations x_j. Thus no algorithm can do better, even in principle. This is no more mysterious than the condition number (w.r.t. x) of evaluating exp(ikx) for fixed, huge k: the sensitivity cannot be less than O(k). I will add this to the docs.
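(Spelling out the estimate: a rounding perturbation $x_j \to x_j + \delta$ with $|\delta| \sim \epsilon_{\rm mach}|x_j|$ and $x_j = O(1)$ changes each term of the transform by

$$\left| e^{ik(x_j+\delta)} - e^{ikx_j} \right| = \left| e^{ik\delta} - 1 \right| \approx |k\delta| \sim |k|\,\epsilon_{\rm mach},$$

so the modes with $|k| \sim N$ carry an error floor of $O(N\epsilon_{\rm mach})$ no matter how the sum is computed.)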

Thus I would be suspicious of your errors beating 1e-3 for N=1e7 in float32. There is no way to get less than around 1e-1 relative error on the k=5e6 output in float32. Your reference may be too similar to your own tested method...

lgarrison commented 6 months ago

Thanks so much for following up on this! This is very relevant to the GPU version that I'm working on—cufinufft is faster in most situations, although it's essentially limited to float64 by the precision issues you point out. In some cases, it can be faster to use a non-nufft approach in float32 than cufinufft in float64, and such non-nufft approaches don't seem to suffer the float32 precision issues to the same degree.

I did leave out an important piece of information from the previous error plots, which was that I was using Nf=10**4 (I hadn't fully appreciated the error scaling). So I think that answers why the finufft error was beating 1e-3? Or were you referring to a different error?

Here's a stronger test with Nf=10**6. The "shape" of the plot has changed now that I'm appropriately rescaling the inputs for R2C, and we see the much larger finufft errors that you were expecting:

[plot: fractional error vs. frequency, Nf=10^6, float32]

The baseline64 reference is computed by promoting the float32 inputs to float64, then doing naive sincos(f*t) evaluations.

Expect more on CUDA soon!

ahbarnett commented 6 months ago

> and such non-nufft approaches don't seem to suffer the float32 precision issues to the same degree.

Any approach that computes the same quantity must suffer the same float32 precision issues. (That's the point of my new doc :)

It could conceivably be that the LS output is not sensitive to the NU sample locations, only the NUFFT route to computing LS, in which case somehow the NUFFT is an unstable way to get to LS. I don't recall the LS definition, but I highly doubt it. That could be tested: just add random jitter size eps to the x_j locations and measure the LS output change. Does that ratio grow with Nf?
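(A sketch of that test, using astropy's LombScargle as the black box; the jitter size, data, and grid here are made up:)

import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(0)
t = np.sort(100*rng.random(3554))
y = np.sin(2*np.pi*0.7*t) + 0.1*rng.standard_normal(t.size)

for Nf in [10**4, 10**5, 10**6]:
    freq = 1e-3*(1 + np.arange(Nf))
    tj = t*(1 + 1.2e-7*rng.standard_normal(t.size))  # float32-size jitter
    p0 = LombScargle(t,  y).power(freq, method='fast')
    p1 = LombScargle(tj, y).power(freq, method='fast')
    # if this ratio grows ~linearly with Nf, LS itself is ill-conditioned
    print(Nf, np.max(np.abs(p1 - p0))/np.max(p0))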

> I did leave out an important piece of information from the previous error plots, which was that I was using Nf=10**4 (I hadn't fully appreciated the error scaling). So I think that answers why the finufft error was beating 1e-3? Or were you referring to a different error?

I was referring to the impossibility of your baseline getting <1e-3. This still holds for your Nf=1e6 graph. There is no way any method can get more than about 1 digit for Nf=1e6 in single precision.

All of this is simply due to the fact that your sincos(k*x) cannot in principle be more accurate than about 1e-1 if x is a float32 and k = 1e6, say. This applies to NUFFT, phase winding, and to baseline.
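(A few lines of numpy make the point; the value of x is arbitrary:)

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2*np.pi)  # an O(1) NU location, known only to float32
k = 1.0e6
# evaluate both sines in float64, so the only error is the float32
# rounding of x itself: the phase error is k*|dx| ~ k*eps_single*|x|
err = abs(np.sin(k*np.float64(np.float32(x))) - np.sin(k*x))
print(err)  # typically ~1e-2 to 1e-1 for a random draw, as claimed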

> Here's a stronger test with Nf=10**6. The "shape" of the plot has changed now that I'm appropriately rescaling the inputs for R2C, and we see the much larger finufft errors that you were expecting:

Looks like the small freqs have large errors due to the Nf/4 shift trick used to halve the NUFFT size. That is actually less accurate than the intrinsic condition # of the problem allows for small k. Something to be aware of, if you want more accuracy at small k.

> The baseline64 reference is computed by promoting the float32 inputs to float64, then doing naive sincos(f*t) evaluations.

Well, promoting the inputs isn't fair, since it creates "accuracy" where there is no right to be any :) Baseline can't cheat and use float64. The user data doesn't contain that accuracy.

Ponder the above - and the concept of condition number of the (LS) problem - and you can make the tests more fair.

lgarrison commented 6 months ago

Thanks for the write-up on the condition number, it inspired me to do some more thinking and tests. I think the issue, or at least part of it, is that finufft seems to not be reaching the condition number of the problem in the area of parameter space this test is exploring. Specifically, the previous test was using df=1e-5, so the maximum frequency is only Nf*df = 1e6 * 1e-5 = 10; I think the relevant f for the condition number is therefore 10, not 10^6. I wrote up a notebook testing this, purely looking at NUFFTs and not LS: https://gist.github.com/lgarrison/5e86646ef0b32a554af4ddc9b324055a

However, I think ultimately this area of parameter space is not that useful. I was originally using a very small df to keep the arguments to sin/cos small so they wouldn't end up on a slow evaluation path, but real applications are probably going to use larger values. And I totally agree that the condition number in most applications is going to be very restrictive for 32-bit. I think my plan going forward will be to focus on the 64-bit version.