dfm / nufft-ls

Benchmarking LS periodogram implementations

CUDA and cufinufft implementations #12

Open lgarrison opened 7 months ago

lgarrison commented 7 months ago

CC: @dfm, @ahbarnett

Background

In tandem with the CPU implementations in #11, I've been looking at GPU acceleration of Lomb-Scargle, either with native CUDA or with cufinufft. Indeed, the original motivation for this project was the idea that we could do phase-winding (#4) on the GPU, essentially eliminating the trig function cost and reducing the whole problem to FMA, which GPUs are very good at.
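To make the winding trick concrete (my notation here, not necessarily that of #4): on an evenly spaced frequency grid, the phasor at each bin follows from the previous bin by a fixed rotation, so the trig cost is paid once per data point and every additional bin costs only multiplies and adds:

```latex
% Phase winding on an evenly spaced grid \omega_k = \omega_0 + k\,\delta\omega:
e^{\,i\,\omega_{k+1} t_j} \;=\; e^{\,i\,\omega_k t_j} \cdot e^{\,i\,\delta\omega\, t_j}
% One sincos per point t_j gives the rotation e^{i\,\delta\omega\,t_j};
% every further bin is then one complex multiply, i.e. a handful of real FMAs.
```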

Implementation

Unfortunately, what I've found is that phase-winding doesn't adapt very well to the GPU. The cause seems to be L1 cache exhaustion. To recap the winding idea: each thread loops over all the data points and accumulates coefficients for a sequence of frequency bins, rather than a single bin. But with 4 coefficients per bin and (up to) 2048 resident threads per SM, a single bin per thread already costs 64 KiB of cache (in double precision). Adding more bins per thread very quickly exhausts the cache and completely tanks the performance (the cache spilling shows up clearly in the ncu profiler).
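For concreteness, here is a minimal sketch of the kernel structure I'm describing (illustrative only: the names, the particular four running sums, and the missing bounds checks and data centering are my simplifications, not this repo's actual kernel):

```cuda
// Output-driven phase-winding kernel sketch: each thread owns
// BINS_PER_THREAD consecutive frequency bins and loops over all N points,
// winding the phasor from bin to bin instead of calling sincos per bin.
// Assumes the grid exactly covers Nf = gridDim.x * blockDim.x * BINS_PER_THREAD bins.
constexpr int BINS_PER_THREAD = 4;  // pushing this higher spills the accumulators

__global__ void winding_ls(const double* __restrict__ t,
                           const double* __restrict__ y,
                           int N, double w0, double dw,
                           double* __restrict__ out)  // 4 sums per bin
{
    const int bin0 = (blockIdx.x * blockDim.x + threadIdx.x) * BINS_PER_THREAD;

    // The "4 coefficients per bin": sum(y*cos), sum(y*sin), sum(cos^2), sum(sin*cos).
    // This per-thread state is exactly what exhausts L1 as BINS_PER_THREAD grows.
    double Syc[BINS_PER_THREAD] = {}, Sys[BINS_PER_THREAD] = {};
    double Scc[BINS_PER_THREAD] = {}, Ssc[BINS_PER_THREAD] = {};

    for (int j = 0; j < N; ++j) {
        const double tj = t[j], yj = y[j];
        double s, c, ds, dc;
        // two sincos calls per point, amortized over BINS_PER_THREAD bins:
        sincos((w0 + bin0 * dw) * tj, &s, &c);  // phasor at this thread's first bin
        sincos(dw * tj, &ds, &dc);              // per-bin rotation
        #pragma unroll
        for (int k = 0; k < BINS_PER_THREAD; ++k) {
            Syc[k] += yj * c;  Sys[k] += yj * s;
            Scc[k] += c * c;   Ssc[k] += s * c;
            const double cn = c * dc - s * ds;  // wind to the next bin: pure FMA
            s = s * dc + c * ds;
            c = cn;
        }
    }
    for (int k = 0; k < BINS_PER_THREAD; ++k) {
        double* o = out + 4ull * (bin0 + k);
        o[0] = Syc[k]; o[1] = Sys[k]; o[2] = Scc[k]; o[3] = Ssc[k];
    }
}
```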

We were hoping to achieve 100-1000+ bins per thread, but I haven't come up with a way to use the winding idea with more than 4-16 bins per thread, max (it depends on lots of arch-specific details like the L1 cache size and kernel occupancy). Sometimes it's faster than a brute-force approach, but usually it's not. Notably, the winding kernel with 1 bin per thread is slower than the brute-force kernel, even though they're doing the same amount of math, which indicates the compiler is finding some efficiencies in the simplicity of the brute-force approach.

Regardless of the CUDA performance, in very few cases are any of the implementations faster than cufinufft!

Timings

Here's an example from an A100 GPU. Winding is slower than brute-force here; I've only seen it faster on cards with emulated double-precision (consumer and workstation cards, mostly), which makes sense as winding reduces expensive trig. I've included CPU finufft for scale; there is indeed a big advantage in going to the GPU!

[Figure: periodogram runtime vs. Nf on an A100, comparing cuda, cuda_winding, cufinufft, and CPU finufft]

The timings for cuda are nearly a factor of 2 faster than Gowanlock et al., as expected based on relative GPU speeds.

Going to float32 doubles the speed of cuda and more than doubles it for cuda_winding (because a larger window of freq bins fits in L1), but cufinufft is still faster above Nf=10^5. So if we wanted to optimize for the case of small Nf, then a brute-force CUDA implementation in float32 might be worthwhile, but we'd have to be careful to limit it to small Nf due to the challenging condition number of the problem. I think it's about a factor of 2 speedup, so not nothing, but maybe not worth the complexity.
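If we did pursue that, here's a sketch of the shape it could take (hypothetical names, not this repo's code): template the brute-force kernel on the scalar type, and dispatch to float only when Nf is small enough that float32 conditioning is acceptable.

```cuda
// Precision-templated brute-force kernel sketch (hypothetical, illustrative):
// one frequency bin per thread, direct trig per point. Instantiating with
// T = float is the ~2x path discussed above, for small Nf only.
__device__ inline void sincos_t(float x, float* s, float* c)    { sincosf(x, s, c); }
__device__ inline void sincos_t(double x, double* s, double* c) { sincos(x, s, c); }

template <typename T>
__global__ void bruteforce_ls(const T* __restrict__ t, const T* __restrict__ y,
                              int N, int Nf, T w0, T dw,
                              T* __restrict__ out)  // 4 sums per bin
{
    const int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= Nf) return;
    const T w = w0 + k * dw;
    T Syc = 0, Sys = 0, Scc = 0, Ssc = 0;
    for (int j = 0; j < N; ++j) {
        T s, c;
        sincos_t(w * t[j], &s, &c);  // one trig call per (bin, point) pair
        Syc += y[j] * c;  Sys += y[j] * s;
        Scc += c * c;     Ssc += s * c;
    }
    T* o = out + 4ull * k;
    o[0] = Syc; o[1] = Sys; o[2] = Scc; o[3] = Ssc;
}
```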

Next Steps

At small Nf, it's clear that cuda and cufinufft are both not saturating the GPU. So I do still want to check that the overall picture doesn't change when computing multiple periodograms. This may require implementing something like Gowanlock+'s batch mode (one periodogram per SM), or maybe it's enough to just launch many concurrent kernels (most devices support up to 128); see the sketch below. I'll have to play around with it.
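Here's what the many-kernels option might look like on the host side (illustrative sketch, reusing the hypothetical bruteforce_ls from above):

```cuda
// One CUDA stream per object, so the small per-object grids can overlap
// on the device. All names are illustrative.
#include <cuda_runtime.h>
#include <vector>

void periodogram_batch(const std::vector<const double*>& d_t,  // device pointers
                       const std::vector<const double*>& d_y,
                       const std::vector<int>& N,
                       const std::vector<double*>& d_out,
                       int Nf, double w0, double dw)
{
    const int n_obj = static_cast<int>(d_t.size());
    std::vector<cudaStream_t> streams(n_obj);
    for (auto& s : streams) cudaStreamCreate(&s);

    const int threads = 256;
    const int blocks = (Nf + threads - 1) / threads;
    for (int i = 0; i < n_obj; ++i)
        bruteforce_ls<double><<<blocks, threads, 0, streams[i]>>>(
            d_t[i], d_y[i], N[i], Nf, w0, dw, d_out[i]);

    for (auto& s : streams) {
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }
}
```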

In terms of packaging and distribution, a single-object, cufinufft-based version wouldn't require any new compiled components. A multi-object version might require some bridge code to launch kernels on multiple streams (unless cupy can do this efficiently). Or cufinufft could be adapted upstream to accept multiple sets of NU points. Or we could limit concurrency to the same-NU-points case! Lots to think about.

ahbarnett commented 7 months ago

Very nice PNG.

For winding, what you're doing could be called "output-driven" (the Nf output bins are assigned to threads). The other extreme is "input-driven" (each thread gets one of the N input points); this is the notation of the cufinufft paper. Input-driven has the problem that each thread writes to all output points. But you could easily explore a middle ground by blocking up the input points into batches of 16, as you say, and having them add to the output array rather than overwrite it. Then you call that adder routine until all input points are processed; see the sketch below. Anyway, some middle ground may help.
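A minimal sketch of that middle ground (illustrative names, with the batch size of 16 mentioned above; shown with direct trig and one bin per thread for simplicity, though the winding update would slot into the inner loop): stage one batch of input points in shared memory, let each thread accumulate the batch against its own frequency bin, and add into the global sums rather than overwrite, so the routine can be called once per batch.

```cuda
// Batched "adder" kernel sketch: processes input points [j0, j0 + BATCH)
// and accumulates into out, which must be zeroed before the first call.
// Assumes blockDim.x >= BATCH and a grid covering all Nf bins.
constexpr int BATCH = 16;

__global__ void ls_add_batch(const double* __restrict__ t,
                             const double* __restrict__ y,
                             int j0, int N, int Nf,
                             double w0, double dw,
                             double* __restrict__ out)  // 4 sums per bin
{
    __shared__ double ts[BATCH], ys[BATCH];
    const int nbatch = min(BATCH, N - j0);
    if (threadIdx.x < nbatch) {
        ts[threadIdx.x] = t[j0 + threadIdx.x];
        ys[threadIdx.x] = y[j0 + threadIdx.x];
    }
    __syncthreads();

    const int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= Nf) return;
    const double w = w0 + k * dw;
    double Syc = 0, Sys = 0, Scc = 0, Ssc = 0;
    for (int j = 0; j < nbatch; ++j) {
        double s, c;
        sincos(w * ts[j], &s, &c);
        Syc += ys[j] * c;  Sys += ys[j] * s;
        Scc += c * c;      Ssc += s * c;
    }
    // each thread owns bin k, so plain adds (no atomics) are safe
    double* o = out + 4ull * k;
    o[0] += Syc; o[1] += Sys; o[2] += Scc; o[3] += Ssc;
}
```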

But as N grows, cufinufft will only do better relative to any variety of brute force...

Cheers, Alex