Initial phase winding implementation

lgarrison commented 2 years ago

Following up on #4, here's an initial implementation of the phase-winding idea on the CPU! The relevant function is compute_winding() in periodogram.hpp.

I'm not sure I've nailed the best implementation; I ended up needing 2 pairs per NU point, instead of the 1 pair @ahbarnett suggested.

The timings are better though:

`N=3554`, `M=10**5`, Skylake, ICC, 1 core

	Astropy	Baseline	Winding
`float64`	8.9 s	890 ms	310 ms
`float32`	4.8 s	420 ms	160 ms

./bench/bench.py will benchmark the baseline, winding baseline, and astropy versions (--help shows options), and checks that they get the same answer.

I thought about transposing the M and N loops to get to 1 pair per NU point, but it seems like that would require several M-length accmulator arrays, so I wasn't sure if that was better. I still might be missing something obvious though...

lgarrison commented 2 years ago

I had a flash of insight on a totally non-winding related question, which is why the float32 baseline is slower than the float64 baseline. sin and cos performance can be dependent on the absolute magnitude of the argument, sometimes referred to as the "fast path" and "slow path". And the slow-path cutoff for float32 is usually lower than that for float64. I think that's what we're hitting here: when I use df = 1e-4 instead of df = 1e-1, the float32 time goes from 2.3 s to 0.39 s; similar for Astropy. So it's just a question of rescaling the inputs!

I've updated the timings in the above message; it has no impact on the winding time. Now all the float32 times are ~half the float64 times, and all is right with the world.

lgarrison commented 7 months ago

I'm revisiting this after some work on jax-finufft; going to merge this in preparation for more benchmarking/a possible GPU implementation.

dfm / nufft-ls