dfm / nufft-ls

Benchmarking LS periodogram implementations
Apache License 2.0

Benchmarking N^2 CPU implementation #2

Closed: lgarrison closed this 2 years ago

lgarrison commented 2 years ago

Here's some initial work on benchmarking the CPU periodogram! I decided to hook up the C++ code via pybind11 to make running benchmarks easier; the main entry point is bench/bench.py. The core C++ computation remains in include/periodogram.cpp.
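For context, here's a minimal sketch of the kind of O(N*M) kernel being timed. This is an illustration only, not the actual contents of include/periodogram.cpp; the function name and the simplified (tau-free) power estimate are mine:

```cpp
// Hypothetical sketch of an O(N*M) Lomb-Scargle kernel, for illustration
// only (a real implementation also handles the tau phase offset and
// proper normalization).
#include <cmath>
#include <cstddef>
#include <vector>

template <typename Scalar>
std::vector<Scalar> periodogram_naive(const std::vector<Scalar>& t,
                                      const std::vector<Scalar>& y,
                                      Scalar f0, Scalar df, std::size_t M) {
    const std::size_t N = t.size();
    std::vector<Scalar> power(M);
    for (std::size_t j = 0; j < M; ++j) {
        const Scalar omega = Scalar(2 * M_PI) * (f0 + Scalar(j) * df);
        Scalar S = 0, C = 0, S2 = 0, C2 = 0;
        for (std::size_t i = 0; i < N; ++i) {
            // One sin and one cos per (frequency, sample) pair; these
            // trig calls dominate the runtime, which is why a vectorized
            // sincos (e.g. via SVML) makes such a large difference.
            const Scalar s = std::sin(omega * t[i]);
            const Scalar c = std::cos(omega * t[i]);
            S += y[i] * s;   C += y[i] * c;
            S2 += s * s;     C2 += c * c;
        }
        // Simplified power estimate:
        power[j] = Scalar(0.5) * (C * C / C2 + S * S / S2);
    }
    return power;
}
```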

My main finding so far is that the results depend on the compiler stack to an astonishing degree. Problem size and precision also affect the timing, in somewhat surprising ways. Here's just one example, using a similar config to the LS GPU paper we looked at:

N=3554, M=10**5, Skylake, 1 core

          Astropy   GCC     Clang   ICC
float64   9.0 s     29 s    34 s    890 ms
float32   16 s      6.2 s   5.7 s   2.1 s

I've investigated why ICC is so fast, and it's almost entirely due to its use of SVML and its fast sincos. I was also able to get it to run on AMD Rome nodes, so using SVML doesn't totally lock us into Intel CPUs. It's also possible to get Clang and GCC to invoke SVML, in theory (e.g. GCC's -mveclibabi=svml or Clang's -fveclib=SVML, assuming one links against the SVML library).

As to why GCC and Clang are slower than Astropy, looking at the assembly reveals that GCC is falling back to a scalar sincos, maybe due to this bug. Clang does also, but I don't know why! I played around with compiler flags to no avail, but I'm really not a Clang expert. Astropy probably benefits from Numpy using vectorized sin and cos implementations under the hood.

Astropy and ICC are both slower for float32 than float64, and not by a small amount. Go figure! Also, the float32 results have some precision issues, but I think that's partly because this is a random signal with a lot of near-zero power.

So there are clearly lots of threads still to follow up on! But the next step is probably to port this N^2 version to the GPU so we can compare with a NUFFT GPU version. I think we also talked about comparing the N^2 CPU version with a NUFFT CPU version. It may be beneficial to consider CPU multithreading at that point.

@ahbarnett, @MelodyShih for visibility

dfm commented 2 years ago

Very cool! I'm extremely surprised that the astropy implementation is so much faster, so it might be worth investigating some more? Are you sure that numpy isn't doing any parallelization? I'm not familiar with the threadpoolctl approach; I've normally used OpenMP environment variables for this.

lgarrison commented 2 years ago

I don't think Numpy can be parallelizing, for two reasons. First, I'm running the benchmarks with taskset -c 0 ./bench/bench.py, so only one core is available. Second, threadpoolctl works by calling omp_set_num_threads(1), and likewise for MKL, BLAS, or any other threading libraries it sees loaded (that's why I use threadpoolctl: Numpy might be using an MKL backend instead of OpenMP!).

I'll see if I can figure out exactly why Astropy is faster than GCC/Clang in double. I suspect it's because unvectorized, double-precision trig functions can be very costly to evaluate (more than a factor of 2 over their single-precision counterparts). I'll run some more tests...

dfm commented 2 years ago

Fascinating! I'm impressed that that makes such a difference!

lgarrison commented 2 years ago

Figured out why Astropy is so fast. In Numpy 1.22, SVML was introduced for trig functions, with 10x speedups for double-precision sin/cos: https://numpy.org/doc/stable/release/1.22.0-notes.html#vectorize-umath-module-using-avx-512

I confirmed that Numpy 1.21.5 runs the above benchmark in 57 seconds, compared with 9 seconds on 1.22. The 57 seconds is much slower than GCC and Clang, both of which also lack SVML but likely have other advantages, like being able to use sincos (which Numpy cannot).

SVML doesn't seem to affect the single-precision Astropy performance. If SVML doesn't help single-precision sin/cos, that would explain why both Astropy and ICC are slower in single precision than in double (although it would be nice to understand why SVML can't/won't do single-precision sin/cos!).

dfm commented 2 years ago

Awesome! This is great.

ahbarnett commented 2 years ago

Dear Lehman & Dan, This is a great start. SVML should naively be 2x faster for float32 than float64, since it uses SIMD... so the ICC result is backwards. For reference, here is single-core float32 FINUFFT for a single type-1 transform of the same size:

(base) /home/alex/numerics/finufft> test/finufft1d_testf 3554 1e5 1e-4
test 1d type 1: 100000 NU pts to 3554 modes in 0.0086 s    1.16e+07 NU pts/s
    one mode: rel err in F[1314] is 3.08e-05

So, 100x faster :)

(But I forget whether LSP needs two xforms...) Please check if it's 3554 modes and 1e5 NU pts (not vice versa!).

I would be interested to see how your timings compare w/ my conclusion that (cu)FINUFFT would not give much speed up when one of the dimensions is "only" O(1e3)...

Best, Alex


ahbarnett commented 2 years ago

Just for reference I dug up old email from Dec-13-2021 where I translate their GPU speeds:

Really their innovation is "batch mode", which does ~1000 LSPs with different NU pts and strengths (they call this N_o, and the most precise statement they make is that N_o ~ 1000). So let's say N_o = 1000. They use 2016-era Pascal-based GPUs. For the GP100, they do N_o double-prec LSPs from N_t = 3554 NU pts (only!) to N_f = 1e6 U freqs in 4-5 sec (reading Fig. 10, lower). That works out to 1000 x 3554 x 1e6 ≈ 3.6e12 point-pairs in ~4.5 s, i.e. about 1 tera-point-pair/sec. Quite good! (I'm assuming they use the addition formula recurrence for {sin,cos}?)

So this would be 0.005 sec for 3554 to 1e6 pts, or 0.0005 sec for 3554 to 1e5 pts (which is your test): around 2000x faster than the ICC time above.

But this reminds me of a hugely important fact: since their freq modes are equispaced, you do not need to do a sincos for each one! The addition formula should be used, turning it all into adds & mults.

Eg check my code for summing a Fourier series here (where it's called "phase winding" and done w/ complex #s): https://github.com/flatironinstitute/finufft/blob/69aff52bd5eebb28ea0521bcc303cab1e201715b/src/finufft.cpp#L203
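To make this concrete, here's a rough CPU sketch of the trick (my own naming and structure, not code from FINUFFT or the paper), assuming equispaced frequencies f_j = f0 + j*df:

```cpp
// Hypothetical sketch of the phase-winding / addition-formula trick:
// two sincos calls per sample instead of M, with the phase advanced
// across frequencies by a rotation (adds & mults only).
#include <cmath>
#include <cstddef>
#include <vector>

template <typename Scalar>
void winding_sums(const std::vector<Scalar>& t, const std::vector<Scalar>& y,
                  Scalar f0, Scalar df, std::size_t M,
                  std::vector<Scalar>& S,   // S[j] = sum_i y_i sin(2 pi f_j t_i)
                  std::vector<Scalar>& C) { // C[j] = sum_i y_i cos(2 pi f_j t_i)
    S.assign(M, 0);
    C.assign(M, 0);
    for (std::size_t i = 0; i < t.size(); ++i) {
        // The only trig calls: the starting phase and the per-frequency
        // phase increment for this sample.
        const Scalar phi0 = Scalar(2 * M_PI) * f0 * t[i];
        const Scalar dphi = Scalar(2 * M_PI) * df * t[i];
        Scalar s = std::sin(phi0), c = std::cos(phi0);
        const Scalar ds = std::sin(dphi), dc = std::cos(dphi);
        for (std::size_t j = 0; j < M; ++j) {
            S[j] += y[i] * s;
            C[j] += y[i] * c;
            // Angle-addition formula: rotate (c, s) by dphi.
            const Scalar s_next = s * dc + c * ds;
            c = c * dc - s * ds;
            s = s_next;
        }
    }
}
```

One caveat: the recurrence accumulates rounding error over M steps, so especially in float32 one would want to refresh (c, s) with a true sincos every few thousand frequencies.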

I imagine the 1 tera-pair/sec they report on the GPU is using this trick.

Could you implement it on the CPU, and report that? Thanks! Alex


ahbarnett commented 2 years ago

PS I forgot to say, please ask if you don't know what I'm talking about :)


lgarrison commented 2 years ago

Thanks, @ahbarnett! This is great, I'll see if I can implement your version without sincos. I'll let you know if I run into any problems.

The test I ran was indeed with 3554 NU pts and 10^5 modes, so the reverse of what you ran. This is the "single object" dataset used in the LSP paper (of course, they tried many N_f; I just picked one). Did you use multiple threads, by the way?

And I agree the ICC result does seem backwards. I'll see if I can figure out why...