dfm / nufft-ls

Benchmarking LS periodogram implementations
Apache License 2.0

Phase winding? #4

Open dfm opened 2 years ago

dfm commented 2 years ago

Over in #2, @ahbarnett says:

But this reminds me of a hugely important fact: Recall that since their freq modes are equispaced, you do not need to do sincos for each! The addition formula should be used, turning it all into adds & mults.

E.g., check my code for summing a Fourier series here (where it's called "phase winding" and done w/ complex #s): https://github.com/flatironinstitute/finufft/blob/69aff52bd5eebb28ea0521bcc303cab1e201715b/src/finufft.cpp#L203

I imagine the 1 tera-pair/sec they report on GPU is using this trick.

A few comments on this:

  1. I'm not sure that I totally understand the trick. Perhaps one of you (@lgarrison, @ahbarnett) could explain it to me?
  2. Gowanlock+ don't seem to use any tricks like this. They say that their implementation is directly ported from scipy, which is implemented here: https://github.com/scipy/scipy/blob/b9881c85530f42c19c97ed38c052af8100caf250/scipy/signal/_spectral.pyx#L20-L97. It looks pretty much identical to our baseline, unless I'm missing something (a minimal sketch of that baseline loop is below).
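For reference, here is roughly what that per-frequency inner loop looks like, transcribed into C++ (a simplified sketch of scipy's `lombscargle`; names are illustrative and may not match our baseline exactly):

```c++
#include <cmath>
#include <vector>

// Classic Lomb-Scargle: one sin/cos pair per (frequency, data point) pair.
std::vector<double> lombscargle(const std::vector<double>& t,
                                const std::vector<double>& y,
                                const std::vector<double>& freqs) {
    std::vector<double> power(freqs.size());
    for (size_t i = 0; i < freqs.size(); ++i) {
        double xc = 0, xs = 0, cc = 0, ss = 0, cs = 0;
        for (size_t j = 0; j < t.size(); ++j) {
            const double c = std::cos(freqs[i] * t[j]);  // <- the expensive
            const double s = std::sin(freqs[i] * t[j]);  //    sincos calls
            xc += y[j] * c;  xs += y[j] * s;
            cc += c * c;     ss += s * s;    cs += c * s;
        }
        const double tau = std::atan2(2 * cs, cc - ss) / (2 * freqs[i]);
        const double ct  = std::cos(freqs[i] * tau), st = std::sin(freqs[i] * tau);
        const double c2  = ct * ct, s2 = st * st, cs2 = 2 * ct * st;
        const double a   = ct * xc + st * xs, b = ct * xs - st * xc;
        power[i] = 0.5 * (a * a / (c2 * cc + cs2 * cs + s2 * ss) +
                          b * b / (c2 * ss - cs2 * cs + s2 * cc));
    }
    return power;
}
```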
lgarrison commented 2 years ago

I haven't sat down to try to understand the trick yet, but I did notice that the Gowanlock+ paper appears to address it (last paragraph of Section 4.5, page 8):

The sequential LSP algorithm iterates over frequencies, which means that the calculation of sin and cos can be simply computed using the ∆f between frequency n and n + 1, thus eliminating costly sin and cos calculations (i.e., the sincos at iteration n + 1 in the loop in Listing 1 could be expressed as the difference between the sincos at iteration n). Unfortunately, this optimization eliminates the possibility of parallelization, as it introduces inter-iteration dependencies. Therefore, we do not employ this optimization in lsp-c, as we would not be able to execute the algorithm in parallel.

lgarrison commented 2 years ago

I think I understand the trick now, which is an application of the angle addition formula. If we want to compute sin(f + df), we have:

sin(f + df) = sin(f) cos(df) + cos(f) sin(df).

(sin|cos)(df) are constant, and (sin|cos)(f) can just come from the previous iteration. df doesn't have to be small; it's just the frequency grid spacing.
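In code, the whole sweep over the frequency grid then costs one real sincos up front plus two multiply-adds per step (a minimal sketch; the accumulation is left as a comment):

```c++
#include <cmath>

// Phase-winding sweep: sin/cos on an equispaced grid f0, f0+df, f0+2*df, ...
void sweep(double f0, double df, int nfreq) {
    double c = std::cos(f0), s = std::sin(f0);          // the only trig calls
    const double cd = std::cos(df), sd = std::sin(df);  // constants for the sweep
    for (int n = 0; n < nfreq; ++n) {
        // ... accumulate the periodogram sums with (c, s) here ...
        const double cn = c * cd - s * sd;  // cos(f + df)
        const double sn = s * cd + c * sd;  // sin(f + df)
        c = cn;
        s = sn;
    }
}
```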

So it's true that this algorithm doesn't naively parallelize. But I think there's an opportunity for a hybrid approach, where each GPU thread does O(10) frequency bins, instead of 1. Each thread starts with a real sin/cos call, but then the subsequent 9 iterations use the angle-addition formula.

We don't want each GPU thread doing too many frequency bins, because that exposes less overall parallelism. But the per-thread bin count can be a free parameter based on the total number of frequency bins. Could be a big speedup!
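A CPU sketch of that hybrid with OpenMP (`CHUNK` and all names here are placeholders; on a GPU each chunk would map to one thread, and the phase step per nonuniform point t is df * t):

```c++
#include <algorithm>
#include <cmath>

constexpr int CHUNK = 10;  // free parameter: frequencies per thread/chunk

// Hybrid sweep for one nonuniform point t: one real sincos per chunk,
// then the angle-addition recurrence for the rest of the chunk.
void hybrid_sweep(double f0, double df, int nfreq, double t) {
    const double cd = std::cos(df * t), sd = std::sin(df * t);
    #pragma omp parallel for
    for (int c0 = 0; c0 < nfreq; c0 += CHUNK) {
        double c = std::cos((f0 + c0 * df) * t);  // one real sincos
        double s = std::sin((f0 + c0 * df) * t);  // at the chunk start
        const int end = std::min(c0 + CHUNK, nfreq);
        for (int n = c0; n < end; ++n) {
            // ... accumulate the periodogram sums with (c, s) here ...
            const double cn = c * cd - s * sd, sn = s * cd + c * sd;
            c = cn;
            s = sn;
        }
    }
}
```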

ahbarnett commented 2 years ago

Hi Lehman, Re addition formula - bingo. (2x2 matrix mult is another way to see it. Complex arith is not needed.) There is one "phase" (cos(df), sin(df) pair) needed to be stored per NU pt.
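Spelled out, the 2x2 view is just a rotation of the (cos, sin) pair by df:

$$
\begin{pmatrix} \cos(f + df) \\ \sin(f + df) \end{pmatrix}
=
\begin{pmatrix} \cos df & -\sin df \\ \sin df & \cos df \end{pmatrix}
\begin{pmatrix} \cos f \\ \sin f \end{pmatrix}.
$$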

Nice to see you come up with this hybrid, which is exactly what I was thinking when I read the "does not parallelize" claim in your previous msg :) It will destroy the cost of sincos and be as parallel as you want. I suggest you make each thread do more than 10... more like 1e2-1e3? Or (Nmodes * NUpts) / Nthreads, which would be the ideal, right?

But since you can parallelize over NU pts (of which they have ~4k), their "does not par" claim makes little sense to me.

An OpenMP CPU test first would be great. How soon could you get a GPU version going?

However, before we get too excited re GPU beating theirs, I first suggest figuring out the rate at which a simple FMA can beat a sincos... it might be that making that swap does not actually speed up the GPU that much (i.e., can a GPU do 1 Tflop of plain FMAs, which is the pair rate for sincos they seem to claim?).
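One rough way to sanity-check that on a CPU (a sketch only; note the recurrence loop carries a serial dependency, so this measures the latency chain rather than peak FMA throughput, and a vectorized or GPU version would behave differently):

```c++
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    const long N = 100000000;
    const double df = 1e-4;
    const double cd = std::cos(df), sd = std::sin(df);
    double c = 1.0, s = 0.0, acc = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (long n = 0; n < N; ++n) {  // recurrence: adds & mults only
        const double cn = c * cd - s * sd, sn = s * cd + c * sd;
        c = cn;
        s = sn;
        acc += c;  // keep the loop from being optimized away
    }
    auto t1 = std::chrono::steady_clock::now();
    for (long n = 0; n < N; ++n)    // direct evaluation: sincos every step
        acc += std::sin(n * df) + std::cos(n * df);
    auto t2 = std::chrono::steady_clock::now();

    const auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    std::printf("recurrence: %.0f ms   sincos: %.0f ms   (acc=%g)\n",
                ms(t0, t1), ms(t1, t2), acc);
    return 0;
}
```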

Cheers, Alex
