calyxir / calyx

Intermediate Language (IL) for Hardware Accelerator Generators
https://calyxir.org
MIT License

[NTT] Explore faster NTT implementations. #302

Closed cgyurgyik closed 3 years ago

cgyurgyik commented 3 years ago

A more thorough background is provided by @sampsyo in #277.

sampsyo commented 3 years ago

The next step along those lines would be to try implementing one of the better (non-naive) NTT implementations, such as Cooley–Tukey, using Dahlia. That opens a few interesting possibilities:

cgyurgyik commented 3 years ago

https://github.com/cornell-brg/pymtl3-fft/blob/sj634-poly-dft-xcel/nttlib/arith/ntt_play.py This link doesn't work for me. Am I missing something?

sampsyo commented 3 years ago

Nope—just missing access to @jsn1993's repository. :smiley: If you're interested, we can ask him to give you access! (@yoonachang and I currently do.)

cgyurgyik commented 3 years ago

Sure. I find this stuff pretty interesting, though I confess some of the math behind it is hard for me to digest in its entirety.

https://github.com/sampsyo/nttstuff/pull/1 In the PR above, I've added another algorithm (radix-2 Cooley–Tukey).

Perhaps there's a better/faster one that we want to look at pushing into FuTIL?

One note on the radix-2 algorithm: it requires variable loop bounds, e.g.

for (let i = 0..M) { ... }
...
M := M * 2;

I'm not sure whether there's a reason variable loop bounds are unsupported in Dahlia. Is this a design decision, or something we just haven't gotten around to yet?
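For reference, here's the loop structure in question written out in software: a Python sketch of the iterative radix-2 Cooley–Tukey NTT (the function name, helper structure, and example values are mine, not from the repo), where the span m doubles on every pass, which is exactly the variable loop bound Dahlia would need to handle:

```python
def ntt_radix2(a, p, root):
    """Iterative radix-2 Cooley-Tukey NTT over Z_p (decimation-in-time).
    `root` is assumed to be a primitive len(a)-th root of unity mod p."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    # Bit-reversal permutation so the in-place butterflies land correctly.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    m = 2  # the "variable loop bound": doubles each pass
    while m <= n:
        wm = pow(root, n // m, p)       # twiddle step for this pass
        for k in range(0, n, m):
            w = 1
            for i in range(m // 2):
                t = (w * a[k + i + m // 2]) % p
                u = a[k + i]
                a[k + i] = (u + t) % p              # butterfly top
                a[k + i + m // 2] = (u - t) % p     # butterfly bottom
                w = (w * wm) % p
        m *= 2
    return a

# Example: mod 17, the element 4 is a primitive 4th root of unity,
# so the NTT of an impulse is the all-ones vector.
print(ntt_radix2([1, 0, 0, 0], 17, 4))  # [1, 1, 1, 1]
```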

group cond {
  lt.left = counter.out;
  lt.right = 32'd1; // Couldn't this be a register value as well?
  cond[done] = 1'd1;
}
sampsyo commented 3 years ago

Great. And yes, TBH most of the math is unintelligible to me and I'm reproducing the algorithms "phonetically."

For the radix-2 algorithm's dynamic loop bounds: probably the right thing to do in Dahlia is use a while loop. The reason for loops need static loop bounds is that we want to reason about parallelism in these loops, and more relevantly here to guarantee that a loop has "simple" control hardware, even when unrolled. That's impossible, of course, if we can't prove that the unrolling factor divides the loop bound (i.e., if the bound is dynamic).
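To make the unrolling argument concrete, here's a software analogue (a hypothetical Python sketch, not Dahlia): a 2-way "unrolled" loop body only tiles cleanly when the unroll factor divides the static bound, which is what the compiler can no longer prove once the bound is dynamic:

```python
def unroll_by_2(n):
    """Software analogue of 2-way loop unrolling: each step does two
    iterations' worth of work, which only covers the range exactly
    when 2 divides the (static) bound n."""
    assert n % 2 == 0, "unroll factor must divide the loop bound"
    total = 0
    for i in range(0, n, 2):
        total += i        # unrolled copy 0
        total += i + 1    # unrolled copy 1
    return total

print(unroll_by_2(8))  # same as sum(range(8)) == 28
```

With a dynamic bound, the divisibility check moves from compile time to run time, which is why a while loop (with its fully dynamic control) is the honest fallback.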

So by way of answer also to your question about finding a high-performance algorithm to implement: I think the hard part (after getting a basic/non-parallelized Cooley–Tukey working) is to look into what a good hardware implementation of Cooley–Tukey usually entails. There are these things called "butterfly units," for example. I don't know if they will be easy/possible to express in Dahlia. Hence the possible need for either extensions to Dahlia or bypassing Dahlia altogether in favor of generating Calyx directly.

cgyurgyik commented 3 years ago

I don't think Dahlia while loops currently lower to FuTIL. It sounds like the next step is to take the radix-2 DIT algorithm and write it in Dahlia in any case, so I'll focus on that.

OK, that makes more sense. I wasn't aware of butterfly units; I did some Wikipedia reading just now.

For a prime number P and array of twiddle factors omegas, we have:

y0 = (x0 + x1 * omegas[g]) % P
y1 = (x0 - x1 * omegas[g]) % P

Which, at the FuTIL level might look something like:

x0 = prim std_reg(32);
x1 = prim std_reg(32);
y0 = prim std_reg(32);
y1 = prim std_reg(32);
add = prim std_add(32);
sub = prim std_sub(32);
...
add.left = x0.out;
add.right = x1.out; // Still need to multiply by omegas[g].
y0.in = add.out;    // Still need to apply modulus P.

sub.left = x0.out;
sub.right = x1.out; // Still need to multiply by omegas[g].
y1.in = sub.out;    // Still need to apply modulus P.

Perhaps we should black-box a butterfly unit that takes as inputs wires for x0, x1, y0, y1, some prime P, and the necessary twiddle factor omegas[g]? Just throwing out some initial thoughts here, but perhaps you've already got something in mind.

This also has some useful diagrams: DFT and FFT Overview
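As a sanity check on the two equations above, here's a minimal Python reference model of the radix-2 butterfly (the prime 97 and the operand values are illustrative choices, not from the thread):

```python
def butterfly(x0, x1, w, p):
    """Radix-2 NTT butterfly over Z_p:
    y0 = (x0 + x1*w) mod p, y1 = (x0 - x1*w) mod p."""
    t = (x1 * w) % p
    return (x0 + t) % p, (x0 - t) % p

# Example with the (arbitrary) prime p = 97 and twiddle w = 2:
y0, y1 = butterfly(3, 5, 2, 97)
print(y0, y1)  # 13 90
```

A hardware black box would expose the same interface: two data inputs, one twiddle input, the modulus, and two data outputs.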

sampsyo commented 3 years ago

Got it, that's cool! No, I have nothing currently in mind. But I do think it would be fun to first try writing the (slow, serial) Dahlia for the Cooley–Tukey algorithm, and then we can use an empirical approach to iteratively making it better with such optimizations.

rachitnigam commented 3 years ago

Cooley-Tukey implemented in Dahlia in #308.

cgyurgyik commented 3 years ago

So the next step is to write benchmarks for NTT. I've never written benchmarks for FuTIL, so perhaps someone can point me in the right direction.

rachitnigam commented 3 years ago

Not quite sure what you mean by "benchmark for NTT". What do you want to measure?

If you want to measure the number of cycles used in the two implementations, you can run a jq query on the output of "fud e --to vcd_json".

See the ".jq" file for the systolic array correctness test.

cgyurgyik commented 3 years ago

Good question. @sampsyo mentioned comparing performance with Shunning's PyMTL implementation, so I think he has something specific in mind.

sampsyo commented 3 years ago

Good point! Cycle count is a good place to start, but the other important metrics are area and maximum frequency (both of which require going through a real synthesis tool, i.e., either Vivado or some ASIC toolchain). But a first "baby step" would be just comparing the cycle counts for the naive and sequential Cooley-Tukey implementations.

As far as comparing with Shunning's implementation, I think the next step would be to email Shunning! We can see if he has his PyMTL implementation available somewhere, and then maybe we can put both his code and your code through the same synthesis toolchains (perhaps Vivado, for starters, since fud already supports it built in).

A semi-parallel thread to explore would be attempting to parallelize the CT implementation, somehow, to reduce its cycle count.

jsn1993 commented 3 years ago

The cycle count of my implementation is quite easy to calculate. For an N-point FFT, it takes log2(N) passes, and k + N/P cycles per pass, to compute the results, assuming the SRAMs are already loaded with data. Here k is the number of pipeline stages in the butterfly unit and P is the number of PEs.
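That estimate is easy to encode as a quick model (the example parameter values below are hypothetical, just to exercise the formula):

```python
import math

def ct_cycles(n, k, p):
    """Cycle estimate from the comment above: log2(n) passes,
    each taking k + n/p cycles, where k is the butterfly pipeline
    depth and p is the number of PEs."""
    assert n & (n - 1) == 0, "n must be a power of two"
    return int(math.log2(n)) * (k + n // p)

# e.g. a 1024-point NTT with a 4-stage butterfly and 8 PEs:
print(ct_cycles(1024, k=4, p=8))  # 10 * (4 + 128) = 1320
```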


sampsyo commented 3 years ago

Awesome; thanks @jsn1993!

rachitnigam commented 3 years ago

@cgyurgyik has this been resolved? I remember that we've gotten some benchmarking for NTT stuff already working with the new frontend.

cgyurgyik commented 3 years ago

I believe so. The pipeline setup that we are currently using is as fast as we can get in terms of parallelism.