Make the NTT example actually produce the right answer

sampsyo commented 4 years ago

OK, so this is way too long in coming, but here are my notes on what I have learned about NTT. For @yoonachang!

Background

The number theoretic transform (NTT) is a variant of the fast Fourier transform (FFT) that works on integers and is, we are told, central to the implementation of modern fully homomorphic encryption (FHE) schemes. We want to explore efficient FuTIL-generated NTT implementations as a first step toward doing all manner of cool FHE hardware stuff.

In trying to figure out what the NTT does, I wrote nttstuff, which is just some Python infrastructure around a simple reference implementation (and an invocation of another reference implementation from the sympy library). The basic (unoptimized) algorithm looks like this:

for i in range(N):
    for j in range(N):
        ret[i] = (ret[i] + inp[j] * omegas[(i * j) % N]) % P

where inp is an array of input integers to be transformed, ret is the transformed result array of integers, and N is the length of the input (and the output). The interesting arguments are P and omegas, which are important parameters to the NTT that bear further consideration.

P is some prime number equal to $kN+1$ for some $k$. Dirichlet's theorem apparently tells us that some such prime number must exist, for some $k$, for any N.
Next, we need a parameter called ω.
- To get this, we first get g, which is something called a "generator" or a "primitive root of unity" for the prime P. These are some fascinating number theory concepts that I admit I do not really understand at all. But anyway, some number theory stuff tells us that something called a "primitive root of unity" must exist for our prime P.
- Then, our parameter ω is defined as $g^k mod N$, where N is our aforementioned input size and $k$ is the $k$ from above we used to find the prime P.
Finally, we want to generate an entire array called omegas based on ω. This is basically a pre-computation step to save time: it's possible to compute this stuff on the fly, but it's faster to pre-compute N values based on ω and to use those during the actual NTT computation. Anyway, you can see my gen_omegas function in nttstuff for how this array is computed.

So that's everything that goes into this simple NTT implementation! With all that table-setting, it's possible to transform a given input array of N elements.

Algorithms

There are many strategies for implementing the NTT. To learn about efficient implementations, it is generally more helpful to google for information about FFT algorithms, which are way more popular but translate fairly directly to the NTT variant. For example, the Cooley–Tukey FFT is probably the most popular efficient algorithm; one can translate the same strategy to NTT.

Shunning has put together an incredibly helpful compendium of NTT implementations. For example, there is a naive 2-nested loop, a Cooley–Tukey, a Gentleman–Sande, and corresponding inverse implementations.

Current Status & Next Steps

In #256, I have started an implementation of the naive NTT (and an input dataset) with the very small size N=32. I think it's correct but something is wrong—the output is incorrect when you run it in FuTIL. The first step here is to figure out why and make the naive implementation work. Then we can move on to implementing efficient versions.

You can run the (incorrect) FuTIL accelerator like this (on the ntt-new branch):

$ fud exec examples/dahlia/ntt-naive.fuse --to dat -s verilog.data examples/data/ntt.data

The output array, ret0, contains just a bunch of 7s. To get the right answer, use my nttstuff, like this:

$ python nttstuff.py -doi ../futil/examples/data/ntt.data

For which, suspiciously, the first output element is 7, but the rest is (of course) not. So it seems like the bug is about somehow preserving the first output value forever, spreading it across the entire output array?

rachitnigam commented 3 years ago

I merged master (https://github.com/cucapra/futil/commits/master) into ntt-new and it produces the right answers. I diffd the outputs from nttstuff and the branch.

sampsyo commented 3 years ago

Oh wow, that's awesome!! I'm glad this is somehow working—I wonder what bug fix along the way is to blame.

The next step in this line of work, I suppose, is then to try to getting an actually good NTT implementation working to complement this very very slow one.

cgyurgyik commented 3 years ago

Perhaps we can re-open this, or make a new issue? There's a lot of good information in here, and I see myself poking around at this when I have time.

rachitnigam commented 3 years ago

Not sure I follow @cgyurgyik. The issue has been fixed already.

sampsyo commented 3 years ago

I think what @cgyurgyik is getting at is that the title may be oversimplifying what is in the long issue body I wrote, which sort of lays out a map for a larger "make FuTIL into a great substrate for NTT" would look like instead of "fix my bug in this code."

calyxir / calyx