Closed sampsyo closed 3 years ago
I merged master (https://github.com/cucapra/futil/commits/master) into ntt-new
and it produces the right answers. I diff
d the outputs from nttstuff and the branch.
Oh wow, that's awesome!! I'm glad this is somehow working—I wonder what bug fix along the way is to blame.
The next step in this line of work, I suppose, is then to try to getting an actually good NTT implementation working to complement this very very slow one.
Perhaps we can re-open this, or make a new issue? There's a lot of good information in here, and I see myself poking around at this when I have time.
Not sure I follow @cgyurgyik. The issue has been fixed already.
I think what @cgyurgyik is getting at is that the title may be oversimplifying what is in the long issue body I wrote, which sort of lays out a map for a larger "make FuTIL into a great substrate for NTT" would look like instead of "fix my bug in this code."
OK, so this is way too long in coming, but here are my notes on what I have learned about NTT. For @yoonachang!
Background
The number theoretic transform (NTT) is a variant of the fast Fourier transform (FFT) that works on integers and is, we are told, central to the implementation of modern fully homomorphic encryption (FHE) schemes. We want to explore efficient FuTIL-generated NTT implementations as a first step toward doing all manner of cool FHE hardware stuff.
In trying to figure out what the NTT does, I wrote nttstuff, which is just some Python infrastructure around a simple reference implementation (and an invocation of another reference implementation from the sympy library). The basic (unoptimized) algorithm looks like this:
where
inp
is an array of input integers to be transformed,ret
is the transformed result array of integers, andN
is the length of the input (and the output). The interesting arguments areP
andomegas
, which are important parameters to the NTT that bear further consideration.P
is some prime number equal to $kN+1$ for some $k$. Dirichlet's theorem apparently tells us that some such prime number must exist, for some $k$, for anyN
.g
, which is something called a "generator" or a "primitive root of unity" for the primeP
. These are some fascinating number theory concepts that I admit I do not really understand at all. But anyway, some number theory stuff tells us that something called a "primitive root of unity" must exist for our primeP
.N
is our aforementioned input size and $k$ is the $k$ from above we used to find the primeP
.omegas
based on ω. This is basically a pre-computation step to save time: it's possible to compute this stuff on the fly, but it's faster to pre-computeN
values based on ω and to use those during the actual NTT computation. Anyway, you can see mygen_omegas
function in nttstuff for how this array is computed.So that's everything that goes into this simple NTT implementation! With all that table-setting, it's possible to transform a given input array of
N
elements.Algorithms
There are many strategies for implementing the NTT. To learn about efficient implementations, it is generally more helpful to google for information about FFT algorithms, which are way more popular but translate fairly directly to the NTT variant. For example, the Cooley–Tukey FFT is probably the most popular efficient algorithm; one can translate the same strategy to NTT.
Shunning has put together an incredibly helpful compendium of NTT implementations. For example, there is a naive 2-nested loop, a Cooley–Tukey, a Gentleman–Sande, and corresponding inverse implementations.
Current Status & Next Steps
In #256, I have started an implementation of the naive NTT (and an input dataset) with the very small size N=32. I think it's correct but something is wrong—the output is incorrect when you run it in FuTIL. The first step here is to figure out why and make the naive implementation work. Then we can move on to implementing efficient versions.
You can run the (incorrect) FuTIL accelerator like this (on the ntt-new branch):
The output array,
ret0
, contains just a bunch of 7s. To get the right answer, use my nttstuff, like this:For which, suspiciously, the first output element is 7, but the rest is (of course) not. So it seems like the bug is about somehow preserving the first output value forever, spreading it across the entire output array?