Pratyush opened this issue 4 years ago
A relevant data point is how FFTW does this. They have the caller create a "plan" for FFTs based on the domain size, which then AFAIU does precompilation and preprocessing for the exact domains and FFT plan they end up using. http://www.fftw.org/fftw-paper.pdf
That sounds like a good idea to me.
- Even after fixing the above, there's still the issue that we only select the larger radix after exhausting all powers in radix-2. As suggested in scipr-lab/zexe#156 (comment), this can result in poor "resizing behaviour", which is demonstrated via the following example:
Required size: 10
GeneralEvaluationDomain: domain_size = 2^4 = 16
DynamicEvaluationDomain: domain_size = 2 * 5 = 10
One question regarding this part: I would imagine that a pure radix-2 FFT is more efficient than a mixed-radix one. So, when the achievable domain sizes are close, a slightly larger radix-2 FFT might be preferable to a mixed-radix one, no? I haven't had a look at the FFTW paper yet, but perhaps they do have some insights into the trade-offs here.
My five cents on the smoothness of the FFT domain and its approximation goodness for target sizes x from a given interval I = [x_min, x_max].

Given an FFT domain of mixed order n = p_1^e_1 * ... * p_k^e_k, with p_i being the prime factors and e_i their exponents, let d(x, L) = min_{y in L, y >= x} y/x be the minimal 'distance' of x to numbers from the discrete set L of all possible subdomain sizes y = p_1^{y_1} * ... * p_k^{y_k} with y_i from {0, 1, ..., e_i}. In our case the number of such combinations is small enough to find the best approximation via enumeration, and even plots of d(x, L) varying over x are easily produced to get an impression of the uniform approximation goodness delta = max_{x in I} d(x, L).

In general, given I, one has to oversize the FFT domain to achieve a reasonably small approximation goodness (corresponding to a relative error < 5%, say), and the more prime factors n has, the better. However, overly large p_i increase the computational complexity of the mixed-radix FFT, which is

T(n) = n * Sum_i e_i * (p_i - 1) * (M + A),

where M and A stand for a field multiplication and addition, respectively. In practice the increase is acceptable (typically a factor of 2 relative to the dyadic case, see the table below) and is outweighed by the benefit of a small delta.
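The enumeration described above can be sketched directly. This is a toy sketch under assumed parameters (a small domain order and interval chosen for illustration), not the script actually used:

```rust
// Sketch: compute d(x, L) and the uniform goodness
// delta = max_{x in I} d(x, L) by enumerating all subdomain sizes.

/// All subdomain sizes y = p_1^{y_1} * ... * p_k^{y_k} with 0 <= y_i <= e_i.
fn subdomain_sizes(factors: &[(u64, u32)]) -> Vec<u64> {
    let mut sizes = vec![1u64];
    for &(p, e) in factors {
        let mut next = Vec::new();
        for &s in &sizes {
            let mut pe = 1u64;
            for _ in 0..=e {
                next.push(s * pe);
                pe *= p;
            }
        }
        sizes = next;
    }
    sizes
}

/// d(x, L): smallest y/x over all y in L with y >= x.
fn d(x: u64, sizes: &[u64]) -> f64 {
    sizes
        .iter()
        .filter(|&&y| y >= x)
        .map(|&y| y as f64 / x as f64)
        .fold(f64::INFINITY, f64::min)
}

fn main() {
    // Toy domain of order n = 2^10 * 3^2 * 5, interval I = [100, 1000].
    let sizes = subdomain_sizes(&[(2, 10), (3, 2), (5, 1)]);
    let delta = (100u64..=1000).map(|x| d(x, &sizes)).fold(0.0f64, f64::max);
    println!("delta over [100, 1000]: {:.4}", delta);
}
```

Since 2^10 alone already covers the interval, delta here is bounded below 2; the extra factors 3^2 and 5 fill in intermediate sizes and push it much closer to 1.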
In connection with the task of searching for suitable BLS24 parameters (with group sizes of more than 256 bits), I ran a script over a slightly larger set than the one of Pratyush (starting from parameter x = 2^32). I filtered for those cases that allow sufficiently smooth FFT domains in both the base field and the scalar field (having both log2(n_r) and log2(n_q) >= 30, with prime factors <= 17). I further computed the goodnesses delta_r and delta_q for the interval I = [2^10, 2^25], using an explicit algorithm to solve the discrete max-min problem, and determined the performances T_r, T_q of their FFTs relative to the equivalent dyadic case T(dyadic) = T(2^(log2 n)) using the above formula. These are the results:
| x | bits | log2(n_r) | log2(n_q) | delta_r | delta_q | T(n_r)/T(dyadic) | T(n_q)/T(dyadic) |
|---|---|---|---|---|---|---|---|
| -4346495999 | 257 | 41.8051 | 32.0172 | 1.00689 | 1.02262 | 2.00932 | 1.68659 |
| -4505419775 | 257 | 43.3203 | 31.2246 | 1.02218 | 1.02602 | 1.66204 | 1.98561 |
| -4518481967 | 257 | 30.9689 | 31.2908 | 1.02354 | 1.02963 | 1.74369 | 1.78966 |
| 4796508718 | 258 | 32.5132 | 31.1912 | 1.01425 | 1.02354 | 1.84541 | 1.82744 |
| -4975851464 | 258 | 36.664 | 30.6266 | 1.0104 | 1.05488 | 1.85468 | 2.12234 |
| -5032487474 | 258 | 48.4004 | 34.2411 | 1.01089 | 1.02001 | 2.0661 | 1.51864 |
| -5131697570 | 259 | 39.3547 | 30.067 | 1.00782 | 1.04167 | 2.28689 | 2.46117 |
| -5155487999 | 259 | 34.2635 | 32.2635 | 1.0111 | 1.02262 | 1.80951 | 1.85969 |
| 5372915626 | 259 | 41.4331 | 32.8062 | 1.01169 | 1.0242 | 2.26872 | 2.31663 |
| -5482229024 | 259 | 49.8397 | 31.5071 | 1.01705 | 1.03822 | 2.00643 | 2.25346 |
| -5712214364 | 260 | 32.277 | 32.1718 | 1.02262 | 1.02392 | 1.85891 | 2.33123 |
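The cost ratios in the last two columns follow directly from the formula T(n) = n * Sum_i e_i * (p_i - 1) * (M + A) against the dyadic reference T(2^(log2 n)) = n * log2(n) * (M + A). A sketch of that calculation, on a toy factorization rather than the actual n_r/n_q values (whose factorizations are not reproduced here):

```rust
// Sketch: relative mixed-radix FFT cost
//   T(n) / T(dyadic) = (Sum_i e_i * (p_i - 1)) / log2(n),
// following the operation-count formula from the comment above.

fn cost_ratio(factors: &[(u64, u32)]) -> f64 {
    // n = p_1^e_1 * ... * p_k^e_k
    let n: f64 = factors
        .iter()
        .map(|&(p, e)| (p as f64).powi(e as i32))
        .product();
    // Per-element butterfly operations: Sum_i e_i * (p_i - 1)
    let ops: f64 = factors
        .iter()
        .map(|&(p, e)| e as f64 * (p - 1) as f64)
        .sum();
    ops / n.log2()
}

fn main() {
    // Toy order n = 2^10 * 3^2 * 5: larger prime factors raise the ratio,
    // a pure power of two gives exactly 1.0.
    println!("T(n)/T(dyadic) = {:.3}", cost_ratio(&[(2, 10), (3, 2), (5, 1)]));
    println!("T(2^20)/T(dyadic) = {:.3}", cost_ratio(&[(2, 20)]));
}
```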
Thanks for the detailed analysis! It's a good point that we can easily compute the best fit for a given problem size. If I'm understanding correctly, the takeaway is that these FFTs are at most 2x slower, right?
As an aside, there is a slight downside to considering fields of size > 255 bits: you have to add another limb to all calculations on the scalar field, which includes constraint generation, witness generation, FFTs, etc. I don't know of a way to simplify this, beyond making a different BigInteger type (for example: BigInt264([u64; 4], u8), and even then, I don't know if it would make a concrete difference due to alignment issues). (I'm assuming that these curves still have base fields < 320 bits, because otherwise one would lose the benefit of using BLS24 curves.)
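The BigInt264 idea mentioned above could look roughly like this. Entirely hypothetical (the name and layout come from the comment, not from any existing library), and it sidesteps the alignment question the comment raises:

```rust
// Hypothetical sketch of a 264-bit integer: four 64-bit limbs plus a single
// extra byte for bits 256..264, instead of padding to a full fifth u64 limb.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct BigInt264 {
    limbs: [u64; 4],
    top: u8, // bits 256..264
}

impl BigInt264 {
    /// Wrapping addition with carry propagation through the limbs
    /// into the top byte.
    fn add(&self, other: &Self) -> Self {
        let mut limbs = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            let (s1, c1) = self.limbs[i].overflowing_add(other.limbs[i]);
            let (s2, c2) = s1.overflowing_add(carry);
            limbs[i] = s2;
            carry = (c1 as u64) + (c2 as u64);
        }
        let top = self.top.wrapping_add(other.top).wrapping_add(carry as u8);
        Self { limbs, top }
    }
}

fn main() {
    let a = BigInt264 { limbs: [u64::MAX, 0, 0, 0], top: 0 };
    let b = BigInt264 { limbs: [1, 0, 0, 0], top: 0 };
    // The carry out of limb 0 ripples into limb 1.
    println!("{:?}", a.add(&b));
}
```

Note that `(u64; 4], u8)` occupies 33 bytes of payload but will typically be padded to 40 bytes anyway for 8-byte alignment, which is exactly the concern raised above.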
Yes, if one wants such a small approximation goodness (2% relative error) for that instance interval (with upper bound 2^25), one has to expect a roughly twice as slow FFT. Actually, it turned out that my script started at too high an x, searching only curves of at least 257 bits. I have now restarted it with a modified range for x and will post my findings in a few days.
About the sizes of the base field modulus: the first three rows in the table have a 320-bit modulus, the others 321-323 bits.
First results from the running script:
In particular, the last one, with smoothness bound 11 for both domains, is interesting. Its approximation goodness comes with a relative error < 1.7%, and the FFT slowdown is only about a factor of 1.5.
The current integration of mixed-radix FFTs is kind of unsatisfactory. There are two issues:
To resolve this, we might want to implement a "dynamic" domain that automatically chooses a subgroup whose size matches the required input size as closely as possible. This is different from the behaviour of something like GeneralEvaluationDomain, which tries to fit things into radix-2 if possible, and only goes for larger domain sizes if radix-2 is insufficient. This would also unify the existing Radix2Domain, MixedRadixDomain, and GeneralEvaluationDomain impls.

cc @weikengchen @paberr