Pratyush opened this issue 4 years ago
A relevant data point is how FFTW does this. They have the caller create a "plan" for FFTs based on the domain size, which then AFAIU does precompilation and preprocessing for the exact domains and FFT plan they end up using. http://www.fftw.org/fftw-paper.pdf
That sounds like a good idea to me.
- Even after fixing the above, there's still the issue that we only select the larger radix after exhausting all powers in radix-2. As suggested in scipr-lab/zexe#156 (comment), this can result in poor "resizing behaviour", which is demonstrated via the following example:
Required size: 10
GeneralEvaluationDomain: domain_size = 2^4 = 16
DynamicEvaluationDomain: domain_size = 2 * 5 = 10
One question regarding this part: I would imagine that a pure radix-2 FFT is more efficient than a mixed-radix one. So, when the achievable domain sizes are close, a slightly larger radix-2 FFT might be preferable to a mixed-radix one, no? I haven't had a look at the FFTW paper yet, but perhaps they do have some insights into the trade-offs here.
My five cents on the smoothness of the FFT domain and its approximation goodness for target sizes x from a given interval I = [x_min, x_max].

Given an FFT domain of mixed order n = p_1^e_1 * ... * p_k^e_k, with p_i being the prime factors and e_i their exponents, let d(x, L) = min_{y in L, y >= x} y/x be the minimal 'distance' of x to numbers from the discrete set L of all possible subdomain sizes y = p_1^{y_1} * ... * p_k^{y_k} with y_i from {0, 1, ..., e_i}. In our case the number of such combinations is small enough to find the best approximation via enumeration, and even plots of d(x, L) varying over x are easily produced to get an impression of the uniform approximation goodness delta = max_{x in I} d(x, L).

In general, given I, one has to oversize the FFT domain to achieve a reasonably small approximation goodness (corresponding to a relative error < 5%, say), and the more prime factors n has, the better. However, overly large p_i increase the computational complexity of the mixed-radix FFT, which is

T(n) = n * Sum_i e_i * (p_i - 1) * (M + A),

where M and A stand for a field multiplication and addition, respectively. In practice the increase is acceptable (typically a factor of 2 relative to the dyadic case, see the table below) and is outweighed by the benefit of a small delta.
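The enumeration described above can be sketched directly. This is a toy sketch under assumed parameters (a small domain order and interval chosen for illustration), not the script actually used:

```rust
// Sketch: compute d(x, L) and the uniform goodness
// delta = max_{x in I} d(x, L) by enumerating all subdomain sizes.

/// All subdomain sizes y = p_1^{y_1} * ... * p_k^{y_k} with 0 <= y_i <= e_i.
fn subdomain_sizes(factors: &[(u64, u32)]) -> Vec<u64> {
    let mut sizes = vec![1u64];
    for &(p, e) in factors {
        let mut next = Vec::new();
        for &s in &sizes {
            let mut pe = 1u64;
            for _ in 0..=e {
                next.push(s * pe);
                pe *= p;
            }
        }
        sizes = next;
    }
    sizes
}

/// d(x, L): smallest y/x over all y in L with y >= x.
fn d(x: u64, sizes: &[u64]) -> f64 {
    sizes
        .iter()
        .filter(|&&y| y >= x)
        .map(|&y| y as f64 / x as f64)
        .fold(f64::INFINITY, f64::min)
}

fn main() {
    // Toy domain of order n = 2^10 * 3^2 * 5, interval I = [100, 1000].
    let sizes = subdomain_sizes(&[(2, 10), (3, 2), (5, 1)]);
    let delta = (100u64..=1000).map(|x| d(x, &sizes)).fold(0.0f64, f64::max);
    println!("delta over [100, 1000]: {:.4}", delta);
}
```

Since 2^10 alone already covers the interval, delta here is bounded below 2; the extra factors 3^2 and 5 fill in intermediate sizes and push it much closer to 1.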
In connection with the task of searching for suitable BLS24 parameters (with group sizes of more than 256 bits), I ran a script over a slightly larger set than the one of Pratyush (starting from parameter x = 2^32). I filtered for those cases that allow sufficiently smooth FFT domains in both the base field and the scalar field (having both log2(n_r) and log2(n_q) >= 30, with prime factors <= 17). I further computed the goodnesses delta_r and delta_q for the interval I = [2^10, 2^25], using an explicit algorithm to solve the discrete max-min problem, and determined the performances T_r, T_q of their FFTs relative to the equivalent dyadic case T(dyadic) = T(2^(log2 n)) using the above formula. These are the results:
| x | bits | log2(n_r) | log2(n_q) | delta_r | delta_q | T(n_r)/T(dyadic) | T(n_q)/T(dyadic) |
|---|---|---|---|---|---|---|---|
| -4346495999 | 257 | 41.8051 | 32.0172 | 1.00689 | 1.02262 | 2.00932 | 1.68659 |
| -4505419775 | 257 | 43.3203 | 31.2246 | 1.02218 | 1.02602 | 1.66204 | 1.98561 |
| -4518481967 | 257 | 30.9689 | 31.2908 | 1.02354 | 1.02963 | 1.74369 | 1.78966 |
| 4796508718 | 258 | 32.5132 | 31.1912 | 1.01425 | 1.02354 | 1.84541 | 1.82744 |
| -4975851464 | 258 | 36.664 | 30.6266 | 1.0104 | 1.05488 | 1.85468 | 2.12234 |
| -5032487474 | 258 | 48.4004 | 34.2411 | 1.01089 | 1.02001 | 2.0661 | 1.51864 |
| -5131697570 | 259 | 39.3547 | 30.067 | 1.00782 | 1.04167 | 2.28689 | 2.46117 |
| -5155487999 | 259 | 34.2635 | 32.2635 | 1.0111 | 1.02262 | 1.80951 | 1.85969 |
| 5372915626 | 259 | 41.4331 | 32.8062 | 1.01169 | 1.0242 | 2.26872 | 2.31663 |
| -5482229024 | 259 | 49.8397 | 31.5071 | 1.01705 | 1.03822 | 2.00643 | 2.25346 |
| -5712214364 | 260 | 32.277 | 32.1718 | 1.02262 | 1.02392 | 1.85891 | 2.33123 |
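The cost ratios in the last two columns follow directly from the formula T(n) = n * Sum_i e_i * (p_i - 1) * (M + A) against the dyadic reference T(2^(log2 n)) = n * log2(n) * (M + A). A sketch of that calculation, on a toy factorization rather than the actual n_r/n_q values (whose factorizations are not reproduced here):

```rust
// Sketch: relative mixed-radix FFT cost
//   T(n) / T(dyadic) = (Sum_i e_i * (p_i - 1)) / log2(n),
// following the operation-count formula from the comment above.

fn cost_ratio(factors: &[(u64, u32)]) -> f64 {
    // n = p_1^e_1 * ... * p_k^e_k
    let n: f64 = factors
        .iter()
        .map(|&(p, e)| (p as f64).powi(e as i32))
        .product();
    // Per-element butterfly operations: Sum_i e_i * (p_i - 1)
    let ops: f64 = factors
        .iter()
        .map(|&(p, e)| e as f64 * (p - 1) as f64)
        .sum();
    ops / n.log2()
}

fn main() {
    // Toy order n = 2^10 * 3^2 * 5: larger prime factors raise the ratio,
    // a pure power of two gives exactly 1.0.
    println!("T(n)/T(dyadic) = {:.3}", cost_ratio(&[(2, 10), (3, 2), (5, 1)]));
    println!("T(2^20)/T(dyadic) = {:.3}", cost_ratio(&[(2, 20)]));
}
```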
Thanks for the detailed analysis! It's a good point that we can easily compute the best fit for a given problem size. If I'm understanding correctly, the takeaway is that these FFTs are at most 2x slower, right?
As an aside, there is a slight downside to considering fields of size > 255 bits: you have to add another limb to all calculations on the scalar field, which includes constraint generation, witness generation, FFTs, etc. I don't know of a way to simplify this, beyond making a different BigInteger type (for example: BigInt264([u64; 4], u8), and even then, I don't know if it would make a concrete difference due to alignment issues). (I'm assuming that these curves still have base fields < 320 bits, because otherwise one would lose the benefit of using BLS24 curves.)
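The BigInt264 idea mentioned above could look roughly like this. Entirely hypothetical (the name and layout come from the comment, not from any existing library), and it sidesteps the alignment question the comment raises:

```rust
// Hypothetical sketch of a 264-bit integer: four 64-bit limbs plus a single
// extra byte for bits 256..264, instead of padding to a full fifth u64 limb.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct BigInt264 {
    limbs: [u64; 4],
    top: u8, // bits 256..264
}

impl BigInt264 {
    /// Wrapping addition with carry propagation through the limbs
    /// into the top byte.
    fn add(&self, other: &Self) -> Self {
        let mut limbs = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            let (s1, c1) = self.limbs[i].overflowing_add(other.limbs[i]);
            let (s2, c2) = s1.overflowing_add(carry);
            limbs[i] = s2;
            carry = (c1 as u64) + (c2 as u64);
        }
        let top = self.top.wrapping_add(other.top).wrapping_add(carry as u8);
        Self { limbs, top }
    }
}

fn main() {
    let a = BigInt264 { limbs: [u64::MAX, 0, 0, 0], top: 0 };
    let b = BigInt264 { limbs: [1, 0, 0, 0], top: 0 };
    // The carry out of limb 0 ripples into limb 1.
    println!("{:?}", a.add(&b));
}
```

Note that `(u64; 4], u8)` occupies 33 bytes of payload but will typically be padded to 40 bytes anyway for 8-byte alignment, which is exactly the concern raised above.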
Yes, if one wants such a small approximation goodness (2% relative error) for that instance interval (with upper bound 2^25), one has to expect a roughly twice as slow FFT. Actually, it turned out that my script started at too high an x, searching only curves of at least 257 bits. I have now restarted it with a modified range for x and will post my findings in a few days.
About the sizes of the base field modulus: the first three rows in the table have a 320-bit modulus, the others 321-323 bits.
First results from the running script:
In particular, the last one, with smoothness bound 11 for both domains, is interesting. Its approximation goodness comes with a relative error < 1.7%, and the FFT slowdown is only about a factor of 1.5.
The current integration of mixed-radix FFTs is kind of unsatisfactory. There are two issues:
To resolve this, we might want to implement a "dynamic" domain that automatically chooses a subgroup whose size matches the required input size as closely as possible. This is different from the behaviour of something like GeneralEvaluationDomain, which tries to fit things into radix-2 if possible, and only goes for larger domain sizes if radix-2 is insufficient. This would also unify the existing Radix2Domain, MixedRadixDomain, and GeneralEvaluationDomain impls.

cc @weikengchen @paberr