arkworks-rs / algebra

Libraries for finite field, elliptic curve, and polynomial arithmetic
https://arkworks.rs
Apache License 2.0

Understand trade-offs of multi-radix domains and FFTs #24

Open Pratyush opened 4 years ago

Pratyush commented 4 years ago

The current integration of mixed-radix FFTs is kind of unsatisfactory. There are two issues:

cc @weikengchen @paberr

ValarDragon commented 4 years ago

A relevant data point is how FFTW does this. They have the caller create a "plan" for FFTs based on the domain size, which then AFAIU does precompilation and preprocessing for the exact domains and FFT plan they end up using. http://www.fftw.org/fftw-paper.pdf
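As a rough illustration of the "plan" idea (the names and structure here are hypothetical, not FFTW's or arkworks' actual API): a plan could factor the domain size once at construction time and cache the resulting stage schedule, so that repeated FFT calls over the same domain skip the setup work.

```rust
/// Hypothetical sketch of an FFTW-style plan: built once per domain size,
/// with the radix schedule precomputed instead of rediscovered per call.
#[derive(Debug, PartialEq)]
pub struct FftPlan {
    pub size: usize,
    /// Radix of each butterfly stage, smallest first, e.g. [2, 2, 2, 5] for n = 40.
    pub stages: Vec<usize>,
}

impl FftPlan {
    pub fn new(size: usize) -> Self {
        // Factor the domain size into a stage schedule by trial division.
        let mut stages = Vec::new();
        let mut n = size;
        let mut p = 2;
        while n > 1 {
            while n % p == 0 {
                stages.push(p);
                n /= p;
            }
            p += 1;
        }
        FftPlan { size, stages }
    }
}
```

A real plan would also precompute twiddle factors and scratch layouts; this sketch only captures the "do the analysis once, reuse it" shape of the design.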

paberr commented 4 years ago

That sounds like a good idea to me.

> • Even after fixing the above, there's still the issue that we only select the larger radix after exhausting all powers in radix-2. As suggested in scipr-lab/zexe#156 (comment), this can result in poor "resizing behaviour", which is demonstrated via the following example:
> Required size: 10
> GeneralEvaluationDomain: domain_size = 2^4 = 16
> DynamicEvaluationDomain: domain_size = 2 * 5 = 10

One question regarding this part: I would imagine that a pure radix-2 FFT is more efficient than a mixed-radix one. So, when the achievable domain sizes are close, a slightly larger radix-2 FFT might be preferable to a mixed-radix one, no? I haven't had a look at the FFTW paper yet, but perhaps they do have some insights into the trade-offs here.
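To make the two selection strategies concrete, here is a small sketch (hypothetical helper names; the real evaluation-domain logic is more involved) for the example above, where the mixed-radix domain only uses the prime factors 2 and 5:

```rust
/// Smallest power-of-two domain covering `target`
/// (the GeneralEvaluationDomain choice in the example).
pub fn radix2_domain_size(target: usize) -> usize {
    target.next_power_of_two()
}

/// Smallest product 2^a * 5^b covering `target`, sketching the
/// DynamicEvaluationDomain behaviour from the example; real mixed-radix
/// domains may admit other prime factors.
pub fn mixed_radix_domain_size(target: usize) -> usize {
    let mut best = usize::MAX;
    let mut pow5 = 1usize;
    while pow5 < best {
        // Pad each power of 5 up with factors of 2 until it covers target.
        let mut n = pow5;
        while n < target {
            n *= 2;
        }
        best = best.min(n);
        pow5 *= 5;
    }
    best
}
```

For a required size of 10 this reproduces the quoted example (16 vs. 10); the trade-off question is then whether the smaller mixed-radix domain is actually faster per element.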

UlrichHaboeck75 commented 4 years ago

My five cents on the smoothness of the FFT domain and its approximation goodness for target sizes x from a given interval I = [x_min, x_max].

Given an FFT domain of mixed order n = p_1^e_1 * ... * p_k^e_k, with p_i the prime factors and e_i their exponents, let d(x, L) = min_{y in L, y >= x} y/x be the minimal 'distance' of x to the discrete set L of all possible subdomain sizes y = p_1^{y_1} * ... * p_k^{y_k} with y_i from {0, 1, ..., e_i}. In our case the number of such combinations is small enough to find the best approximation via enumeration, and even plots of d(x, L) as x varies are easily produced to get an impression of the uniform approximation goodness

delta = max_{x in I} d(x,L).
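Since the number of exponent combinations is small, both L and d(x, L) can be computed by brute-force enumeration. A sketch (hypothetical helper names; it assumes the y >= x convention, consistent with the delta values >= 1 in the table below):

```rust
/// All achievable subdomain sizes y = prod p_i^{y_i}, 0 <= y_i <= e_i,
/// for a mixed-order n given as (prime, exponent) pairs.
pub fn subdomain_sizes(factors: &[(u64, u32)]) -> Vec<u64> {
    let mut sizes = vec![1u64];
    for &(p, e) in factors {
        let mut next = Vec::new();
        for &s in &sizes {
            // Multiply each existing size by p^0, p^1, ..., p^e.
            let mut pe = 1u64;
            for _ in 0..=e {
                next.push(s * pe);
                pe *= p;
            }
        }
        sizes = next;
    }
    sizes.sort();
    sizes
}

/// d(x, L): smallest ratio y/x over y in L with y >= x.
pub fn distance(x: u64, sizes: &[u64]) -> f64 {
    sizes
        .iter()
        .filter(|&&y| y >= x)
        .map(|&y| y as f64 / x as f64)
        .fold(f64::INFINITY, f64::min)
}
```

Computing delta over an interval I is then just a max of `distance` over the sampled x.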

In general, given I, one has to oversize the FFT domain to achieve a reasonably small approximation goodness (corresponding to a relative error < 5%, say), and the more prime factors n has the better. However, having too large p_i increases the computational complexity of the mixed FFT, which is

T(n) = n * Sum_i e_i * (p_i - 1) (M + A),

where M and A stand for a field multiplication and addition, respectively. In practice the increase is acceptable (typically a factor of 2 relative to the duadic case, see the table below) and is outweighed by the benefit of a small delta.
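The operation-count model can be written down directly; the sketch below (hypothetical helper names) counts cost in units of (M + A) and compares against the duadic baseline T(2^ceil(log2 n)):

```rust
/// Cost model T(n) = n * sum_i e_i * (p_i - 1), in units of (M + A),
/// for n given as (prime, exponent) pairs.
pub fn mixed_cost(factors: &[(u64, u32)]) -> u64 {
    let n: u64 = factors.iter().map(|&(p, e)| p.pow(e)).product();
    n * factors.iter().map(|&(p, e)| e as u64 * (p - 1)).sum::<u64>()
}

/// Baseline: the same model applied to the smallest power of two >= n,
/// i.e. T(2^m) = 2^m * m.
pub fn duadic_cost_for(n: u64) -> u64 {
    let np = n.next_power_of_two();
    np * np.trailing_zeros() as u64
}
```

For example, n = 10 = 2 * 5 gives T(10) = 10 * (1 + 4) = 50, versus the duadic baseline T(16) = 64; the T(n)/T(duadic) ratios in the table below are computed from the same formula at much larger sizes.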

In connection with the task of searching for suitable BLS24 parameters (with group sizes of more than 256 bits), I ran a script over a slightly larger set than Pratyush's (starting from parameter x = 2^32). I filtered for those cases which allow sufficiently smooth FFT domains in both the base field and the scalar field (both log2(n_r) and log2(n_q) >= 30 bits, with prime factors <= 17). I further computed the goodnesses delta_r and delta_q for the interval I = [2^10, 2^25] using an explicit algorithm to solve the discrete max-min problem, and determined the costs T_r, T_q of their FFTs relative to the equivalent duadic case T(duadic) = T(2^ceil(log2 n)) using the above formula. These are the results:

| x | bits | log2(n_r) | log2(n_q) | delta_r | delta_q | T(n_r)/T(duadic) | T(n_q)/T(duadic) |
|---|---|---|---|---|---|---|---|
| -4346495999 | 257 | 41.8051 | 32.0172 | 1.00689 | 1.02262 | 2.00932 | 1.68659 |
| -4505419775 | 257 | 43.3203 | 31.2246 | 1.02218 | 1.02602 | 1.66204 | 1.98561 |
| -4518481967 | 257 | 30.9689 | 31.2908 | 1.02354 | 1.02963 | 1.74369 | 1.78966 |
| 4796508718 | 258 | 32.5132 | 31.1912 | 1.01425 | 1.02354 | 1.84541 | 1.82744 |
| -4975851464 | 258 | 36.664 | 30.6266 | 1.0104 | 1.05488 | 1.85468 | 2.12234 |
| -5032487474 | 258 | 48.4004 | 34.2411 | 1.01089 | 1.02001 | 2.0661 | 1.51864 |
| -5131697570 | 259 | 39.3547 | 30.067 | 1.00782 | 1.04167 | 2.28689 | 2.46117 |
| -5155487999 | 259 | 34.2635 | 32.2635 | 1.0111 | 1.02262 | 1.80951 | 1.85969 |
| 5372915626 | 259 | 41.4331 | 32.8062 | 1.01169 | 1.0242 | 2.26872 | 2.31663 |
| -5482229024 | 259 | 49.8397 | 31.5071 | 1.01705 | 1.03822 | 2.00643 | 2.25346 |
| -5712214364 | 260 | 32.277 | 32.1718 | 1.02262 | 1.02392 | 1.85891 | 2.33123 |

Pratyush commented 4 years ago

Thanks for the detailed analysis! It's a good point that we can easily compute the best fit for a given problem size. If I'm understanding correctly, the takeaway is that the FFTs are at most 2x slower, right?

As an aside, there is a slight downside to considering fields of size > 255 bits: you have to add another limb to all calculations on the scalar field, which includes constraint generation, witness generation, FFTs, etc. I don't know of a way to simplify this, beyond making a different BigInteger type (for example: BigInt264([u64; 4], u8)), and even then I don't know whether it would make a concrete difference, due to alignment issues. (I'm assuming that these curves still have base field < 320 bits, because otherwise one would lose the benefit of using BLS24 curves.)
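For what it's worth, the alignment concern can be checked directly: a (hypothetical) BigInt264 laid out as ([u64; 4], u8) is padded by the compiler to a multiple of its 8-byte alignment, i.e. the same 40 bytes as a plain five-limb integer.

```rust
/// Hypothetical BigInt264 as suggested above: four full 64-bit limbs
/// plus one u8 for the bits above 256. Its 33 bytes of payload are
/// padded to 40 bytes, matching a plain [u64; 5].
#[allow(dead_code)]
pub struct BigInt264([u64; 4], u8);

/// Plain five-limb 320-bit integer for comparison.
#[allow(dead_code)]
pub struct BigInt320([u64; 5]);

/// Report the in-memory sizes of both layouts.
pub fn layout_sizes() -> (usize, usize) {
    (
        std::mem::size_of::<BigInt264>(),
        std::mem::size_of::<BigInt320>(),
    )
}
```

So the smaller type saves no memory; any win would have to come from cheaper arithmetic on the short top limb, not from layout.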

UlrichHaboeck75 commented 4 years ago

Yes, if one wants such a small approximation goodness (2% relative error) for that instance interval (with upper bound 2^25), one has to expect an FFT about twice as slow. Actually, it turned out that my script started at too high an x, searching only curves of at least 257 bits. I have now restarted it with a modified range for x, and will post my findings in a few days.

About the size of the base field modulus: the first three rows in the table have a 320-bit modulus, the others 321-323 bits.

UlrichHaboeck75 commented 4 years ago


First results from the running script: *(screenshot: Screenshot from 2020-11-04 13-11-02)*

In particular, the last row, with smoothness bound 11 for both domains, is interesting. Its approximation goodness comes with a relative error < 1.7%, and the increase in FFT cost is only about a factor of 1.5.