Implement BLS24-319 Curve

jon-chuang commented 4 years ago

Why? 319-bit prime modulus reduces G1 cost by approx 1.44x vis a vis BLS12-381/377. Constructing curves from Cocks-Pinch method (only known method to generate pairing-friendly curves of prescribed order) doubles the size of the base field, so we want to it to be small to begin with.

Since we assume the leaves of an aggregating proof system have lower proving power, we believe this is an adequate strategy for 1 layer or even 2 layer proof composition.

Future work Future work includes finding suitable Cocks-Pinch Curves for proof composition with BLS24-319. Hopefully, we can find good curves with log_2 p <= 639 and for 2nd layer, log_2 p <= 1279 with highly efficient pairings. Note that the size of the pairing as an arithmetic circuit is about 2x that of BLS12, so the outer most curve has to be at least 2x more efficient (in terms of machine ops/gas costs etc.).

The BW6-761 curve had a pairing time of about 10ms (albeit for a relatively unoptimised implementation), compared with ~2.5 ms for a curve over a prime with half the number of bits, which is a 4x difference. That being said, a slightly over par 1.5x to 2x improvement (on the basis of a smaller base field) seems within reach.

Nonetheless, as one approaches larger field sizes, the costs saved from using a smaller base field may not make up for larger circuits if optimised algorithms like Karatsuba multiplication is used. So one would be trading proof capacity for G1 e.g. proving performance.

BLS48-286 Further, the BLS48-286 curve should also be considered. According to my estimates, they should be 1.5-2x as slow for verification compared to BLS12, but proving could be much faster (9 words on a 32-bit machine). The real benefits can be seen in the outer layer curve, which could have modulus < 572 bits.

Note that this has a much large arithmetic circuit compared with BLS12 - approximately 2.5-5x. So to reap the benefits, the cost of verifying the outer curve has to be < 2.5-5x that of the BW6-761 curve. Based on the discrepancy of base field, a speedup of 1.5-2x is more likely.

Overall, due to the possibly high complexity of the pairing for BLS48, BLS24 seems like a more promising route for now.

Things to investigate

Exponentiation formulas and explicit number of field ops required for BLS48 pairing

Pratyush commented 4 years ago

Yeah, it would be useful to have generic infrastructure for such BLS24 curves too.

Pratyush commented 4 years ago

One issue with these curves of higher embedding degree curves is that G2 arithmetic becomes slower. Some SNARKs, like Groth16 and GM17, rely on G2 arithmetic, so this might not be the best trade-off in all cases.

UlrichHaboeck75 commented 4 years ago

Btw. a 2020 paper of Housni and Guillevic constructs a Brezing-Weng curve on top of the BLS12-377. This approach does not save much field size (761 instead of sw6's 782), but still worth considering it as an alternative to Cocks-Pinch.

mratsim commented 4 years ago

I stumbled across this paper Opcount: A Pseudo-Code Performance Estimation System for Pairing-Based Cryptography where they measured pairing times for BLS24 and BLS48.

I was surprised to see BLS48 being twice faster than BLS24 at the same security level (256-bit) in the figures (estimated figures): which was confirmed in their actual measures

Database is here with curve parameters (the 128/256 is the security level not the field bitwidth): https://github.com/security-kouza/opcount/tree/master/resources/pdb

Note that BLS24 has a 8-way decomposition on G2 and GT while BLS48 has a 16-way decomposition though at that level of decomposition the gain in scalar bits might be dwarfed by the additions of the linear combination of coefficient (it does make Pippenger very attractive).

UlrichHaboeck75 commented 4 years ago

Hm, I do not know the signature scheme measured here (it's a socalled structure-preserving one using Groth-Sahai proofs), but the timings for scalar mult. in G1, G2, and GT might be due to the embedding degree: for same security level (hence the same size of embedding fields), BLS48 curves allow a much smaller field size than BLS24. For example, at 128 bit security, the BLS24-381 mentioned here has a 381 bit base field modulus, whereas the BLS48-286 comes with a 286 bit field. Smaller field size means much more performant G1 scalar mult., same security level is reflected in similar timings for the exp. in GT, but what I did not expect is that the smaller field size overcomes the higher extension degree for G2.

mratsim commented 4 years ago

Indeed,

their BLS48 curve is BLS48-581 (https://github.com/security-kouza/opcount/blob/master/resources/pdb/KH17_BLS48_256.pdb#L8-L18) with a 518 bits order.
their BLS24 curve is BLS24-1089 (https://github.com/security-kouza/opcount/blob/master/resources/pdb/KH17_BLS24_256.pdb#L10-L31) with a 872-bit order.

Pratyush commented 4 years ago

Btw, there's a slight issue with generating BLS24 curves that are doubly two-adic (like BLS12-377): there's only 2^32 possible seeds that fit lead to a scalar field of size <256bits. I enumerated all seeds, and the best doubly-two-adic curve I found had two-adicity of 2^22 and 2^20 on r and q, respectively. This is much too small for applications.

By looking at larger subgroups (2^x 3^y 5^z), I found the following curve:

x is negative: true
x mod 360 = 73
weight(x) = 22
x = -2975367167
q = 18125266818166605849461503394057768008988119390110768226476263130571508416807197345793683456001
r = 6142208155228012302427570697063512528528236733217057581650738834837226782721
log_2(x) = 32
log_2(q) = 314
log_2(r) = 252
r subgroup size = 25.661778097771986
q subgroup size = 28.305634287546713

There are two primary downsides: the awkward subgroup, and the high hamming weight of x, which makes pairings slower. (On the other hand, the shorter size of x results in fewer doubling steps in the miller loop).

UlrichHaboeck75 commented 4 years ago

@Pratyush: I wouldn't regard mixed FFT domains as downside but rather as a useful feature, in particular for proof systems which rely on polynomial identites over cyclic domains (as Marlin or Plonk, e.g.). Different factors allow to fit the FFT domain much better to the number of constraints (instead of jumping to the next power of 2) hence reducing the prover's commitment effort (beside that polynomial arithmetics itself becomes increasingly noticeable at domain sizes > 2^23 ). Inspired by your result I started a Mathematica script to search such mixed domains for a slightly wider set of x (reaching from 2^31 - 2^33). By now the best result is at x=2196504577, yielding a 249 bit group and

r -1 = 2^20 * 3^2 * 7^2,
q - 1 = 2^18 * 3^2 * 7^2,

with 28.78 and 26.78 as their log2, respectively.

P.S. the Mathematica script is awfully slow, I don't expect the full run finished before end of the week.

mratsim commented 4 years ago

I'm not a specialist of finite fields FFT, will look into it later but for FFT on real/complex the most efficient radix are in order: 4, 2, 3, 5: https://github.com/SciNim/impulse/blob/26e25e7/impulse/fft/pocketfft_hdronly.h#L2126-L2133

See also a compilation of my research for real/complex FFT at https://github.com/SciNim/impulse/blob/26e25e7/impulse/fft/README.md

In the long-term, a compiler approach might help optimize FFT kernels just like Halide did over FFTW: https://github.com/halide/Halide/blob/864955e/apps/fft/fft.cpp https://andrew.adams.pub/halide_cacm.pdf

weikengchen commented 3 years ago

Ale mentioned that this paper is related: https://eprint.iacr.org/2021/1359.pdf

And yes, using BLS-24 gives us the advantage that the prime can be smaller (due to the higher embedding degree), and it is useful especially when you need to apply another layer over it.

mratsim commented 3 years ago

@yelhousni is the paper author.

yelhousni commented 3 years ago

@yelhousni is the paper author.

Thank you for tagging me. This is a joint work with @auroreguillevic.

arkworks-rs / algebra

Implement BLS24-319 Curve #733