jon-chuang opened this issue 4 years ago
Thanks for the comprehensive comparison! I agree that there seem to be some nice avenues for optimization here! Let's collect here the things that goff
is doing to optimize arithmetic.
Nice, interesting! It seems that for G1 additions, gnark < zexe < zkcrypto, while gnark is slowest in the full pairing.
Dear all, after some more exploration, here are my findings:
I applied Goff's suggested no-carry optimisation and got a speedup of 3-5ns (down from 52-54ns), less than advertised but still significant. The full pairing was about 6% faster with this optimisation.
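For readers unfamiliar with the structure being discussed: the no-carry trick applies to the multi-limb CIOS Montgomery multiplication loop, where a spare bit in the modulus's top limb lets the inner additions skip carry propagation. The single-limb sketch below shows only the basic Montgomery multiply-and-reduce shape (the modulus is an arbitrary demo prime, not a curve field; this is illustrative, not zexe's implementation):

```rust
/// Compute -n^{-1} mod 2^64 for odd n via Newton iteration
/// (precision doubles each step: 6 iterations reach 64 bits).
fn neg_inv(n: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Single-limb Montgomery multiplication: returns a * b * 2^{-64} mod n.
fn mont_mul(a: u64, b: u64, n: u64, n_inv: u64) -> u64 {
    let t = (a as u128) * (b as u128);
    let m = (t as u64).wrapping_mul(n_inv); // m = t * (-n^{-1}) mod 2^64
    let mn = (m as u128) * (n as u128);
    // The low 64 bits of t + m*n are zero by construction, so we only
    // add the high halves, plus the carry out of the (cancelled) low halves.
    let carry = ((t as u64) != 0) as u128;
    let mut u = (t >> 64) + (mn >> 64) + carry;
    let nn = n as u128;
    if u >= nn {
        u -= nn; // result is < 2n, so one conditional subtraction suffices
    }
    u as u64
}

fn main() {
    let n: u64 = 0xffff_ffff_0000_0001; // demo 64-bit prime, not a curve modulus
    let n_inv = neg_inv(n);
    // Convert to Montgomery form (x * 2^64 mod n), multiply, convert back.
    let to_mont = |x: u64| (((x as u128) << 64) % (n as u128)) as u64;
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let prod_mont = mont_mul(to_mont(a), to_mont(b), n, n_inv);
    let prod = mont_mul(prod_mont, 1, n, n_inv); // multiplying by plain 1 strips the 2^64 factor
    assert_eq!(prod as u128, (a as u128 * b as u128) % (n as u128));
}
```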
I rewrote the multiplication as a compact loop and used the `unroll` crate to make sure it was unrolled. Doing this did not seem to affect performance at all vis-à-vis writing the unrolled loops out explicitly, so I suggest keeping this style, as it is much more readable and compact.
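For readers unfamiliar with the crate: its `#[unroll_for_loops]` attribute mechanically rewrites loops with literal bounds into straight-line code. The transformation is equivalent to the hand-written pair below (plain Rust with no external crate; the four-limb example is arbitrary, chosen only for illustration):

```rust
// The compact-loop style over four limbs...
fn sum_loop(limbs: &[u64; 4]) -> u64 {
    let mut acc = 0u64;
    for i in 0..4 {
        acc = acc.wrapping_add(limbs[i]);
    }
    acc
}

// ...and the straight-line code that loop unrolling produces mechanically.
fn sum_unrolled(limbs: &[u64; 4]) -> u64 {
    let mut acc = 0u64;
    acc = acc.wrapping_add(limbs[0]);
    acc = acc.wrapping_add(limbs[1]);
    acc = acc.wrapping_add(limbs[2]);
    acc = acc.wrapping_add(limbs[3]);
    acc
}

fn main() {
    let limbs = [1u64, 2, 3, u64::MAX];
    // Both forms compute the same result; the unrolled one avoids loop overhead.
    assert_eq!(sum_loop(&limbs), sum_unrolled(&limbs));
}
```

Since the bounds are literals, the compiler often unrolls such loops anyway, which matches the observation above that explicit unrolling made no measurable difference.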
With regards to G1 timings, the gap was narrowed by about half, but I should check back once squaring is optimised too.
My suggestions:
Thanks for the investigation @jon-chuang! If you could make a PR with the Goff changes and with the `unroll!` macro, that would be awesome!
Re: benchmarking it’s a slightly more involved process, as there’s a lot of code that would have to be changed :/
@Pratyush Actually, I don't think so. I will write some macros for benches for the algebra-benches crate. I will iterate by 1000 for everything except G1/G2 muls, pairing-related ops, and sqrt.
Not all of the curves have the same extension fields, I will figure out how I want to tackle this. I will split up the macros by type.
By the way, is it normal that one needs `+nightly` for `#![feature(test)]`?
I believe more accurate benchmarking is mission critical and also makes the Zexe library more attractive.
If you'd like to update the benchmarks anyway, you might want to use something like `criterion`.
This would also solve the issue with requiring nightly for benches.
However, Criterion is a lot slower and more verbose; we may not want it for minor benchmarks. It already takes ages to run the full benchmark as it is.
Although, I could try to find a way to tweak the sampling time. Further, I should look into whether there is similar functionality to `cargo benchcmp`.
I noticed a troubling phenomenon: G1 doubling seems to be about 5% slower after changing to no-carry multiplication. This seems to affect G1 scalar multiplication as well. @Pratyush, can you replicate this?
I believe this would help us diagnose why G1 ops are much slower than in gnark. Could it be something to do with inlining or memory usage (e.g. cache-friendliness)? It seems odd that even though our Fp ops are faster than or on par with goff's, and even though we both use the exact same formulas, the Go implementation would be faster than the Rust one. It's a bit of an enigma; I don't know if careful profiling will help.
I did a final careful benchmarking of the recent PR arkworks-rs/snark#163. My first recommendation is to try adding `RUSTFLAGS="--emit=asm"`, as this can improve speed significantly.
Here is a graphic summarising my results :) Regressions are largely noise from small functions.
I also find that Fq `mul_assign` is about 7-15% faster, `square_in_place` is about 5% faster, `into_repr` is about 5-12% faster, and pairing computations are about 1.5-6% faster.
The timings I got vis-à-vis goff/gnark are essentially the same as in my intermediate report. G1 is the only area where we are still lagging behind.
New avenues of optimisation I would like to look into are G1 `add_assign_mixed`, which I believe is part of the bottleneck for G1 `mul_assign`. However, the majority of the gap seems to come from the Pippenger exponentiation used in gnark, where they compute a Pippenger table beforehand. So this is something I will work on for now, in parallel with my intention to add WebGPU support for the parallel form of multiexponentiation.
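For reference, the bucket method at the heart of Pippenger-style multi-scalar multiplication can be sketched as below, using the additive group of integers mod a prime as a stand-in for G1 points (the window width, the `u32` scalars, and the demo prime are all arbitrary illustration choices, not what gnark or zexe actually use):

```rust
/// Toy Pippenger MSM: computes sum_i scalars[i] * points[i] in Z_p (additive).
fn pippenger(scalars: &[u32], points: &[u64], p: u64) -> u64 {
    let c = 4usize; // window width in bits (a tuning parameter)
    let windows = (32 + c - 1) / c;
    let mut window_sums = Vec::with_capacity(windows);
    for w in 0..windows {
        // One bucket per non-zero window value 1..2^c - 1.
        let mut buckets = vec![0u64; (1 << c) - 1];
        for (&s, &pt) in scalars.iter().zip(points.iter()) {
            let idx = ((s >> (w * c)) & ((1u32 << c) - 1)) as usize;
            if idx > 0 {
                buckets[idx - 1] = (buckets[idx - 1] + pt) % p;
            }
        }
        // Running-sum trick: computes sum_i (i+1) * buckets[i] with 2 passes
        // of additions instead of multiplications.
        let (mut running, mut acc) = (0u64, 0u64);
        for &b in buckets.iter().rev() {
            running = (running + b) % p;
            acc = (acc + running) % p;
        }
        window_sums.push(acc);
    }
    // Combine windows, most significant first: result = result * 2^c + window_sum.
    let mut result = 0u64;
    for &ws in window_sums.iter().rev() {
        for _ in 0..c {
            result = (result * 2) % p; // stands in for point doubling
        }
        result = (result + ws) % p;
    }
    result
}

fn main() {
    let p = 1_000_000_007u64; // demo prime
    let scalars = [3u32, 7, 255];
    let points = [10u64, 20, 30];
    let naive: u64 = scalars
        .iter()
        .zip(points.iter())
        .map(|(&s, &pt)| s as u64 * pt % p)
        .sum::<u64>() % p;
    assert_eq!(pippenger(&scalars, &points, p), naive);
}
```

The precomputed table gnark builds amortises the bucket setup across many MSMs with the same point set, which is why it pays off for provers.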
By the way, there are no benchmarks for MSM, but this is a very critical number for provers.
I think the issue with scalar multiplication should be gone now? From local benchmarks, `mixed_add`, `add`, and `double` are all faster.
Btw, the scalar-multiplication benchmarks are noisy because the scalars are random and have varying Hamming weight, so that's something to keep in mind.
@Pratyush nonetheless, the `scalar_mul` timings are still significantly slower (1.22x) than gnark's. Further, the `mixed_add` timings are 1.06x slower. This is a finding I repeatedly arrive at.
I also find that inverse is significantly slower than goff's (2.2x: 13us vs goff's 6us). Further, sqrt is also very slow (1.5x slower: 75/54us vs goff's ~25/54us). How performance-critical are these, @Pratyush?
I think I will run more detailed 1000-fold benchmarks to get to the bottom of which primitive operations (e.g. add/div2) are responsible. This could also be another benchmarking-difference issue in terms of the randomness; perhaps I should check whether I'm looping through random elements.
Hmm, inversion isn't strictly performance-critical, because the elliptic-curve code tries to minimize it. The sqrt code, however, would be helpful to optimize, as it is critical for point compression.
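For context on why sqrt matters for point compression: when p ≡ 3 (mod 4), which holds for BLS12-381's base field, the square root is a single exponentiation, so its cost is essentially ~log p field multiplications. A toy version over a small prime (the prime and function names are illustration choices):

```rust
/// Modular exponentiation by repeated squaring.
fn pow_mod(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc = 1u64;
    base %= p;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = (acc as u128 * base as u128 % p as u128) as u64;
        }
        base = (base as u128 * base as u128 % p as u128) as u64;
        exp >>= 1;
    }
    acc
}

/// For p ≡ 3 (mod 4), a^((p+1)/4) is a square root of a whenever one exists.
fn sqrt_mod(a: u64, p: u64) -> Option<u64> {
    debug_assert_eq!(p % 4, 3);
    let r = pow_mod(a, (p + 1) / 4, p);
    // Verify the candidate: quadratic non-residues have no root.
    if (r as u128 * r as u128 % p as u128) as u64 == a % p {
        Some(r)
    } else {
        None
    }
}

fn main() {
    let p = 1_000_000_007u64; // demo prime with p % 4 == 3
    let r = sqrt_mod(1234 * 1234 % p, p).unwrap();
    assert!(r == 1234 || r == p - 1234); // either root is valid
}
```

Point decompression runs exactly one such sqrt per point, which is why its speed matters there.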
(That being said, having faster algorithms is always welcomed =D)
Also, thanks again for the thorough benchmarks!
I see. Is there any scenario in which batch inversion actually helps performance? Perhaps for something like pippenger-scalar mul/MSM?
I am thinking of affine coordinates + batch inversion.
Currently our MSM code takes in affine coordinates and uses mixed addition.
How about batch inversion in particular?
Hmm since batch_inversion performs O(N) multiplications but only one inversion, the cost of a single inversion is usually drowned out for large N. However, using batch inversion with a small N might make the additional cost of inversion significant again.
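For concreteness, Montgomery's batch-inversion trick trades N inversions for one inversion plus 3(N-1) multiplications. A toy sketch over a small prime (a real implementation would operate on the library's Fp type; the helper names here are made up for illustration):

```rust
const P: u64 = 1_000_000_007; // demo prime

fn mul(a: u64, b: u64) -> u64 {
    (a as u128 * b as u128 % P as u128) as u64
}

/// Single inversion via Fermat's little theorem: a^(p-2) mod p.
fn inv(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Montgomery's trick: invert every (non-zero) element in place using
/// one field inversion plus O(N) multiplications.
fn batch_inverse(v: &mut [u64]) {
    if v.is_empty() {
        return;
    }
    // prefix[i] = v[0] * v[1] * ... * v[i]
    let mut prefix = Vec::with_capacity(v.len());
    let mut acc = 1u64;
    for &x in v.iter() {
        acc = mul(acc, x);
        prefix.push(acc);
    }
    // The only inversion: invert the total product.
    let mut inv_acc = inv(acc);
    // Walk backwards, peeling one element off the product at a time.
    for i in (1..v.len()).rev() {
        let xi = v[i];
        v[i] = mul(inv_acc, prefix[i - 1]); // = (v[0]...v[i])^{-1} * (v[0]...v[i-1]) = v[i]^{-1}
        inv_acc = mul(inv_acc, xi);
    }
    v[0] = inv_acc;
}

fn main() {
    let mut v = [2u64, 3, 5];
    batch_inverse(&mut v);
    assert_eq!(mul(v[0], 2), 1);
    assert_eq!(mul(v[1], 3), 1);
    assert_eq!(mul(v[2], 5), 1);
}
```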
I see. But where is batch inversion used? You mentioned it is used for mixed addition? Or is that standard addition?
> I also find that inverse is significantly slower than goff (2.2x 13us vs goff's 6us). Further, sqrt is also very slow (1.5x slower, 75/54us vs goff's ~25/54us). How performance critical are these @Pratyush?
Goff's inversion is not constant-time. If constant time is desired, you cannot compare it with the usual inversion based on Fermat's little theorem.
Also, as mentioned by @Pratyush, for pairing in an Fp12 tower, the tower setup passes the inversion down the tower, so the ratio compared to an Fp operation stays acceptable, and we can use Jacobian/projective coordinates to avoid the inversion tax (at the price of more additions/multiplications).
More discussion on fast constant-time modular inversion here: https://github.com/randombit/botan/issues/1479#issuecomment-602178718
Also, AFAIK the fastest open-source pairing library is MCL, provided your CPU is Broadwell or later and supports the MULX/ADOX/ADCX instructions. The gain is about 15% on my machine; see benchmarks at https://github.com/mratsim/constantine/pull/18. (Note that in those benchmarks I was still using inversion via Fermat's-little-theorem exponentiation; my new constant-time inversion algorithm is about 30% faster, though still 10x slower than Goff's non-constant-time version or GMP.)
Since we are on par with your library, that would make us about 22-30 clock cycles behind MCL. That's roughly 25% extra overhead. Remarkable.
@Pratyush, given this amazing result, I may find some time to read the MCL code and implement the ADX instructions in a maintainable way, since x86 intrinsics do not support them due to lack of support in LLVM.
Hmm, I'm pretty sure we don't have a constant-time inversion algorithm; we use the binary Euclidean algorithm, which isn't constant-time: https://github.com/scipr-lab/zexe/blob/master/algebra-core/src/fields/macros.rs
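For reference, the variable-time binary Euclidean (binary extended GCD) inversion has the following shape on a single 64-bit word; the macro linked above runs the same idea over multi-limb bigints. This is an illustrative sketch, not zexe's code:

```rust
/// Variable-time inversion of a mod odd prime p via binary extended GCD.
/// Invariants maintained: b * a ≡ u (mod p) and c * a ≡ v (mod p).
fn binary_inverse(a: u64, p: u64) -> u64 {
    assert!(a != 0 && a < p && p % 2 == 1);
    let (mut u, mut v) = (a, p);
    let (mut b, mut c) = (1u64, 0u64);
    while u != 1 && v != 1 {
        while u & 1 == 0 {
            u >>= 1;
            // Halve b mod p: if b is odd, use (b + p) / 2, computed without overflow.
            b = if b & 1 == 0 { b >> 1 } else { (b >> 1) + (p >> 1) + 1 };
        }
        while v & 1 == 0 {
            v >>= 1;
            c = if c & 1 == 0 { c >> 1 } else { (c >> 1) + (p >> 1) + 1 };
        }
        // Subtract the smaller from the larger; mirror on the cofactors mod p.
        if u >= v {
            u -= v;
            b = if b >= c { b - c } else { b + p - c };
        } else {
            v -= u;
            c = if c >= b { c - b } else { c + p - b };
        }
    }
    if u == 1 { b } else { c }
}

fn main() {
    let p = 1_000_000_007u64; // demo prime
    let x = binary_inverse(123_456, p);
    assert_eq!(x as u128 * 123_456u128 % p as u128, 1);
}
```

The data-dependent branching and shift counts are exactly why this is fast but not constant-time.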
You can find Filecoin's new implementation using assembly-based optimizations at https://github.com/filecoin-project/blstrs; it would be interesting to see how this compares.
Numbers on my AMD Ryzen for blstrs:
Comparison: `curve-benches` bls12-381 from arkworks
Wow blstrs' fp12 multiplications are significantly faster! (The fp multiplication times are the same in both, but the fp12 multiplication in blstrs is half the time)
One of the things blst does is take advantage of #121 within its quadratic-extension arithmetic (see: https://github.com/supranational/blst/blob/master/src/fp12_tower.c#L309). We may need to actually return an unreduced form in #121 to make these additions into it worthwhile, though.
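The lazy-reduction idea in miniature: in an Fp2 multiplication with u^2 = -1, accumulate the double-width cross products unreduced and reduce each output coefficient only once. A toy sketch over a 31-bit demo prime (blst/MCL do this at full 381-bit width in assembly; everything here is an illustration):

```rust
const P: u64 = 2_147_483_647; // 2^31 - 1, demo prime small enough to avoid overflow

/// Eager reference: reduce after every product.
fn fp2_mul_eager(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let m = |x: u64, y: u64| (x as u128 * y as u128 % P as u128) as u64;
    let c0 = (m(a.0, b.0) + P - m(a.1, b.1)) % P; // a0*b0 - a1*b1
    let c1 = (m(a.0, b.1) + m(a.1, b.0)) % P;     // a0*b1 + a1*b0
    (c0, c1)
}

/// Lazy variant: keep double-width products unreduced and perform only one
/// modular reduction per output coefficient.
fn fp2_mul_lazy(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let (a0, a1, b0, b1) = (a.0 as u128, a.1 as u128, b.0 as u128, b.1 as u128);
    let pp = P as u128;
    // Offset the subtraction by p^2 (≡ 0 mod p) so the accumulator stays non-negative.
    let c0 = (a0 * b0 + pp * pp - a1 * b1) % pp;
    let c1 = (a0 * b1 + a1 * b0) % pp;
    (c0 as u64, c1 as u64)
}

fn main() {
    let (a, b) = ((3u64, 4u64), (5u64, 6u64));
    assert_eq!(fp2_mul_lazy(a, b), fp2_mul_eager(a, b));
}
```

At full width the savings come from skipping the expensive Montgomery reductions on intermediate sums, which is exactly what returning an unreduced form from #121 would enable.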
The latest cross-pairing libraries benchmarks were gathered by @yelhousni in https://hackmd.io/@zkteam/eccbench
I also noted all optimizations I am aware of for pairing libraries here: https://github.com/mratsim/constantine/blob/2c5e12d/docs/optimizations.md
@ValarDragon, BLST, MCL and Gurvy do indeed use unreduced multiplication in the higher fields of the towering.
The technique is described in the following paper by @herumi (the MCL author), and is generalized to all extension fields by @dfaranha (the Relic author).
Incredible how much faster pairing implementations have gotten in the past few years. Keep up the good work, you all! :)
@mratsim I think these are really great options to look into as our field towers are very slow vis a vis other libraries, even though we have improved on the Fp side. Thanks!
Hi all, I would like to share some benchmarks I ran on zexe, gnark/goff, and zkcrypto's BLS12-381. I believe this is a good place to start to identify the key avenues for optimisation.
From these results, it seems that the major areas for optimisation are:
More benchmarking should be done once Fq has been sufficiently optimised.
Yet to benchmark: zk-swap-libff/libff, barretenberg.
Results (BLS12-381):
- Fq: goff vs. Zexe
- G1: gnark vs. Zexe vs. zkcrypto
- Pairing: Zexe vs. zkcrypto vs. gnark