jon-chuang opened this issue 4 years ago
Thanks for the comprehensive comparison! I agree that there seem to be some nice avenues for optimization here! Let's collect here the things that goff
is doing to optimize arithmetic.
Nice, interesting! It seems that for G1 additions, gnark < zexe < zkcrypto, while gnark is slowest in the full pairing.
Dear all, after some more exploration, here are my findings:
I applied Goff's suggested no-carry optimisation and got a speedup of 3-5ns (down from 52-54ns), less than advertised but still significant. The full pairing was about 6% faster with this optimisation.
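For readers unfamiliar with the structure being discussed: the no-carry trick applies to the multi-limb CIOS Montgomery multiplication loop, where a spare bit in the modulus's top limb lets the inner additions skip carry propagation. The single-limb sketch below shows only the basic Montgomery multiply-and-reduce shape (the modulus is an arbitrary demo prime, not a curve field; this is illustrative, not zexe's implementation):

```rust
/// Compute -n^{-1} mod 2^64 for odd n via Newton iteration
/// (precision doubles each step: 6 iterations reach 64 bits).
fn neg_inv(n: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Single-limb Montgomery multiplication: returns a * b * 2^{-64} mod n.
fn mont_mul(a: u64, b: u64, n: u64, n_inv: u64) -> u64 {
    let t = (a as u128) * (b as u128);
    let m = (t as u64).wrapping_mul(n_inv); // m = t * (-n^{-1}) mod 2^64
    let mn = (m as u128) * (n as u128);
    // The low 64 bits of t + m*n are zero by construction, so we only
    // add the high halves, plus the carry out of the (cancelled) low halves.
    let carry = ((t as u64) != 0) as u128;
    let mut u = (t >> 64) + (mn >> 64) + carry;
    let nn = n as u128;
    if u >= nn {
        u -= nn; // result is < 2n, so one conditional subtraction suffices
    }
    u as u64
}

fn main() {
    let n: u64 = 0xffff_ffff_0000_0001; // demo 64-bit prime, not a curve modulus
    let n_inv = neg_inv(n);
    // Convert to Montgomery form (x * 2^64 mod n), multiply, convert back.
    let to_mont = |x: u64| (((x as u128) << 64) % (n as u128)) as u64;
    let (a, b) = (123_456_789u64, 987_654_321u64);
    let prod_mont = mont_mul(to_mont(a), to_mont(b), n, n_inv);
    let prod = mont_mul(prod_mont, 1, n, n_inv); // multiplying by plain 1 strips the 2^64 factor
    assert_eq!(prod as u128, (a as u128 * b as u128) % (n as u128));
}
```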
I rewrote the multiplication as a compact loop and used the `unroll` crate to make sure it was unrolled. Doing this did not seem to affect performance at all vis-à-vis writing the unrolled loops out explicitly, so I suggest keeping this style, as it is much more readable and compact.
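For readers unfamiliar with the crate: its `#[unroll_for_loops]` attribute mechanically rewrites loops with literal bounds into straight-line code. The transformation is equivalent to the hand-written pair below (plain Rust with no external crate; the four-limb example is arbitrary, chosen only for illustration):

```rust
// The compact-loop style over four limbs...
fn sum_loop(limbs: &[u64; 4]) -> u64 {
    let mut acc = 0u64;
    for i in 0..4 {
        acc = acc.wrapping_add(limbs[i]);
    }
    acc
}

// ...and the straight-line code that loop unrolling produces mechanically.
fn sum_unrolled(limbs: &[u64; 4]) -> u64 {
    let mut acc = 0u64;
    acc = acc.wrapping_add(limbs[0]);
    acc = acc.wrapping_add(limbs[1]);
    acc = acc.wrapping_add(limbs[2]);
    acc = acc.wrapping_add(limbs[3]);
    acc
}

fn main() {
    let limbs = [1u64, 2, 3, u64::MAX];
    // Both forms compute the same result; the unrolled one avoids loop overhead.
    assert_eq!(sum_loop(&limbs), sum_unrolled(&limbs));
}
```

Since the bounds are literals, the compiler often unrolls such loops anyway, which matches the observation above that explicit unrolling made no measurable difference.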
With regards to G1 timings, the gap was narrowed by about half, but I should check back once squaring is optimised too.
My suggestions:
Thanks for the investigation @jon-chuang! If you could make a PR with the Goff changes and with the `unroll!` macro, that would be awesome!
Re: benchmarking it’s a slightly more involved process, as there’s a lot of code that would have to be changed :/
@Pratyush Actually, I don't think so. I will write some macros for benches for the algebra-benches crate. I will iterate by 1000 for everything except G1/G2 muls, pairing-related ops, and sqrt.
Not all of the curves have the same extension fields, I will figure out how I want to tackle this. I will split up the macros by type.
By the way, is it normal that one needs `+nightly` for `#![feature(test)]`?
I believe more accurate benchmarking is mission critical and also makes the Zexe library more attractive.
If you'd like to update the benchmarks anyway, you might want to use something like `criterion`.
This would also solve the issue with requiring nightly for benches.
However, Criterion is a lot slower and more verbose; we may not want it for minor benchmarks. It already takes ages to run the full benchmark as it is.
Although, I could try to find a way to tweak the sampling time. Further, I should look into whether there is similar functionality to `cargo benchcmp`.
I noticed a troubling phenomenon: G1 doubling seems to be about 5% slower after changing to no-carry multiplication. This seems to affect G1 scalar multiplication as well. @Pratyush, can you replicate this?
I believe this would help us diagnose why G1 ops are much slower than in gnark. Could it be something to do with inlining or memory usage (e.g. cache-friendliness)? It seems odd that even though our Fp ops are faster than or on par with goff's, and even though we both use the exact same formulas, the Go implementation would be faster than the Rust one. It's a bit of an enigma; I don't know if careful profiling will help.
I did a final careful benchmarking of the recent PR arkworks-rs/snark#163. My first recommendation is to try adding `RUSTFLAGS="--emit=asm"`, as this can improve speed significantly.
Here is a graphic summarising my results :) Regressions are largely noise from small functions.
I also find that Fq `mul_assign` is about 7-15% faster, `square_in_place` is about 5% faster, `into_repr` is about 5-12% faster, and pairing computations are about 1.5-6% faster.
The timings I got vis-à-vis goff/gnark are essentially the same as in my intermediate report. G1 is the only area where we are still lagging behind.
New avenues of optimisation I would like to look into are G1 `add_assign_mixed`, which I believe is part of the bottleneck for G1 `mul_assign`. However, the majority of the gap seems to come from the Pippenger exponentiation used in gnark, where they compute a Pippenger table beforehand. So this is something I will work on for now, in parallel with my intention to add WebGPU support for the parallel form of multiexponentiation.
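For reference, the bucket method at the heart of Pippenger-style multi-scalar multiplication can be sketched as below, using the additive group of integers mod a prime as a stand-in for G1 points (the window width, the `u32` scalars, and the demo prime are all arbitrary illustration choices, not what gnark or zexe actually use):

```rust
/// Toy Pippenger MSM: computes sum_i scalars[i] * points[i] in Z_p (additive).
fn pippenger(scalars: &[u32], points: &[u64], p: u64) -> u64 {
    let c = 4usize; // window width in bits (a tuning parameter)
    let windows = (32 + c - 1) / c;
    let mut window_sums = Vec::with_capacity(windows);
    for w in 0..windows {
        // One bucket per non-zero window value 1..2^c - 1.
        let mut buckets = vec![0u64; (1 << c) - 1];
        for (&s, &pt) in scalars.iter().zip(points.iter()) {
            let idx = ((s >> (w * c)) & ((1u32 << c) - 1)) as usize;
            if idx > 0 {
                buckets[idx - 1] = (buckets[idx - 1] + pt) % p;
            }
        }
        // Running-sum trick: computes sum_i (i+1) * buckets[i] with 2 passes
        // of additions instead of multiplications.
        let (mut running, mut acc) = (0u64, 0u64);
        for &b in buckets.iter().rev() {
            running = (running + b) % p;
            acc = (acc + running) % p;
        }
        window_sums.push(acc);
    }
    // Combine windows, most significant first: result = result * 2^c + window_sum.
    let mut result = 0u64;
    for &ws in window_sums.iter().rev() {
        for _ in 0..c {
            result = (result * 2) % p; // stands in for point doubling
        }
        result = (result + ws) % p;
    }
    result
}

fn main() {
    let p = 1_000_000_007u64; // demo prime
    let scalars = [3u32, 7, 255];
    let points = [10u64, 20, 30];
    let naive: u64 = scalars
        .iter()
        .zip(points.iter())
        .map(|(&s, &pt)| s as u64 * pt % p)
        .sum::<u64>() % p;
    assert_eq!(pippenger(&scalars, &points, p), naive);
}
```

The precomputed table gnark builds amortises the bucket setup across many MSMs with the same point set, which is why it pays off for provers.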
By the way, there are no benchmarks for MSM, but this is a very critical number for provers.
I think the issue with scalar multiplication should be gone now? From local benchmarks, `mixed_add`, `add`, and `double` are all faster.
Btw, the scalar-multiplication benchmarks are noisy because the scalars are random and have varying Hamming weight, so that's something to keep in mind.
@Pratyush nonetheless, the `scalar_mul` timings are still significantly slower (1.22x) than gnark's. Further, the `mixed_add` timings are 1.06x slower. This is a finding I repeatedly arrive at.
I also find that inverse is significantly slower than goff's (2.2x: 13us vs goff's 6us). Further, sqrt is also very slow (1.5x slower: 75/54us vs goff's ~25/54us). How performance-critical are these, @Pratyush?
I think I will run more detailed 1000-fold benchmarks to get to the bottom of which primitive operations (e.g. add/div2) are responsible. This could also be another benchmarking-difference issue in terms of the randomness; perhaps I should check whether I'm looping through random elements.
Hmm, inversion isn't strictly performance-critical, because the elliptic-curve code tries to minimize it. The sqrt code, however, would be helpful to optimize, as it is critical for point compression.
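For context on why sqrt matters for point compression: when p ≡ 3 (mod 4), which holds for BLS12-381's base field, the square root is a single exponentiation, so its cost is essentially ~log p field multiplications. A toy version over a small prime (the prime and function names are illustration choices):

```rust
/// Modular exponentiation by repeated squaring.
fn pow_mod(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc = 1u64;
    base %= p;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = (acc as u128 * base as u128 % p as u128) as u64;
        }
        base = (base as u128 * base as u128 % p as u128) as u64;
        exp >>= 1;
    }
    acc
}

/// For p ≡ 3 (mod 4), a^((p+1)/4) is a square root of a whenever one exists.
fn sqrt_mod(a: u64, p: u64) -> Option<u64> {
    debug_assert_eq!(p % 4, 3);
    let r = pow_mod(a, (p + 1) / 4, p);
    // Verify the candidate: quadratic non-residues have no root.
    if (r as u128 * r as u128 % p as u128) as u64 == a % p {
        Some(r)
    } else {
        None
    }
}

fn main() {
    let p = 1_000_000_007u64; // demo prime with p % 4 == 3
    let r = sqrt_mod(1234 * 1234 % p, p).unwrap();
    assert!(r == 1234 || r == p - 1234); // either root is valid
}
```

Point decompression runs exactly one such sqrt per point, which is why its speed matters there.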
(That being said, having faster algorithms is always welcomed =D)
Also, thanks again for the thorough benchmarks!
I see. Is there any scenario in which batch inversion actually helps performance? Perhaps for something like pippenger-scalar mul/MSM?
I am thinking of affine coordinates + batch inversion.
Currently our MSM code takes in affine coordinates and uses mixed addition.
How about batch inversion in particular?
Hmm since batch_inversion performs O(N) multiplications but only one inversion, the cost of a single inversion is usually drowned out for large N. However, using batch inversion with a small N might make the additional cost of inversion significant again.
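For concreteness, Montgomery's batch-inversion trick trades N inversions for one inversion plus 3(N-1) multiplications. A toy sketch over a small prime (a real implementation would operate on the library's Fp type; the helper names here are made up for illustration):

```rust
const P: u64 = 1_000_000_007; // demo prime

fn mul(a: u64, b: u64) -> u64 {
    (a as u128 * b as u128 % P as u128) as u64
}

/// Single inversion via Fermat's little theorem: a^(p-2) mod p.
fn inv(a: u64) -> u64 {
    let (mut base, mut exp, mut acc) = (a, P - 2, 1u64);
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Montgomery's trick: invert every (non-zero) element in place using
/// one field inversion plus O(N) multiplications.
fn batch_inverse(v: &mut [u64]) {
    if v.is_empty() {
        return;
    }
    // prefix[i] = v[0] * v[1] * ... * v[i]
    let mut prefix = Vec::with_capacity(v.len());
    let mut acc = 1u64;
    for &x in v.iter() {
        acc = mul(acc, x);
        prefix.push(acc);
    }
    // The only inversion: invert the total product.
    let mut inv_acc = inv(acc);
    // Walk backwards, peeling one element off the product at a time.
    for i in (1..v.len()).rev() {
        let xi = v[i];
        v[i] = mul(inv_acc, prefix[i - 1]); // = (v[0]...v[i])^{-1} * (v[0]...v[i-1]) = v[i]^{-1}
        inv_acc = mul(inv_acc, xi);
    }
    v[0] = inv_acc;
}

fn main() {
    let mut v = [2u64, 3, 5];
    batch_inverse(&mut v);
    assert_eq!(mul(v[0], 2), 1);
    assert_eq!(mul(v[1], 3), 1);
    assert_eq!(mul(v[2], 5), 1);
}
```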
I see. But where is batch inversion used? You mentioned it is used for mixed addition? Or is that standard addition?
> I also find that inverse is significantly slower than goff (2.2x 13us vs goff's 6us). Further, sqrt is also very slow (1.5x slower, 75/54us vs goff's ~25/54us). How performance critical are these @Pratyush?
Goff's inversion is not constant-time. If constant time is desired, you cannot compare it with the usual inversion based on Fermat's little theorem.
Also, as mentioned by @Pratyush, for pairing in an Fp12 tower, the tower setup passes the inversion down the tower, so the ratio compared to an Fp operation stays acceptable, and we can use Jacobian/projective coordinates to avoid the inversion tax (at the price of more additions/multiplications).
More discussion on fast constant-time modular inversion here: https://github.com/randombit/botan/issues/1479#issuecomment-602178718
Also, AFAIK the fastest open-source pairing library is MCL, provided your CPU is Broadwell or later and supports the MULX/ADOX/ADCX instructions. The gain is about 15% on my machine; see benchmarks at https://github.com/mratsim/constantine/pull/18. (Note that in those benchmarks I was still using inversion via Fermat's-little-theorem exponentiation; my new constant-time inversion algorithm is about 30% faster, though still 10x slower than Goff's non-constant-time version or GMP.)
Since we are on par with your library, that would make us about 22-30 clock cycles behind MCL. That's roughly 25% extra overhead. Remarkable.
@Pratyush, given this amazing result, I may find some time to read the MCL code and implement the ADX instructions in a maintainable way, since x86 intrinsics do not support them due to lack of support in LLVM.
Hmm, I'm pretty sure we don't have a constant-time inversion algorithm; we use the binary Euclidean algorithm, which isn't constant-time: https://github.com/scipr-lab/zexe/blob/master/algebra-core/src/fields/macros.rs
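For reference, the variable-time binary Euclidean (binary extended GCD) inversion has the following shape on a single 64-bit word; the macro linked above runs the same idea over multi-limb bigints. This is an illustrative sketch, not zexe's code:

```rust
/// Variable-time inversion of a mod odd prime p via binary extended GCD.
/// Invariants maintained: b * a ≡ u (mod p) and c * a ≡ v (mod p).
fn binary_inverse(a: u64, p: u64) -> u64 {
    assert!(a != 0 && a < p && p % 2 == 1);
    let (mut u, mut v) = (a, p);
    let (mut b, mut c) = (1u64, 0u64);
    while u != 1 && v != 1 {
        while u & 1 == 0 {
            u >>= 1;
            // Halve b mod p: if b is odd, use (b + p) / 2, computed without overflow.
            b = if b & 1 == 0 { b >> 1 } else { (b >> 1) + (p >> 1) + 1 };
        }
        while v & 1 == 0 {
            v >>= 1;
            c = if c & 1 == 0 { c >> 1 } else { (c >> 1) + (p >> 1) + 1 };
        }
        // Subtract the smaller from the larger; mirror on the cofactors mod p.
        if u >= v {
            u -= v;
            b = if b >= c { b - c } else { b + p - c };
        } else {
            v -= u;
            c = if c >= b { c - b } else { c + p - b };
        }
    }
    if u == 1 { b } else { c }
}

fn main() {
    let p = 1_000_000_007u64; // demo prime
    let x = binary_inverse(123_456, p);
    assert_eq!(x as u128 * 123_456u128 % p as u128, 1);
}
```

The data-dependent branching and shift counts are exactly why this is fast but not constant-time.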
You can find Filecoin's new implementation using assembly-based optimizations at https://github.com/filecoin-project/blstrs; it would be interesting to see how this compares.
Numbers on my AMD Ryzen for blstrs:
Comparison: `curve-benches` bls12-381 from arkworks
Wow blstrs' fp12 multiplications are significantly faster! (The fp multiplication times are the same in both, but the fp12 multiplication in blstrs is half the time)
One of the things blst does is take advantage of #121 within its quadratic-extension arithmetic (see: https://github.com/supranational/blst/blob/master/src/fp12_tower.c#L309). We may need to actually return an unreduced form in #121 to make these additions into it worthwhile, though.
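The lazy-reduction idea in miniature: in an Fp2 multiplication with u^2 = -1, accumulate the double-width cross products unreduced and reduce each output coefficient only once. A toy sketch over a 31-bit demo prime (blst/MCL do this at full 381-bit width in assembly; everything here is an illustration):

```rust
const P: u64 = 2_147_483_647; // 2^31 - 1, demo prime small enough to avoid overflow

/// Eager reference: reduce after every product.
fn fp2_mul_eager(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let m = |x: u64, y: u64| (x as u128 * y as u128 % P as u128) as u64;
    let c0 = (m(a.0, b.0) + P - m(a.1, b.1)) % P; // a0*b0 - a1*b1
    let c1 = (m(a.0, b.1) + m(a.1, b.0)) % P;     // a0*b1 + a1*b0
    (c0, c1)
}

/// Lazy variant: keep double-width products unreduced and perform only one
/// modular reduction per output coefficient.
fn fp2_mul_lazy(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let (a0, a1, b0, b1) = (a.0 as u128, a.1 as u128, b.0 as u128, b.1 as u128);
    let pp = P as u128;
    // Offset the subtraction by p^2 (≡ 0 mod p) so the accumulator stays non-negative.
    let c0 = (a0 * b0 + pp * pp - a1 * b1) % pp;
    let c1 = (a0 * b1 + a1 * b0) % pp;
    (c0 as u64, c1 as u64)
}

fn main() {
    let (a, b) = ((3u64, 4u64), (5u64, 6u64));
    assert_eq!(fp2_mul_lazy(a, b), fp2_mul_eager(a, b));
}
```

At full width the savings come from skipping the expensive Montgomery reductions on intermediate sums, which is exactly what returning an unreduced form from #121 would enable.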
The latest cross-pairing libraries benchmarks were gathered by @yelhousni in https://hackmd.io/@zkteam/eccbench
I also noted all optimizations I am aware of for pairing libraries here: https://github.com/mratsim/constantine/blob/2c5e12d/docs/optimizations.md
@ValarDragon, BLST, MCL and Gurvy do indeed use unreduced multiplication in the higher fields of the towering.
The technique is described in the following paper by @herumi (the MCL author), and is generalized to all extension fields by @dfaranha (the Relic author).
Incredible how much faster pairing implementations have gotten in the past few years. Keep up the good work, you all! :)
@mratsim I think these are really great options to look into as our field towers are very slow vis a vis other libraries, even though we have improved on the Fp side. Thanks!
Hi all, I would like to share some benchmarks I ran on zexe, gnark/goff, and zkcrypto's BLS12-381. I believe this is a good place to start to identify the key avenues for optimisation.
From these results, it seems that the major areas for optimisation are:
More benchmarking should be done once Fq has been sufficiently optimised.
Yet to benchmark: zk-swap-libff/libff, barretenberg.
Results (BLS12-381):
- Fq: goff vs. Zexe
- G1: gnark vs. Zexe vs. zkcrypto
- Pairing: Zexe vs. zkcrypto vs. gnark