Consensys / goff

goff (go finite field) is a unix-like tool that generates fast field arithmetic in Go.
https://hackmd.io/@zkteam/goff
Apache License 2.0
76 stars 12 forks

Benchmarks #12

Closed jon-chuang closed 3 years ago

jon-chuang commented 4 years ago

Hi, I would like to invite the goff team to update their benchmark timings against zexe's new macro-based FF implementation, which uses goff's no-carry optimisation. A relevant issue is here: https://github.com/scipr-lab/zexe/issues/162.

My benchmarks show that zexe's squaring implementation is faster than goff's (42.5ns vs 46.5ns) and on par for multiplication (50.2ns vs 50.3ns for goff) on my machine.

To verify, please compile my fork with the appropriately modified benchmarks in the algebra-benches folder, using RUSTFLAGS="--emit=asm" cargo +nightly bench.

jon-chuang commented 4 years ago

By the way, the timings for Barretenberg look wrong. The last time I checked, it was running at 22ns mul and 20ns squaring. Ours look like 25.5ns and 22.5ns.

jon-chuang commented 4 years ago

Also, @mratsim brought up MCL, which is frighteningly fast. Once Barretenberg's benchmarking is back up to speed, it would be nice to do a proper benchmark and set the records straight. The next step for us is to use macros to generate ASM strings with ADX (mulx, adcx, adox) instructions for Broadwell and newer CPUs.

gbotrel commented 4 years ago

Thanks for that; will add it to our backlog. When time permits, I was hoping to create a separate repo with proper Docker images for each of these libraries' benchmarks -- not high priority on our side for now, though. I suspect few projects in the space are running benchmarks under varied conditions, so it would be nice to have a "common benchmarking context".

jon-chuang commented 4 years ago

I see. I would have loved to use intrinsics for ADX/MULX, but unfortunately LLVM, and hence rustc, doesn't emit them, so we have to resort to manipulating strings. Out of curiosity, is there intrinsics support in the Go language/compiler?

gbotrel commented 4 years ago

Yes, and that's why we haven't needed to write assembly so far: https://dave.cheney.net/2019/08/20/go-compiler-intrinsics .

However, the current intrinsics don't generate ADX/MULX, so to go that route (or the crazier VPMADD52LUQ) we will likely resort to assembly for the Mul.

jon-chuang commented 4 years ago

VPMADD52LUQ (IFMA) is for unsaturated representations, isn't it? Are there any benchmarks showing that they are more performant for general primes? I know they are good for special primes (pseudo-Mersenne/generalised Mersenne), but as far as I know no such primes are used in pairing curves.

mratsim commented 4 years ago

VPMADD52LUQ (IFMA) is for unsaturated representations, isn't it? Are there any benchmarks showing that they are more performant for general primes? I know they are good for special primes (pseudo-Mersenne/generalised Mersenne), but as far as I know no such primes are used in pairing curves.

Yes, they are for unsaturated representations.

The best would be to benchmark Milagro/AMCL/MIRACL

I can provide benchmarks for BLS12-381 specifically, since I have to create some anyway and it is what we are using as a backend for Eth2 at https://github.com/status-im/nim-blscurve

I explored using lazy reduction and lazy carries for BN254 and BLS12-381, but ultimately I believe it's a dead end; I explain my analysis in depth here: https://github.com/mratsim/constantine/issues/15#issuecomment-602036864

Otherwise, I believe there is enough interest and parallel effort that we could maybe have a common Discord/Matrix/Telegram chat for all implementers of optimized pairing libraries, and maybe, as a first outcome, a shootout in the style of:

jon-chuang commented 4 years ago

Hi @mratsim, this seems like a great idea. Would you be willing to set up the benchmarking infrastructure? It seems there are some issues to work around with the different languages and benchmarking suites, however. I was also thinking that we could pool some resources, like generated assembly, and benchmark those on different platforms.

Does Milagro use the unsaturated representation for general primes? I have not seen it used that way. Can an unsaturated representation perform Montgomery multiplication for general primes?

I am starting to think that it could be much better to copy some hand-written assembly than to generate/write one's own: generation is rather complex and error-prone, with many difficult tasks such as data movement and register allocation. One can then do a bit of cleanup. Another option is to get the GCC compiler to emit the necessary intrinsics for us.

gbotrel commented 4 years ago

@mratsim thanks for the links, very informative! I'm wondering if using a 2^52 radix representation would make sense for primes larger than BLS12-381's (for example MNT4/6); my hunch is that using VPMADD52LUQ may be worth it there. That being said, you did discourage me from investigating it more, and we have other priorities (changing the radix would impact too many parts of our project anyway).

gbotrel commented 4 years ago

@jon-chuang copying existing hand-written assembly is not an option for goff: first, Go uses Plan 9 assembly; second, goff is meant to generate code for arbitrary moduli. Here is the (unformatted) bls377 with ADX and MULX if you want to have a look: https://github.com/ConsenSys/goff/blob/asm/examples/bls377/element_mul_amd64.s

Will merge soon; some cleaning up to do: https://github.com/ConsenSys/goff/pull/13

jon-chuang commented 4 years ago

@gbotrel That looks good! If one is looking for slightly more flexibility, it seems one could generate such assembly with the moduli inlined in movs (saving one register and also memory fetches) by passing the moduli to a macro.

However, generalizing to an arbitrary number of limbs is probably not possible, because the register usage is different for each case. For higher limb counts, my guess is the compiler would start using pushq and popq to save some values, or perhaps just movq to spill values to memory. I notice that in the second inner loop you are already using 2 movs in each block.

I think an interesting project could be to extend fiat crypto's approach to assembly.

May I know how you created the assembly? Was it hand-written?

Also, did you get any timings for that piece of assembly code? Would you mind if I adapted it just to benchmark the speed?

jon-chuang commented 4 years ago

Another option, of course, is to movq the moduli from a memory location, at the cost of one register. I wonder whether that would still perform well compared to using inlined constants.

gbotrel commented 4 years ago

@jon-chuang #13 is now merged. You can benchmark in master:

cd examples/bls377/
go test -run=NONE -bench=ASM -count=3 -benchtime=1s 

As mentioned in the PR, on my machine it is faster than mcl.

mratsim commented 4 years ago

Hi @mratsim, this seems like a great idea. Would you be willing to set up the benchmarking infrastructure? It seems there are some issues to work around with the different languages and benchmarking suites, however. I was also thinking that we could pool some resources, like generated assembly, and benchmark those on different platforms.

I'll have a look over the weekend. I remember struggling quite a bit to set up the benchmarks when trying to benchmark Milagro/MIRACL, RELIC, and Barretenberg: the first two because you need to generate C code and then compile the generated code, the last one because the benchmark is hidden quite deep and mixes google/benchmark reports and manual cycle counting in the same stdout.

In any case, I'll set up an organization and a repo, and vendor some of the libraries I found interesting, so that we can get started.

In terms of numbers, I think each library should ideally report throughput (ops/s) and time (ns/op), and, as a stretch goal, cycle count on x86.

I merged a benchmark PR on our Milagro codebase used for Eth2 yesterday, and the report looks like this: https://github.com/status-im/nim-blscurve/pull/46

$  nim c -d:danger --verbosity:0 --hints:off --warnings:off --outdir:build -r benchmarks/bench_all.nim 
Warmup: 0.9025 s, result 224 (displayed to avoid compiler optimizing warmup away)

⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

Scalar multiplication G1                                                2412.667 ops/s        414479 ns/op       1243455 cycles
Scalar multiplication G2                                                 889.198 ops/s       1124609 ns/op       3373869 cycles
EC add G1                                                             912408.759 ops/s          1096 ns/op          3290 cycles
EC add G2                                                             323206.206 ops/s          3094 ns/op          9284 cycles
Pairing (Milagro builtin double pairing)                                 415.093 ops/s       2409099 ns/op       7227390 cycles
Pairing (Multi-Pairing with delayed Miller and Exp)                      417.191 ops/s       2396982 ns/op       7191039 cycles

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
           This is an outdated draft.

Hash to G2 (Draft #5)                                                    941.698 ops/s       1061912 ns/op       3185779 cycles

FYI, mcl is 8x faster on Scalar multiplication on G2 and 3x faster on Pairing: https://github.com/status-im/nim-blscurve/issues/47

jon-chuang commented 4 years ago

@mratsim awesome! I think the key contenders are MCL and Barretenberg; I am not sure why Milagro is still in the picture, as it doesn't seem to be under active development.

mratsim commented 4 years ago

It's interesting for me and many Ethereum2 implementers because everyone used Milagro at one point.

kilic commented 4 years ago

Hi all, I have managed to generate ADX and non-ADX backends for an arbitrary modulus up to 1024 bits and got great benchmark results. Please take a look: https://github.com/kilic/fp. I am also considering adding the new multiplication trick used in goff.

jon-chuang commented 4 years ago

@kilic That's great work! I have a lot to learn from your project. I benchmarked fp on my 3.6 GHz machine and the timings beat goff's.

BenchmarkField/256_mul-8            59775574            18.7 ns/op
BenchmarkField/384_mul-8            30807256            34.9 ns/op

jon-chuang commented 4 years ago

@kilic I've had good results with ASM for limbs <= 7 with the no-carry trick: https://github.com/scipr-lab/zexe/pull/176. I am now working to improve my data movement for more than 7 limbs :) Stay tuned.