By the way, the timings for Barretenberg look wrong. The last time I checked, it was running at 22ns mul and 20ns squaring. Ours look like 25.5ns and 22.5ns.
Also, @mratsim brought up MCL, which is frighteningly fast. Once Barretenberg's benchmarking is back up and running, it would be nice to do a proper benchmark and set the records straight. The next step for us is to use macros to generate ASM strings with ADX instructions (mulx, adcx, adox) for Broadwell and newer CPUs.
Thanks for that; will add it to our backlog. When time permits, I was hoping to create a separate repo with proper Docker images for each of these libraries' benchmarks, though that's not high priority on our side for now. I suspect the projects in this space are all running their benchmarks under different conditions; it would be nice to have a "common benchmarking context".
I see. I would have loved to use intrinsics for ADX/MULX, but unfortunately LLVM, and hence rustc, doesn't emit them, so we have to resort to manipulating strings. Out of curiosity, is there intrinsics support in the Go language/compiler?
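To make the point concrete, here is a minimal sketch (not any library's code) of what the intrinsics route would look like for one row of a schoolbook multiply, acc += a * b. Today LLVM lowers `_addcarryx_u64` to a plain `adc`, so the two independent ADCX/ADOX carry chains this is meant to produce never materialize:

```rust
// One row of a multi-limb multiply-accumulate via MULX plus two carry chains.
// Illustrative only; in practice LLVM does not emit ADCX/ADOX from this.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2,adx")]
unsafe fn mac_row(acc: &mut [u64; 5], a: &[u64; 4], b: u64) {
    use core::arch::x86_64::{_addcarryx_u64, _mulx_u64};
    let mut c1: u8 = 0; // would ideally map to the ADCX (CF) chain
    let mut c2: u8 = 0; // would ideally map to the ADOX (OF) chain
    for i in 0..4 {
        let mut hi = 0u64;
        let lo = _mulx_u64(a[i], b, &mut hi); // MULX: 64x64 -> 128, flags untouched
        c1 = _addcarryx_u64(c1, acc[i], lo, &mut acc[i]); // chain 1: low halves
        c2 = _addcarryx_u64(c2, acc[i + 1], hi, &mut acc[i + 1]); // chain 2: high halves
    }
    // Fold the remaining chain-1 carry into the top limb; this cannot overflow
    // as long as the result fits in 5 limbs (e.g. acc starts at zero).
    let _ = _addcarryx_u64(c1, acc[4], 0, &mut acc[4]);
}
```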
Yes, and that's why we didn't need to write assembly so far: https://dave.cheney.net/2019/08/20/go-compiler-intrinsics .
However, the current intrinsics don't generate ADX/MULX, so to go that route (or the crazier VPMADD52LUQ) we will likely resort to assembly for the Mul.
VPMADD52LUQ (IFMA) is for an unsaturated representation, isn't it? Are there any benchmarks showing it is more performant for general primes? I know these instructions are good for special primes (pseudo-Mersenne/generalized Mersenne), but as far as I know no such primes are used in pairing curves.
Yes, they are for unsaturated representations.
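For concreteness, a rough illustration (mine, not from any of these libraries) of the unsaturated radix-2^52 layout that IFMA targets; the helper below is hypothetical:

```rust
// Each 64-bit limb holds only 52 significant bits, leaving 12 bits of
// headroom so carries between limbs can be deferred across several additions.
const MASK52: u64 = (1u64 << 52) - 1;

/// Split 4 saturated 64-bit limbs (a 256-bit value) into 5 radix-2^52 limbs.
fn to_radix52(a: [u64; 4]) -> [u64; 5] {
    [
        a[0] & MASK52,
        ((a[0] >> 52) | (a[1] << 12)) & MASK52,
        ((a[1] >> 40) | (a[2] << 24)) & MASK52,
        ((a[2] >> 28) | (a[3] << 36)) & MASK52,
        a[3] >> 16, // top limb: 48 bits
    ]
}
```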
The best would be to benchmark Milagro/AMCL/MIRACL
I can provide benchmarks for BLS12-381 specifically, since I have to create some anyway and it is what we are using as a backend for Eth2 at https://github.com/status-im/nim-blscurve
I explored using lazy reduction and lazy carry for BN254 and BLS12-381 but ultimately I believe it's a dead-end, I explain my analysis in-depth here: https://github.com/mratsim/constantine/issues/15#issuecomment-602036864
Otherwise, I believe there is enough interest and parallel effort that a common Discord/Matrix/Telegram chat for all implementers of optimized pairing libraries could make sense, and perhaps, as a first outcome, a shootout in the style of:
Hi @mratsim, this seems like a great idea. Would you be willing to setup the benchmarking infrastructure? It seems like there are some issues getting around the different languages and benchmarking suites however. I was also thinking that we can pool some resources like generated assembly and benchmark those on different platforms.
Does Milagro use the unsaturated representation for general primes? I have not seen it used that way. Can an unsaturated representation perform Montgomery multiplication for general primes?
I am starting to think that it could be much better to copy some hand-written assembly than to generate or write one's own: it is rather complex and error-prone, with many difficult things to get right, like data movement and register allocation. One could then do a bit of cleanup. Another option is to get the GCC compiler to emit the necessary instructions from intrinsics for us.
@mratsim thanks for the links, very informative! I'm wondering if using a 2^52-radix representation would make sense for primes larger than BLS12-381's (for example MNT4/6); my hunch is that VPMADD52LUQ may be worth it there. That being said, you did discourage me from investigating it further, and there are other priorities (changing the radix would impact too many parts of our project anyway).
@jon-chuang copying existing hand-written assembly is not an option for `goff`: first, Golang uses Plan 9 assembly, and also, `goff` is meant to generate code for arbitrary moduli.
Here is the (unformatted) bls377 with ADX and MULX if you want to have a look: https://github.com/ConsenSys/goff/blob/asm/examples/bls377/element_mul_amd64.s
Will soon merge, some cleaning up to do: https://github.com/ConsenSys/goff/pull/13
@gbotrel That looks good! It seems one could generate such assembly with the moduli inlined as immediates in movs (saving one register and also memory fetches) by passing the moduli to a macro, if one is looking for slightly more flexibility.
However, generalizing to an arbitrary number of limbs is probably not possible, because the registers are used differently in each case. For higher limb counts, my guess is the generated code would have to start using pushq and popq to spill some values, or perhaps just movq them to memory. I notice that in the second inner loop you are already using 2 movs in each block.
I think an interesting project could be to extend fiat-crypto's approach to assembly.
May I know how you created the assembly? Was it hand-written?
Also, did you get any timings for that piece of assembly code? Would you mind if I adapted it just to benchmark the speed?
Another option, of course, is to movq the moduli from a memory location, at the cost of one register. I wonder whether that could still perform well compared to using constants.
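To illustrate the trade-off, a hypothetical sketch (Rust inline asm, not goff's generated Plan 9 assembly) of the two ways a modulus limb can reach a register:

```rust
use std::arch::asm;

#[cfg(target_arch = "x86_64")]
unsafe fn q0_as_immediate() -> u64 {
    let out: u64;
    // Limb baked in as an immediate (here, the low limb of the BLS12-377
    // base-field modulus, for illustration): no memory traffic and no register
    // pinned on a modulus pointer, at the cost of a long movabs encoding per use.
    asm!("mov {r}, 0x8508c00000000001", r = out(reg) out);
    out
}

#[cfg(target_arch = "x86_64")]
unsafe fn q0_from_memory(q: *const u64) -> u64 {
    let out: u64;
    // Limb loaded on demand: one base register is spent on the pointer and
    // each use costs a load, but the instruction encoding stays short.
    asm!("mov {r}, qword ptr [{p}]", r = out(reg) out, p = in(reg) q);
    out
}
```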
@jon-chuang #13 is now merged. You can benchmark in master:

```
cd examples/bls377/
go test -run=NONE -bench=ASM -count=3 -benchtime=1s
```
as mentioned in the PR, on my machine, it is faster than mcl
> Hi @mratsim, this seems like a great idea. Would you be willing to setup the benchmarking infrastructure? It seems like there are some issues getting around the different languages and benchmarking suites however. I was also thinking that we can pool some resources like generated assembly and benchmark those on different platforms.
I'll have a look over the weekend. I remember struggling quite a bit to set up the benchmarks when I tried Milagro/MIRACL, RELIC and Barretenberg: the first two because you need to generate C code and then compile the generated code, the last one because the benchmark is hidden quite deep and mixes Google Benchmark reports with manual cycle counting in the same stdout.
In any case, I'll set up an organization and a repo, and vendor some of the libraries I found interesting so that we can get started.
In terms of numbers, I think each library should ideally report throughput (ops/s) and time (ns/op), and, as a stretch goal, cycle counts on x86.
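As an aside, a minimal sketch of what the cycle-count measurement can look like on x86, using RDTSC, which ticks at the nominal clock, hence the Turbo-Boost caveat in the report below; the function name and structure are illustrative, not any library's API:

```rust
// Approximate cycles per operation via the timestamp counter.
#[cfg(target_arch = "x86_64")]
fn cycles_per_op<F: FnMut()>(mut op: F, iters: u64) -> u64 {
    use core::arch::x86_64::_rdtsc;
    let start = unsafe { _rdtsc() };
    for _ in 0..iters {
        op();
    }
    let end = unsafe { _rdtsc() };
    (end - start) / iters
}
```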
I merged a benchmark PR on our Milagro codebase used for Eth2 yesterday, and the report looks like this: https://github.com/status-im/nim-blscurve/pull/46
```
$ nim c -d:danger --verbosity:0 --hints:off --warnings:off --outdir:build -r benchmarks/bench_all.nim

Warmup: 0.9025 s, result 224 (displayed to avoid compiler optimizing warmup away)

⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================
Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

Scalar multiplication G1                                  2412.667 ops/s     414479 ns/op    1243455 cycles
Scalar multiplication G2                                   889.198 ops/s    1124609 ns/op    3373869 cycles
EC add G1                                               912408.759 ops/s       1096 ns/op       3290 cycles
EC add G2                                               323206.206 ops/s       3094 ns/op       9284 cycles
Pairing (Milagro builtin double pairing)                   415.093 ops/s    2409099 ns/op    7227390 cycles
Pairing (Multi-Pairing with delayed Miller and Exp)        417.191 ops/s    2396982 ns/op    7191039 cycles

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
This is an outdated draft.
Hash to G2 (Draft #5)                                      941.698 ops/s    1061912 ns/op    3185779 cycles
```
FYI, mcl is 8x faster on Scalar multiplication on G2 and 3x faster on Pairing: https://github.com/status-im/nim-blscurve/issues/47
@mratsim awesome! I think the key contenders are MCL and Barretenberg; I am not sure why Milagro is still in the picture, as it doesn't seem to be under active development.
It's interesting for me and many Ethereum2 implementers because everyone used Milagro at one point.
Hi all, I have managed to generate ADX and non-ADX backends for an arbitrary modulus up to 1024 bits, with great benchmark results. Please take a look: https://github.com/kilic/fp. I am also considering adding the new multiplication trick used in goff.
@kilic That's great work! I have a lot to learn from your project. I benchmarked fp on my 3.6 GHz machine and the timings beat goff:
```
BenchmarkField/256_mul-8    59775574    18.7 ns/op
BenchmarkField/384_mul-8    30807256    34.9 ns/op
```
@kilic I've had good results with ASM for limbs <= 7 with the no-carry trick: https://github.com/scipr-lab/zexe/pull/176 I am now working to improve my data movement for more than 7 limbs :) Stay tuned.
Hi, I would like to invite the goff team to update their benchmark timings against zexe's new macro-based finite-field implementation. We used goff's no-carry optimisation. A relevant issue is here: https://github.com/scipr-lab/zexe/issues/162.
My benchmarks show that zexe's original squaring implementation is faster (42.5ns vs 46.5ns) than goff's and on par for multiplication (50.2ns vs 50.3ns for goff on my machine).
To verify, please compile my fork, with the appropriately modified benchmarks, in the `algebra-benches` folder with `RUSTFLAGS="--emit=asm" cargo +nightly bench`.
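For readers following the no-carry discussion, here is a rough sketch of that multiplication in portable Rust with u128 arithmetic, rather than the generated assembly goff or zexe actually ship. It assumes (roughly) a 4-limb modulus whose top bit is unset and whose remaining bits are not all set, which is the condition that lets the extra carry word of standard CIOS be dropped; the function and its signature are illustrative, not either library's API:

```rust
// CIOS Montgomery multiplication with the "no carry" optimization.
// `qinv` is -q^-1 mod 2^64; the result is x*y*2^-256 mod q.
const N: usize = 4;

fn mont_mul_no_carry(x: &[u64; N], y: &[u64; N], q: &[u64; N], qinv: u64) -> [u64; N] {
    let mut t = [0u64; N];
    for i in 0..N {
        // t += x[i] * y, interleaved with one round of Montgomery reduction.
        let v = t[0] as u128 + x[i] as u128 * y[0] as u128;
        let mut a = (v >> 64) as u64; // carry of the multiplication chain
        t[0] = v as u64;
        let m = t[0].wrapping_mul(qinv);
        let w = t[0] as u128 + m as u128 * q[0] as u128;
        let mut c = (w >> 64) as u64; // carry of the reduction chain (low word is 0)
        for j in 1..N {
            let v = t[j] as u128 + x[i] as u128 * y[j] as u128 + a as u128;
            a = (v >> 64) as u64;
            t[j] = v as u64;
            let w = t[j] as u128 + m as u128 * q[j] as u128 + c as u128;
            c = (w >> 64) as u64;
            t[j - 1] = w as u64;
        }
        t[N - 1] = a.wrapping_add(c); // cannot overflow under the modulus condition
    }
    // Final conditional subtraction: return t - q if t >= q, else t.
    let mut r = [0u64; N];
    let mut borrow: i128 = 0;
    for j in 0..N {
        let d = t[j] as i128 - q[j] as i128 + borrow;
        r[j] = d as u64;
        borrow = if d < 0 { -1 } else { 0 };
    }
    if borrow == 0 { r } else { t }
}
```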