real-or-random opened 1 year ago
Hi everyone, it took me a bit, but I now have some numbers.
I've used these nine machines:

CPU | 𝜇-architecture |
---|---|
Intel Core i7-6770HQ | Skylake-H |
Intel Core i7-10710U | Comet Lake-U |
Intel Core i9-10900K | Comet Lake-S |
Intel Core i7-11700KF | Rocket Lake-S |
Intel Core i9-13900KF | Raptor Lake-S |
AMD Ryzen Threadripper 1900X | Zen 1 |
AMD Ryzen 7 5800X | Zen 3 |
AMD Ryzen 9 5950X | Zen 3 |
AMD Ryzen 9 7950X | Zen 4 |
I've created a little project here, which takes this library as the base.

- For `default_asm`, it compiles with `./configure --with-asm=x86_64` using the current asm implementation.
- For `default_c`, it compiles with `./configure --with-asm=no` using the current C implementation.
- For `default_c52`, it replaces the C implementation with the version before PR #810, then compiles with `./configure --with-asm=no`. I've included that to compare with `fiat_c`.
- For `fiat_c`, it replaces the C implementation with the Fiat Cryptography implementation (which does not feature the #810 optimization).
- For `fiat_cryptopt`, it uses Fiat as a starting point, then optimizes mul and square on all those machines; I've included the 'on average best' one.
So on average, we see that for `ecmult_gen`, `fiat_c` is more performant than `default_asm`, and `fiat_cryptopt` is even faster than that and than the current `default_c`.
Numbers for the `secp256k1` benchmarks `./bench_internal simple` and `./bench_ecmult` (I've commented out the `multi_` benches so they finish in a timely manner):
implementation | default_asm | default_c | default_c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 24.6435 | 23.2049 | 24.7301 | 24.1802 | 22.8905 |
ecmult_const | 46.798 | 43.8816 | 47.4836 | 45.7434 | 42.6387 |
ecmult_1p | 37.0311 | 34.1909 | 37.06 | 35.7762 | 33.1715 |
ecmult_0p_g | 25.3775 | 23.7174 | 26.1944 | 25.0963 | 23.3779 |
ecmult_1p_g | 21.6629 | 19.9417 | 21.5685 | 20.8703 | 19.412 |
field_half | 0.00333343 | 0.00333149 | 0.00334944 | 0.00342343 | 0.0033374 |
field_normalize | 0.0112975 | 0.0113496 | 0.0113459 | 0.0112799 | 0.0112925 |
field_normalize_weak | 0.00447881 | 0.00451408 | 0.00449407 | 0.00450251 | 0.00453795 |
field_sqr | 0.0215233 | 0.0194716 | 0.0213214 | 0.0189657 | 0.0190589 |
field_mul | 0.0258206 | 0.0219956 | 0.0244374 | 0.0240141 | 0.0226145 |
field_inverse | 2.13491 | 2.13158 | 2.13192 | 2.13569 | 2.14261 |
field_inverse_var | 1.39323 | 1.39281 | 1.3899 | 1.39515 | 1.39883 |
field_is_square_var | 1.82198 | 1.83353 | 1.84074 | 1.83197 | 1.82702 |
field_sqrt | 5.94338 | 5.4591 | 5.88598 | 5.29186 | 5.28916 |
In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I've included implementations with the same mul/square. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below), I've included one particular implementation which I ran on all machines. The numbers are cycles for four scalar multiplications (two base-point, two variable-point). We see that on the geometric mean (G.M.), `fiat_cryptopt` is again the most performant one.
Implementation | Lang. | 1900X | 5800X | 5950X | 7950X | i7 6G | i7 10G | i9 10G | i7 11G | i9 13G | G.M. |
---|---|---|---|---|---|---|---|---|---|---|---|
(default_asm) libsecp256k1 [Bitcoin Core 2021] | asm | 720k (1.07x) | 567k (1.05x) | 567k (1.05x) | 560k (1.06x) | 565k (1.06x) | 564k (1.06x) | 564k (1.06x) | 539k (1.08x) | 439k (1.06x) | 561k (1.06x) |
libsecp256k1+Best (Opt) | asm | 712k (1.06x) | 545k (1.01x) | 545k (1.01x) | 543k (1.02x) | 541k (1.01x) | 541k (1.02x) | 541k (1.02x) | 512k (1.03x) | 448k (1.08x) | 544k (1.02x) |
(default_c52) libsecp256k1 [Bitcoin Core 2021] (w/o 810) | C | 699k (1.04x) | 561k (1.04x) | 559k (1.04x) | 546k (1.03x) | 575k (1.08x) | 574k (1.08x) | 572k (1.08x) | 538k (1.08x) | 454k (1.09x) | 561k (1.06x) |
(default_c) libsecp256k1 [Bitcoin Core 2021] (w/ 810) | C | 671k (1.00x) | 538k (1.00x) | 538k (1.00x) | 530k (1.00x) | 561k (1.05x) | 563k (1.06x) | 562k (1.06x) | 508k (1.02x) | 435k (1.05x) | 542k (1.02x) |
(fiat_c) libsecp256k1 + Fiat-C | C | 671k (1.00x) | 541k (1.00x) | 540k (1.00x) | 537k (1.01x) | 551k (1.03x) | 551k (1.04x) | 550k (1.04x) | 508k (1.02x) | 425k (1.02x) | 538k (1.01x) |
(fiat_cryptopt) libsecp256k1+Best (Fiat) | asm | 703k (1.05x) | 539k (1.00x) | 539k (1.00x) | 536k (1.01x) | 533k (1.00x) | 530k (1.00x) | 531k (1.00x) | 498k (1.00x) | 415k (1.00x) | 532k (1.00x) |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 29.3 | 29.1 | 30.2 | 30.1 | 27.1 |
ecmult_const | 54.9 | 55.0 | 57.6 | 57.0 | 49.6 |
ecmult_1p | 42.8 | 42.7 | 44.8 | 44.3 | 38.3 |
ecmult_0p_g | 29.3 | 29.6 | 31.7 | 31.3 | 27.0 |
ecmult_1p_g | 25.1 | 24.9 | 26.1 | 25.9 | 22.5 |
field_half | 0.00425 | 0.00425 | 0.00429 | 0.00429 | 0.00424 |
field_normalize | 0.0143 | 0.0144 | 0.0145 | 0.0143 | 0.0144 |
field_normalize_weak | 0.00566 | 0.00585 | 0.00566 | 0.00566 | 0.00584 |
field_sqr | 0.0243 | 0.0237 | 0.0242 | 0.0218 | 0.0214 |
field_mul | 0.0302 | 0.0266 | 0.0285 | 0.0305 | 0.0253 |
field_inverse | 2.70 | 2.69 | 2.69 | 2.70 | 2.68 |
field_inverse_var | 1.82 | 1.80 | 1.80 | 1.80 | 1.80 |
field_is_square_var | 2.31 | 2.34 | 2.36 | 2.30 | 2.32 |
field_sqrt | 6.79 | 6.50 | 6.58 | 6.10 | 5.87 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 70.0 | 69.3 | 72.1 | 71.7 | 64.7 |
ecmult_const | 131.0 | 131.0 | 137.0 | 136.0 | 118.0 |
ecmult_1p | 102.0 | 102.0 | 107.0 | 106.0 | 91.5 |
ecmult_0p_g | 70.0 | 70.9 | 74.7 | 73.6 | 63.4 |
ecmult_1p_g | 59.9 | 59.6 | 62.4 | 61.8 | 53.8 |
field_half | 0.00997 | 0.00993 | 0.0101 | 0.0105 | 0.00994 |
field_normalize | 0.0344 | 0.0344 | 0.0344 | 0.0344 | 0.0345 |
field_normalize_weak | 0.0132 | 0.0137 | 0.0132 | 0.0132 | 0.0137 |
field_sqr | 0.0576 | 0.0559 | 0.0572 | 0.0515 | 0.0507 |
field_mul | 0.0714 | 0.0629 | 0.0722 | 0.0676 | 0.0599 |
field_inverse | 6.16 | 6.16 | 6.11 | 6.15 | 6.15 |
field_inverse_var | 4.33 | 4.32 | 4.32 | 4.36 | 4.32 |
field_is_square_var | 5.65 | 5.59 | 5.63 | 5.68 | 5.59 |
field_sqrt | 16.2 | 15.5 | 15.7 | 14.5 | 14.0 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 20.5 | 20.4 | 21.1 | 21.1 | 19.0 |
ecmult_const | 38.4 | 38.5 | 40.3 | 39.8 | 34.7 |
ecmult_1p | 30.0 | 29.9 | 31.3 | 31.0 | 26.8 |
ecmult_0p_g | 20.6 | 20.7 | 21.8 | 21.5 | 18.5 |
ecmult_1p_g | 17.6 | 17.4 | 18.3 | 18.1 | 15.7 |
field_half | 0.00296 | 0.00296 | 0.00297 | 0.00293 | 0.00298 |
field_normalize | 0.0101 | 0.0101 | 0.0101 | 0.0101 | 0.0101 |
field_normalize_weak | 0.00393 | 0.00393 | 0.00392 | 0.00393 | 0.00392 |
field_sqr | 0.0171 | 0.0166 | 0.0171 | 0.0153 | 0.0151 |
field_mul | 0.0212 | 0.0187 | 0.0201 | 0.0201 | 0.0178 |
field_inverse | 1.80 | 1.80 | 1.80 | 1.80 | 1.80 |
field_inverse_var | 1.27 | 1.27 | 1.27 | 1.27 | 1.26 |
field_is_square_var | 1.62 | 1.65 | 1.64 | 1.62 | 1.62 |
field_sqrt | 4.75 | 4.54 | 4.62 | 4.26 | 4.10 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 19.9 | 17.8 | 19.3 | 18.8 | 17.8 |
ecmult_const | 38.3 | 34.2 | 37.1 | 35.4 | 33.4 |
ecmult_1p | 29.9 | 26.4 | 28.9 | 27.3 | 26.0 |
ecmult_0p_g | 20.9 | 18.4 | 20.1 | 18.8 | 18.1 |
ecmult_1p_g | 17.5 | 15.5 | 16.7 | 15.9 | 15.3 |
field_half | 0.00304 | 0.00303 | 0.00303 | 0.00303 | 0.00305 |
field_normalize | 0.0102 | 0.0105 | 0.0104 | 0.0101 | 0.0101 |
field_normalize_weak | 0.00394 | 0.00394 | 0.00410 | 0.00415 | 0.00416 |
field_sqr | 0.0195 | 0.0171 | 0.0181 | 0.0170 | 0.0164 |
field_mul | 0.0225 | 0.0182 | 0.0191 | 0.0196 | 0.0203 |
field_inverse | 1.89 | 1.89 | 1.89 | 1.89 | 1.97 |
field_inverse_var | 1.22 | 1.22 | 1.21 | 1.22 | 1.25 |
field_is_square_var | 1.67 | 1.67 | 1.68 | 1.67 | 1.68 |
field_sqrt | 5.35 | 4.69 | 5.12 | 4.51 | 4.65 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 19.5 | 17.8 | 18.6 | 17.9 | 17.6 |
ecmult_const | 38.4 | 34.3 | 36.2 | 34.6 | 33.8 |
ecmult_1p | 30.7 | 27.3 | 28.8 | 27.5 | 26.8 |
ecmult_0p_g | 21.4 | 18.8 | 20.0 | 19.1 | 18.6 |
ecmult_1p_g | 18.0 | 16.0 | 16.8 | 16.1 | 15.7 |
field_half | 0.00330 | 0.00332 | 0.00337 | 0.00335 | 0.00333 |
field_normalize | 0.0109 | 0.0110 | 0.0110 | 0.0109 | 0.0109 |
field_normalize_weak | 0.00452 | 0.00451 | 0.00449 | 0.00450 | 0.00451 |
field_sqr | 0.0191 | 0.0169 | 0.0193 | 0.0167 | 0.0177 |
field_mul | 0.0256 | 0.0202 | 0.0210 | 0.0196 | 0.0191 |
field_inverse | 2.13 | 2.12 | 2.13 | 2.13 | 2.13 |
field_inverse_var | 1.37 | 1.37 | 1.37 | 1.36 | 1.38 |
field_is_square_var | 1.88 | 1.87 | 1.87 | 1.88 | 1.88 |
field_sqrt | 5.43 | 4.97 | 5.56 | 4.82 | 5.04 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 25.7 | 23.5 | 26.0 | 24.4 | 25.1 |
ecmult_const | 50.9 | 45.9 | 51.4 | 47.4 | 48.5 |
ecmult_1p | 39.6 | 35.5 | 40.0 | 37.1 | 37.1 |
ecmult_0p_g | 27.5 | 24.7 | 28.5 | 26.3 | 26.6 |
ecmult_1p_g | 23.0 | 20.6 | 23.2 | 21.6 | 21.5 |
field_half | 0.00270 | 0.00270 | 0.00271 | 0.00293 | 0.00269 |
field_normalize | 0.0104 | 0.0104 | 0.0104 | 0.0104 | 0.0104 |
field_normalize_weak | 0.00365 | 0.00365 | 0.00365 | 0.00365 | 0.00365 |
field_sqr | 0.0216 | 0.0202 | 0.0224 | 0.0198 | 0.0202 |
field_mul | 0.0267 | 0.0235 | 0.0263 | 0.0246 | 0.0244 |
field_inverse | 2.04 | 2.04 | 2.05 | 2.05 | 2.04 |
field_inverse_var | 1.21 | 1.22 | 1.21 | 1.22 | 1.21 |
field_is_square_var | 1.51 | 1.50 | 1.53 | 1.51 | 1.56 |
field_sqrt | 5.96 | 5.60 | 6.10 | 5.50 | 5.54 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 20.6 | 19.2 | 20.7 | 20.3 | 19.3 |
ecmult_const | 38.2 | 35.2 | 39.3 | 37.6 | 35.5 |
ecmult_1p | 31.4 | 27.5 | 30.6 | 29.5 | 27.7 |
ecmult_0p_g | 20.9 | 19.1 | 21.6 | 20.5 | 19.4 |
ecmult_1p_g | 18.4 | 16.0 | 17.8 | 17.2 | 16.2 |
field_half | 0.00252 | 0.00251 | 0.00251 | 0.00264 | 0.00251 |
field_normalize | 0.00818 | 0.00818 | 0.00818 | 0.00819 | 0.00818 |
field_normalize_weak | 0.00343 | 0.00343 | 0.00343 | 0.00343 | 0.00343 |
field_sqr | 0.0180 | 0.0147 | 0.0177 | 0.0152 | 0.0157 |
field_mul | 0.0199 | 0.0167 | 0.0194 | 0.0195 | 0.0191 |
field_inverse | 1.58 | 1.58 | 1.58 | 1.58 | 1.58 |
field_inverse_var | 1.03 | 1.03 | 1.03 | 1.03 | 1.03 |
field_is_square_var | 1.37 | 1.38 | 1.38 | 1.38 | 1.36 |
field_sqrt | 4.72 | 4.23 | 4.83 | 4.30 | 4.35 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 23.1 | 21.4 | 23.2 | 22.7 | 21.6 |
ecmult_const | 42.6 | 39.3 | 43.9 | 42.0 | 39.6 |
ecmult_1p | 34.9 | 30.7 | 34.1 | 32.9 | 30.9 |
ecmult_0p_g | 23.1 | 21.2 | 24.7 | 23.4 | 22.3 |
ecmult_1p_g | 20.4 | 17.9 | 19.9 | 19.2 | 18.1 |
field_half | 0.00280 | 0.00280 | 0.00280 | 0.00294 | 0.00280 |
field_normalize | 0.00914 | 0.00912 | 0.00913 | 0.00913 | 0.00913 |
field_normalize_weak | 0.00382 | 0.00383 | 0.00382 | 0.00382 | 0.00382 |
field_sqr | 0.0200 | 0.0164 | 0.0197 | 0.0169 | 0.0176 |
field_mul | 0.0221 | 0.0188 | 0.0216 | 0.0216 | 0.0213 |
field_inverse | 1.77 | 1.76 | 1.76 | 1.77 | 1.77 |
field_inverse_var | 1.15 | 1.15 | 1.15 | 1.15 | 1.18 |
field_is_square_var | 1.52 | 1.54 | 1.54 | 1.54 | 1.52 |
field_sqrt | 5.26 | 4.71 | 5.39 | 4.80 | 4.85 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 16.8 | 15.5 | 16.8 | 16.4 | 15.8 |
ecmult_const | 32.0 | 29.2 | 32.4 | 31.0 | 29.8 |
ecmult_1p | 25.1 | 22.7 | 25.3 | 24.4 | 23.4 |
ecmult_0p_g | 17.4 | 15.8 | 18.4 | 17.6 | 17.0 |
ecmult_1p_g | 14.6 | 13.2 | 14.7 | 14.2 | 13.6 |
field_half | 0.00212 | 0.00212 | 0.00212 | 0.00212 | 0.00213 |
field_normalize | 0.00698 | 0.00697 | 0.00696 | 0.00695 | 0.00696 |
field_normalize_weak | 0.00290 | 0.00290 | 0.00290 | 0.00290 | 0.00290 |
field_sqr | 0.0143 | 0.0130 | 0.0141 | 0.0128 | 0.0125 |
field_mul | 0.0165 | 0.0142 | 0.0170 | 0.0161 | 0.0149 |
field_inverse | 1.34 | 1.34 | 1.34 | 1.34 | 1.34 |
field_inverse_var | 0.825 | 0.827 | 0.825 | 0.838 | 0.825 |
field_is_square_var | 1.06 | 1.09 | 1.09 | 1.09 | 1.06 |
field_sqrt | 4.12 | 3.62 | 3.93 | 3.50 | 3.51 |
In regards to merging this: I've included the implementations as `.c` files, and the methods are not static. In the case of the libsecp benchmarks, I've added the implementation as a C source to the `bench{,_internal,_ecmult}_SOURCES` targets (see here).
Side note: I wanted to include the implementations as `.asm` files first (as those are the formally verified artifacts), but then I think we would be in trouble once we compile for non-System V ABIs. So I've used `nasm` to assemble the `.asm` files and `objdump` to dump the code in AT&T syntax (afaik clang sometimes complains otherwise), and included that in the C source. With that, I hope the compiler is able (if needed) to swap the argument registers if one compiles for a non-System V target. And when I tried to include the methods as `inline static` in `.h` files, any optimization level of the compiler resulted in segfaults, because the registers were not set correctly. (I'm too unfamiliar with this inline-asm notation.)
In regards to "tell cryptopt to emit general code (i.e. no `mulx`, `adox`, `adcx`)": that is not too trivial. I'll see what I can do, but I don't think we should expect anything here too soon.
Thanks, you have a nice benchmark suite. :) Interestingly, `fiat_c` beats `default_c` on SUPERCOP, but not in the other benchmarks. Any idea why this is the case?
Anyway, I don't think that any of this should change the plan sketched above. Since we're not in a hurry, it would be awesome to have https://github.com/mit-plv/fiat-crypto/issues/1582 solved first.
In regards to "tell cryptopt to emit general code (i.e. no mulx, adox, adcx)": that is not too trivial. I'll see what I can do, but I don't think we should expect anything here too soon.
Oh okay, we assumed it's a simple change. Anyway, should we decide that we want asm, which is anyway CPU-specific, then I think it makes sense to go 100% for it and use CPUID.
fiat_c beats default_c on SUPERCOP, but not in the other benchmarks. Any idea why this is the case?
SUPERCOP tries many different compilers / compiler settings (see below), whereas for the `./bench_{}` runs I've just used the default flags (see the `./configure` output below). I'm happy to do that with other compilers, too, if you think that's useful.
Oh okay, we assumed it's a simple change. Anyway, should we decide that we want asm, which is anyway CPU-specific, then I think it makes sense to go 100% for it and use CPUID.
I'll keep it in mind. I'm a bit divided on this, though. CPUs since Oct '14 (Broadwell) support BMI2 / ADX, so it feels wrong to not use them. It would be interesting to see how the usage of this library is distributed... On the other hand, I do understand the wish to be as flexible as possible.
I'm using Clang 15 and GCC 12.
gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE
clang -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -Os -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -O2 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -O -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
Output of `./configure` on the i7-11700K, for `./fiat_cryptopt`:
Build Options:
with external callbacks = no
with benchmarks = yes
with tests = yes
with ctime tests = yes
with coverage = no
with examples = no
module ecdh = yes
module recovery = no
module extrakeys = yes
module schnorrsig = yes
asm = x86_64
ecmult window size = 15
ecmult gen prec. bits = 4
valgrind = yes
CC = gcc
CPPFLAGS =
SECP_CFLAGS = -O2 -std=c89 -pedantic -Wno-long-long -Wnested-externs -Wshadow -Wstrict-prototypes -Wundef -Wno-overlength-strings -Wall -Wno-unused-function -Wextra -Wcast-align -Wcast-align=strict -fvisibility=hidden
CFLAGS = -g -O2
LDFLAGS =
SUPERCOP tries many different compilers / compiler settings (see below), whereas for the ./bench_{} I've just used the default flags (see the ./configure output below).
Makes sense.
I'm happy to do that with other compilers, too, if you think that's useful.
I'm not convinced that it will provide much more insight right now.
CPUs since Oct '14 (Broadwell) support BMI2 / ADX, so it feels wrong to not use them. It would be interesting to see how the usage of this library is distributed...
Yep, exactly my thoughts. If you really have an older CPU, the C code will serve you well.
Concept ACK on investigating fiat-crypto and (longer term) cryptopt. Would be great to have a verified implementation that's only slightly slower or even faster.
Thanks @dderjoel for the benchmarks.
Owen implemented #810 in the Fiat repo (see PR https://github.com/mit-plv/fiat-crypto/issues/1582); it's not yet fully merged, but I could not wait to benchmark it.
I've updated secp_bench and supercop with the new C and CryptOpt-optimized versions.
I've also added our 12th Gen machine back into the benchmark suite; here are the updated numbers (post copied from above, with changes indicated):

I've used ~~nine~~ ten machines:

CPU | 𝜇-architecture |
---|---|
Intel Core i7-6770HQ | Skylake-H |
Intel Core i7-10710U | Comet Lake-U |
Intel Core i9-10900K | Comet Lake-S |
Intel Core i7-11700KF | Rocket Lake-S |
Intel Core i9-12900KF | Alder Lake-S |
Intel Core i9-13900KF | Raptor Lake-S |
AMD Ryzen Threadripper 1900X | Zen 1 |
AMD Ryzen 7 5800X | Zen 3 |
AMD Ryzen 9 5950X | Zen 3 |
AMD Ryzen 9 7950X | Zen 4 |
I've created a little project here, which takes this library as the base.

- For `default_asm`, it compiles with `./configure --with-asm=x86_64` using the current asm implementation.
- For `default_c`, it compiles with `./configure --with-asm=no` using the current C implementation.
- For `default_c52`, it replaces the C implementation with the version before PR #810, then compiles with `./configure --with-asm=no`. I've included that to compare with `fiat_c`.
- For `fiat_c`, it replaces the C implementation with the ~~Fiat Cryptography implementation (which does not feature the #810 optimization)~~ merge-in-progress Fiat Cryptography implementation, now featuring the #810 optimization.
- For `fiat_cryptopt`, it uses Fiat as a starting point, then optimizes mul and square on all those machines; I've included the 'on average best' one (implicitly also now using the optimization).
So on average, we see that for `ecmult_gen`, `fiat_c` is more performant than `default_asm` and slightly more performant than `default_c` (plus giving formal assurance), and `fiat_cryptopt` is ~~even faster than that and the current `default_c`~~ the fastest (plus formal assurance).
Numbers for the `secp256k1` benchmarks `./bench_internal simple` and `./bench_ecmult` (I've commented out the `multi_` benches so they finish in a timely manner):
implementation | default_asm | default_c | default_c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 16.1257 | 14.9874 | 15.8764 | 14.9714 | 14.1071 |
ecmult_const | 30.3317 | 28.2041 | 30.3634 | 28.1846 | 26.1778 |
ecmult_1p | 23.9575 | 22.0809 | 23.641 | 22.1608 | 20.5615 |
ecmult_0p_g | 16.9942 | 15.8604 | 16.9429 | 15.9223 | 14.7684 |
ecmult_1p_g | 14.0591 | 12.8636 | 13.7521 | 12.9906 | 12.0012 |
field_half | 0.0024964 | 0.0024537 | 0.00237626 | 0.0023697 | 0.00237259 |
field_normalize | 0.00737954 | 0.00744617 | 0.00736519 | 0.00732155 | 0.00733252 |
field_normalize_weak | 0.00294988 | 0.00295974 | 0.00293822 | 0.00295422 | 0.00294591 |
field_sqr | 0.014059 | 0.0126383 | 0.0137789 | 0.0123009 | 0.0119273 |
field_mul | 0.0170241 | 0.0144868 | 0.0156579 | 0.014689 | 0.0144257 |
field_inverse | 1.40048 | 1.40896 | 1.39608 | 1.38928 | 1.3986 |
field_inverse_var | 0.915942 | 0.919022 | 0.912739 | 0.907769 | 0.910547 |
field_is_square_var | 1.20313 | 1.21123 | 1.20451 | 1.19417 | 1.19385 |
field_sqrt | 3.87547 | 3.56868 | 3.82354 | 3.44445 | 3.35331 |
In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I've included implementations with the same mul/square. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below), I've included one particular implementation which I ran on all machines. The numbers are cycles for four scalar multiplications (two base-point, two variable-point). We see that on the geometric mean (G.M.), `fiat_cryptopt` is again the most performant one. (In fact on every machine except the 1900X, where Fiat-C is the most performant one.)
Implementation | Lang. | 1900X | 5800X | 5950X | 7950X | i7 6G | i7 10G | i9 10G | i7 11G | i9 12G | i9 13G | G.M. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
(default_asm) libsecp256k1 [Bitcoin Core 2021] | asm | 719k (1.11x) | 571k (1.11x) | 567k (1.10x) | 560k (1.11x) | 565k (1.07x) | 565k (1.08x) | 564k (1.08x) | 540k (1.13x) | 438k (1.09x) | 439k (1.11x) | 548k (1.09x) |
libsecp256k1+Best (Opt) | asm | 712k (1.10x) | 543k (1.06x) | 545k (1.06x) | 544k (1.08x) | 541k (1.03x) | 541k (1.03x) | 542k (1.04x) | 512k (1.07x) | 449k (1.12x) | 447k (1.13x) | 533k (1.07x) |
(default_c52) libsecp256k1 [Bitcoin Core 2021] (w/o 810) | C | 699k (1.08x) | 560k (1.09x) | 561k (1.09x) | 547k (1.09x) | 574k (1.09x) | 574k (1.10x) | 573k (1.10x) | 536k (1.12x) | 459k (1.14x) | 456k (1.15x) | 550k (1.10x) |
(default_c) libsecp256k1 [Bitcoin Core 2021] (w/ 810) | C | 671k (1.04x) | 538k (1.05x) | 537k (1.04x) | 531k (1.06x) | 567k (1.08x) | 563k (1.07x) | 562k (1.08x) | 504k (1.06x) | 440k (1.09x) | 439k (1.11x) | 531k (1.06x) |
(fiat_c) libsecp256k1 + Fiat-C | C | 648k (1.00x) | 537k (1.05x) | 536k (1.04x) | 536k (1.07x) | 549k (1.04x) | 546k (1.04x) | 548k (1.05x) | 492k (1.03x) | 418k (1.04x) | 418k (1.05x) | 519k (1.04x) |
(fiat_cryptopt) libsecp256k1+Best (Fiat) | asm | 675k (1.04x) | 514k (1.00x) | 515k (1.00x) | 503k (1.00x) | 526k (1.00x) | 524k (1.00x) | 523k (1.00x) | 477k (1.00x) | 402k (1.00x) | 397k (1.00x) | 500k (1.00x) |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 21.9 | 21.6 | 22.4 | 22.2 | 20.0 |
ecmult_const | 40.7 | 40.8 | 42.7 | 41.7 | 36.4 |
ecmult_1p | 31.8 | 31.7 | 33.2 | 32.4 | 28.1 |
ecmult_0p_g | 22.7 | 22.9 | 24.0 | 23.4 | 20.2 |
ecmult_1p_g | 18.6 | 18.5 | 19.4 | 18.9 | 16.5 |
field_half | 0.00470 | 0.00441 | 0.00364 | 0.00436 | 0.00378 |
field_normalize | 0.0106 | 0.0108 | 0.0107 | 0.0107 | 0.0107 |
field_normalize_weak | 0.00420 | 0.00420 | 0.00421 | 0.00422 | 0.00421 |
field_sqr | 0.0184 | 0.0175 | 0.0180 | 0.0169 | 0.0162 |
field_mul | 0.0224 | 0.0198 | 0.0213 | 0.0221 | 0.0181 |
field_inverse | 1.99 | 2.00 | 2.01 | 1.99 | 2.00 |
field_inverse_var | 1.36 | 1.34 | 1.34 | 1.34 | 1.34 |
field_is_square_var | 1.73 | 1.74 | 1.74 | 1.72 | 1.73 |
field_sqrt | 5.04 | 4.83 | 4.90 | 4.66 | 4.44 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 19.8 | 19.3 | 20.1 | 19.7 | 17.8 |
ecmult_const | 35.8 | 35.7 | 37.4 | 36.4 | 31.9 |
ecmult_1p | 28.1 | 28.2 | 29.4 | 28.6 | 24.9 |
ecmult_0p_g | 20.3 | 20.6 | 21.4 | 21.0 | 18.2 |
ecmult_1p_g | 16.5 | 16.4 | 17.2 | 16.8 | 14.6 |
field_half | 0.00327 | 0.00370 | 0.00356 | 0.00280 | 0.00368 |
field_normalize | 0.00956 | 0.00956 | 0.00932 | 0.00954 | 0.00932 |
field_normalize_weak | 0.00373 | 0.00373 | 0.00364 | 0.00387 | 0.00364 |
field_sqr | 0.0167 | 0.0157 | 0.0157 | 0.0151 | 0.0144 |
field_mul | 0.0201 | 0.0177 | 0.0188 | 0.0186 | 0.0163 |
field_inverse | 1.71 | 1.72 | 1.72 | 1.71 | 1.72 |
field_inverse_var | 1.21 | 1.21 | 1.21 | 1.22 | 1.22 |
field_is_square_var | 1.60 | 1.54 | 1.56 | 1.57 | 1.56 |
field_sqrt | 4.50 | 4.22 | 4.30 | 4.09 | 3.90 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 14.9 | 14.8 | 15.3 | 15.1 | 13.6 |
ecmult_const | 27.9 | 27.9 | 29.5 | 28.5 | 24.9 |
ecmult_1p | 21.8 | 21.7 | 22.7 | 22.2 | 19.3 |
ecmult_0p_g | 15.0 | 15.2 | 15.8 | 15.4 | 13.3 |
ecmult_1p_g | 12.8 | 12.6 | 13.3 | 13.0 | 11.3 |
field_half | 0.00222 | 0.00218 | 0.00218 | 0.00215 | 0.00218 |
field_normalize | 0.00739 | 0.00731 | 0.00731 | 0.00733 | 0.00731 |
field_normalize_weak | 0.00285 | 0.00285 | 0.00286 | 0.00285 | 0.00285 |
field_sqr | 0.0125 | 0.0120 | 0.0123 | 0.0115 | 0.0111 |
field_mul | 0.0154 | 0.0136 | 0.0145 | 0.0142 | 0.0125 |
field_inverse | 1.31 | 1.31 | 1.31 | 1.31 | 1.31 |
field_inverse_var | 0.920 | 0.920 | 0.919 | 0.920 | 0.917 |
field_is_square_var | 1.18 | 1.19 | 1.19 | 1.17 | 1.17 |
field_sqrt | 3.44 | 3.30 | 3.35 | 3.18 | 3.04 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 17.1 | 14.4 | 15.1 | 13.1 | 12.7 |
ecmult_const | 31.5 | 27.3 | 28.0 | 24.5 | 23.5 |
ecmult_1p | 23.7 | 21.2 | 20.6 | 19.5 | 19.5 |
ecmult_0p_g | 16.6 | 14.7 | 14.4 | 14.3 | 13.6 |
ecmult_1p_g | 14.0 | 12.2 | 12.1 | 11.6 | 11.4 |
field_half | 0.00239 | 0.00237 | 0.00240 | 0.00223 | 0.00227 |
field_normalize | 0.00779 | 0.00837 | 0.00805 | 0.00730 | 0.00740 |
field_normalize_weak | 0.00320 | 0.00327 | 0.00325 | 0.00296 | 0.00313 |
field_sqr | 0.0155 | 0.0136 | 0.0142 | 0.0125 | 0.0114 |
field_mul | 0.0175 | 0.0145 | 0.0150 | 0.0128 | 0.0144 |
field_inverse | 1.48 | 1.50 | 1.48 | 1.36 | 1.36 |
field_inverse_var | 0.951 | 0.969 | 0.961 | 0.877 | 0.877 |
field_is_square_var | 1.32 | 1.33 | 1.32 | 1.20 | 1.20 |
field_sqrt | 4.18 | 3.74 | 3.92 | 3.37 | 3.15 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 11.6 | 10.6 | 11.0 | 10.4 | 9.96 |
ecmult_const | 22.6 | 20.2 | 21.3 | 19.9 | 18.9 |
ecmult_1p | 18.1 | 16.0 | 17.0 | 15.8 | 15.0 |
ecmult_0p_g | 13.5 | 11.8 | 12.5 | 11.6 | 11.9 |
ecmult_1p_g | 10.6 | 9.37 | 9.92 | 9.23 | 8.80 |
field_half | 0.00197 | 0.00198 | 0.00196 | 0.00197 | 0.00199 |
field_normalize | 0.00638 | 0.00626 | 0.00639 | 0.00641 | 0.00637 |
field_normalize_weak | 0.00266 | 0.00261 | 0.00264 | 0.00266 | 0.00266 |
field_sqr | 0.0114 | 0.00948 | 0.0114 | 0.00946 | 0.00978 |
field_mul | 0.0152 | 0.0115 | 0.0122 | 0.0113 | 0.0122 |
field_inverse | 1.25 | 1.23 | 1.25 | 1.25 | 1.25 |
field_inverse_var | 0.807 | 0.794 | 0.814 | 0.807 | 0.814 |
field_is_square_var | 1.10 | 1.08 | 1.10 | 1.10 | 1.10 |
field_sqrt | 3.26 | 2.91 | 3.25 | 2.82 | 2.84 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 10.7 | 9.23 | 9.61 | 9.10 | 8.70 |
ecmult_const | 20.1 | 17.7 | 18.7 | 17.5 | 16.6 |
ecmult_1p | 15.9 | 14.1 | 14.9 | 13.9 | 13.2 |
ecmult_0p_g | 11.4 | 9.95 | 10.5 | 9.66 | 9.18 |
ecmult_1p_g | 9.30 | 8.23 | 8.72 | 8.12 | 7.74 |
field_half | 0.00183 | 0.00183 | 0.00171 | 0.00178 | 0.00183 |
field_normalize | 0.00562 | 0.00602 | 0.00571 | 0.00568 | 0.00595 |
field_normalize_weak | 0.00235 | 0.00247 | 0.00233 | 0.00233 | 0.00247 |
field_sqr | 0.00978 | 0.00932 | 0.00986 | 0.00837 | 0.00905 |
field_mul | 0.0132 | 0.0113 | 0.0107 | 0.0102 | 0.0113 |
field_inverse | 1.10 | 1.16 | 1.10 | 1.10 | 1.16 |
field_inverse_var | 0.705 | 0.747 | 0.709 | 0.707 | 0.745 |
field_is_square_var | 0.970 | 1.02 | 0.967 | 0.972 | 1.02 |
field_sqrt | 2.80 | 2.71 | 2.88 | 2.44 | 2.63 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 23.6 | 21.7 | 23.9 | 22.1 | 22.5 |
ecmult_const | 46.6 | 42.1 | 47.1 | 42.7 | 43.4 |
ecmult_1p | 36.1 | 32.4 | 36.4 | 33.0 | 32.9 |
ecmult_0p_g | 25.8 | 23.4 | 26.2 | 23.6 | 23.2 |
ecmult_1p_g | 21.0 | 18.9 | 21.2 | 19.2 | 19.1 |
field_half | 0.00264 | 0.00273 | 0.00270 | 0.00270 | 0.00275 |
field_normalize | 0.00960 | 0.00941 | 0.00962 | 0.00942 | 0.00942 |
field_normalize_weak | 0.00330 | 0.00332 | 0.00333 | 0.00355 | 0.00332 |
field_sqr | 0.0196 | 0.0184 | 0.0204 | 0.0175 | 0.0172 |
field_mul | 0.0241 | 0.0214 | 0.0240 | 0.0219 | 0.0218 |
field_inverse | 1.82 | 1.86 | 1.84 | 1.84 | 1.86 |
field_inverse_var | 1.11 | 1.11 | 1.11 | 1.10 | 1.10 |
field_is_square_var | 1.36 | 1.39 | 1.40 | 1.39 | 1.41 |
field_sqrt | 5.38 | 5.08 | 5.56 | 4.82 | 4.77 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 16.3 | 15.4 | 16.8 | 15.6 | 14.5 |
ecmult_const | 30.1 | 28.1 | 31.9 | 28.8 | 26.6 |
ecmult_1p | 24.7 | 22.0 | 24.8 | 22.8 | 20.8 |
ecmult_0p_g | 16.5 | 15.5 | 17.5 | 15.7 | 14.5 |
ecmult_1p_g | 14.6 | 12.9 | 14.4 | 13.2 | 12.1 |
field_half | 0.00255 | 0.00219 | 0.00221 | 0.00238 | 0.00199 |
field_normalize | 0.00661 | 0.00646 | 0.00647 | 0.00655 | 0.00647 |
field_normalize_weak | 0.00275 | 0.00270 | 0.00270 | 0.00272 | 0.00270 |
field_sqr | 0.0142 | 0.0117 | 0.0140 | 0.0124 | 0.0115 |
field_mul | 0.0161 | 0.0136 | 0.0155 | 0.0144 | 0.0142 |
field_inverse | 1.28 | 1.26 | 1.25 | 1.27 | 1.26 |
field_inverse_var | 0.833 | 0.822 | 0.815 | 0.826 | 0.816 |
field_is_square_var | 1.10 | 1.09 | 1.09 | 1.10 | 1.07 |
field_sqrt | 3.81 | 3.35 | 3.84 | 3.47 | 3.30 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 16.1 | 15.5 | 16.7 | 15.3 | 14.6 |
ecmult_const | 30.6 | 28.4 | 31.6 | 28.1 | 26.7 |
ecmult_1p | 25.4 | 22.4 | 25.0 | 22.1 | 20.8 |
ecmult_0p_g | 17.7 | 16.5 | 18.0 | 16.3 | 15.3 |
ecmult_1p_g | 15.0 | 13.1 | 14.1 | 13.2 | 12.2 |
field_half | 0.00223 | 0.00221 | 0.00224 | 0.00224 | 0.00218 |
field_normalize | 0.00649 | 0.00642 | 0.00642 | 0.00643 | 0.00645 |
field_normalize_weak | 0.00270 | 0.00267 | 0.00267 | 0.00267 | 0.00267 |
field_sqr | 0.0142 | 0.0115 | 0.0139 | 0.0121 | 0.0114 |
field_mul | 0.0158 | 0.0132 | 0.0154 | 0.0140 | 0.0140 |
field_inverse | 1.26 | 1.24 | 1.24 | 1.25 | 1.25 |
field_inverse_var | 0.819 | 0.808 | 0.817 | 0.816 | 0.811 |
field_is_square_var | 1.08 | 1.08 | 1.08 | 1.08 | 1.07 |
field_sqrt | 3.75 | 3.33 | 3.81 | 3.39 | 3.27 |
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 14.0 | 12.7 | 13.8 | 13.1 | 12.3 |
ecmult_const | 26.4 | 23.9 | 26.7 | 24.8 | 23.0 |
ecmult_1p | 20.7 | 18.6 | 20.9 | 19.5 | 18.2 |
ecmult_0p_g | 15.1 | 13.6 | 15.4 | 14.3 | 13.2 |
ecmult_1p_g | 12.1 | 10.8 | 12.1 | 11.4 | 10.4 |
field_half | 0.00213 | 0.00196 | 0.00189 | 0.00189 | 0.00189 |
field_normalize | 0.00556 | 0.00564 | 0.00549 | 0.00561 | 0.00559 |
field_normalize_weak | 0.00228 | 0.00229 | 0.00228 | 0.00230 | 0.00229 |
field_sqr | 0.0115 | 0.0106 | 0.0112 | 0.0104 | 0.00989 |
field_mul | 0.0137 | 0.0118 | 0.0136 | 0.0123 | 0.0123 |
field_inverse | 1.09 | 1.10 | 1.07 | 1.10 | 1.10 |
field_inverse_var | 0.670 | 0.680 | 0.657 | 0.681 | 0.675 |
field_is_square_var | 0.855 | 0.889 | 0.860 | 0.881 | 0.857 |
field_sqrt | 3.34 | 2.98 | 3.17 | 2.95 | 2.82 |
Hi everyone,
I've integrated the new implementations (CryptOpt assembly and Fiat-C) into my fork, and it works with both `./configure --with-asm=no` and `./configure --with-asm=x86_64`.
I am unsure how to run the CI to see if it works (especially on Windows).
Also, the current commits don't check for the CPU feature flags `BMI2` and `ADX`. Is the idea to check that at runtime, at compile time, or both?
Also, I can't figure out how I would make the field arithmetic `static` and `inline` again (if that is needed).
I'm happy to do the bulk work, but I'd need some guidance on the following:
- … (`src/third_party`) and whether to force them to be static
- … (`/third_party/*.c`)

Hi everyone,
I've updated my fork in three ways:
- … `rdx` and `rdi`. (Does anyone know how to tell the compiler that those are clobbered AND read at the same time?)
- … `./build-aux/m4/bitcoin_secp.m4`. I've found existing macros to check common CPU features and to check via builtins already, but I was unable to reliably detect `adx` (especially with clang, which does not support it in `__builtin_cpu_supports`; cf. the list of supported flags).

I believe the last thing for a PR would be to check during the CMake build process whether BMI2 and ADX are available. Does anyone have other ideas on what we need here?
@dderjoel Thanks for the updates, I'm very glad to see the progress here.
My thinking is that we'll actually want to split this into two separate efforts, which we can discuss and consider with different timelines:
The first is straightforward, I think: your benchmarks show that the Fiat-Crypto generated C code (with the #810 optimization) is competitive with the C code we already have, so I think there is little reason not to just outright replace that code. Since there is no reason to keep the old code too, we don't need any decision logic or build system support to choose one over the other. I have some comments on the implementation, but that can wait for a PR. Overall, this feels very close to usable.
Introducing CryptOpt is a more complex matter: because of its use of ADCX/ADOX/MULX, it cannot be a drop-in replacement for the assembly code we already have. While I have no problem with optimizing exclusively for architectures which do have these instructions, we can't just drop support for those which don't (but falling back to the C implementation, or another less-optimized asm implementation, on those is OK, IMO). There are a number of options:
Separate comment, as it feels unrelated to the rest.
In inline asm blocks, all variable templates fall into one of these categories:
- Input variables
- Early-clobber output variables (`=&`)
- Output variables (`=`)
- Input-output variables (`+`)
- (not actually variables) explicitly clobbered registers (which won't be assigned to input or output variables)

Normal output variables may be assigned the same register as an input variable, but early-clobber output variables won't.
If there is a PR, I'm happy to look over it.
I agree that we should add Fiat-Crypto first separately. I wonder what would be necessary for reviewers to be convinced that the code is correct and what exact guarantees the Fiat-Crypto proofs provide.
Introducing CryptOpt is a more complex matter [...]
I think a reasonable path forward is to first add a compile-time option (maybe with a simple abort at run time, e.g., just call ADX in the self-check), and then later add runtime autodetection. Modifying CryptOpt to be generic x86_64 sounds more complicated, and we'd probably leave performance on the table.
I wonder what would be necessary for reviewers to be convinced that the code is correct and what exact guarantees the Fiat-Crypto proofs provide.
Indeed. I wonder if @roconnor-blockstream would be interested in having a look at that too...
(Thinking out loud) Would it be possible to use KLEE to prove that the C code that comes out is always correct, independently of the Fiat-Crypto proofs? Actually, if that's feasible, perhaps we could try to do that right now with the current C code too.
I think a reasonable path forward is to first add a compile-time option (maybe with a simple abort at run time, e.g., just call ADX in the self-check), and then later add runtime autodetection.
That seems reasonable.
Modifying CryptoOpt to be generic x86_64 sounds more complicated, and we'd probably leave performance on the table.
I certainly wouldn't have non-ADCX-cryptopt-asm as the only x86_64 asm code in the library longer term; that seems like a waste. But if we don't, what do we do for non-ADCX systems in the runtime autodetection world?
@dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don't think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.
- Replacing the pure-C code with Fiat-Crypto generated C code.
I agree that we should add Fiat-Crypto first separately.
Will create a PR later which only replaces the 64-bit C code. We can then continue the discussion there.
maybe with a simple abort at run time, e.g., just call ADX in the self-check
That seems reasonable.
Would that be in src/selftest.h:29? Something like:
static int secp256k1_selftest_passes(void) {
int ret = 1;
#if defined(USE_ASM_X86_64)
__asm__ __volatile__("cpuid\n"
"mov %%rbx, %%rax\n"
"shr $19, %%rbx\n"
"shr $8, %%rax\n"
"and %%rbx, %%rax\n"
"and $1, %%rax\n"
: "=a"(ret)
: "a"(7), "c"(0)
: "rdx", "rbx");
#endif
return secp256k1_selftest_sha256() && ret;
}
(but falling back to the C implementation, or another less-optimized asm implementation, on those is ok, IMO)
Yes. According to my benchmarks, the C code is faster than the current asm implementation, so performance-wise it makes sense to fall back to that.
Modify CryptOpt to not use the new instructions,
As @real-or-random already pointed out, that is a bit more complex, see also Issue 143 in the CryptOpt repo, where I explain why.
No autodetection [...], but it won't see widespread use in this mode.
Yes, I agree, which would be sad. As far as I understand, the current autodetection assumes that the code is compiled for the native system, as it checks whether the local machine is x86_64. Hence, I thought it might as well also check whether the flags are available. The only situation in which this (semantically) fails is if I compile on a (newish) x86 machine and run it on an older x86 machine. That would currently still work, but with flag detection only in the build pipeline it would not. I wonder how likely that is.
Have runtime autodetection
Yes, this is quite common in other libraries too, like OpenSSL. If we have that, we could even go one step further and include one version for AMD and one for Intel, or even one for each generation (at some point I believe it's a bit too much to maintain, but as CryptOpt makes it very easy to generate such implementations, it becomes feasible at least :))
@dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don't think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.
That is a question for Owen and the Fiat-Crypto team. Looking at the current 32-bit implementation and its limbs, I think it's 10 limbs, and the last limb has a width of 22 bits.
When I call ./src/ExtractionOCaml/dettman_multiplication dettman 32 10 22 '2^256 - 4294968273'
It fails with
Computed bounds (Some [Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x800401ea7]]) are not tight enough (expected bounds not looser than (Some [Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x5fffff]])).
The bounds [0x0 ~> 0x800401ea7] are looser than the expected bounds [0x0 ~> 0x5fffff]
and a lot of intermediate code.
On the CryptOpt side, I have not tried incorporating 32-bit code yet. This is probably another engineering task, on the order of "get CryptOpt to generate code without ADX / BMI2", as everything is tailored to 64-bit registers and carry bits. But as it currently does not seem that urgent here either, this is not priority 1 on my list.
When I call
./src/ExtractionOCaml/dettman_multiplication dettman 32 10 22 '2^256 - 4294968273'
It fails with: Computed bounds (Some [Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x800401ea7]]) are not tight enough (expected bounds not looser than (Some [Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x5fffff]])). The bounds [0x0 ~> 0x800401ea7] are looser than the expected bounds [0x0 ~> 0x5fffff]
Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.
Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.
Thanks for looking into that, Owen. FWIW, I'm not 100% sure my invocation is correct; it was just a (somewhat educated) guess.
In inline asm blocks, all variable templates fall into one of these categories:
- Input variables
- Early-clobber output variables (`=&`)
- Output variables (`=`)
- Input-output variables (`+`)
- (not actually variables) explicitly clobbered registers (which won't be assigned to input or output variables)

Normal output variables may be assigned the same register as an input variable, but early-clobber output variables won't.
If there is a PR, I'm happy to look over it.
So I've created a branch `only-asm`, which features the asm implementation here.
Lines 202 and 203 are the instructions to restore `r` and `b` into `rdi` and `rdx`, respectively.
That is not always necessary, and it also did not run through the equivalence checker.
As these are sometimes dead operations, depending on when/how the code is used, it would be more elegant to tell CC that those registers are simply clobbered. I'm also unsure, with all the memory reads/writes, whether they are needed.
Now I'm having a hard time sorting `"D"(r)` and `"d"(b)` into your description above.
@dderjoel
Would that be in src/selftest.h:29? something like ...
I think @real-or-random meant something even simpler (just some asm code that uses adcx, which presumably crashes on non-supporting systems), but your suggestion is better (and already introduces detection logic which can later be moved from selftesting to autodetection once that is feasible).
Yes. According to my benchmarks, the C code is faster than the current asm implementation so performance wise, it does make sense to fallback to that.
Great.
As @real-or-random already pointed out, that is a bit more complex, see also https://github.com/0xADE1A1DE/CryptOpt/issues/143, where I explain why.
Ah, I see now why it's nontrivial to avoid those instructions. I assumed you already had support for mul/imul, and it was just a matter of disabling mulx... but I get why mul/imul are actually a lot harder to deal with than mulx.
In either case, we still do have the option of replacing the asm code with fiat-crypto-c-fed-through-c-compiler, if we find a particularly fast combination of compiler/version/options.
Yes, I agree, which would be sad. As far as I understand, the current autodetection assumes that the code is compiled for the native system, as it checks whether the local machine is x86_64. Hence, I thought it might as well also check whether the flags are available. The only situation in which this (semantically) fails is if I compile on a (newish) x86 machine and run it on an older x86 machine. That would currently still work, but with flag detection only in the build pipeline it would not. I wonder how likely that is.
I see where you're coming from now, but no, there is absolutely no assumption that the compilation is for the native system. You tell the libsecp256k1 build system what compiler suite to use, and it targets whatever that compiler is for. By default, that's your standard system compiler, which is most likely for the same architecture (x86/x86_64/arm32/aarch64/...) as the system you're running on, but there is no requirement for that. And even if it's the same architecture, it's not necessarily for your own system. In fact, I believe that there is a significant fraction of libsecp256k1 use that is built through cross-compiling (as compilation for another system than your own is called), as Bitcoin Core's release binaries (which include libsecp256k1) are created on a Linux x86_64 environment, cross-compiled for everything else (including arm32, aarch64, windows x86_64, mac osx, ...).
We absolutely can't do autodetection at compile or configure time. The options are manual configuration, or runtime autodetection.
If we have that, we could even go one step further and include one version for AMD and one for Intel, or even one for each generation (at some point I believe it's a bit too much to maintain, but as CryptOpt makes it very easy to generate such implementations, it becomes feasible at least :))
Indeed. That's the advantage of using formal methods, that they lower the review barrier significantly for having multiple options like this.
It'd also open the door for things like #967, which adds a field with a more standard 4x64 limb representation, but which is only faster with asm optimization (because it much more crucially relies on fast carries, which are hard to get the compiler to emit for pure C code).
This is probably another engineering task, on the order of "get CryptOpt to generate code without ADX / BMI2", as everything is tailored to 64-bit registers and carry bits. But as it currently does not seem that urgent here either, this is not priority 1 on my list.
32-bit x86 is all but dead; I absolutely wouldn't suggest you spend any time on asm-optimizing it.
The 32-bit C code is there for weak hardware, from hardware wallet to (old) Raspberry Pi like devices. But none of those are x86. If you're looking for more work, 32-bit ARM and (increasingly) 64-bit ARM are far more interesting targets than 32-bit x86.
@OwenConoly
Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.
Thank you for looking into it!
which presumably crashes on non-supporting systems
Yes, it should trigger a SIGILL in those cases.
Re: build:
Yes, I forgot about cross compilation and the releases.
In the `only-asm` branch, I've hooked the detection into the case where the build script is checking anyway, and it conservatively does not use the asm then. In other words, where it used to check if it could use x86_64, it now additionally checks if the flags are set.
However, now I see that this is not enough. In the case where one wants to build for generic x86_64, one would set `--with-asm=x86_64`, and this would result in code which needs the flags. So as an alternative, we could add switches `--has-adx` and `--has-bmi2` to `./configure`; then if `--with-asm=x86_64` is also set (regardless of whether it was auto-detected or explicit), the new asm can be used.
The question then is what the default for this should be (also auto-detect, or conservatively false?).
If we want runtime detection-and-selection, then I believe the functions must be non-inlined, and e.g. here we could check a global flag to decide which one to jump to: the ADX asm version or the C-compiled fallback.
Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.
Possibly relevant: the current 10x26 field mul/sqr leave their residual carry in limbs[2] (thus exceeding 26 bits), while the most-significant limb will be fully reduced (22 bits). The 5x52 (C code, not asm) leaves theirs in the most-significant limb (limbs[4]).
See this PR: https://github.com/bitcoin-core/secp256k1/pull/815 , in particular the first commit: https://github.com/bitcoin-core/secp256k1/pull/815/commits/9388b77b03c6e0c09a950e84ba34e9a2d8e0d433 . In that PR @sipa also anticipates that this may be better for fiat-crypto integration: https://github.com/bitcoin-core/secp256k1/pull/815#issuecomment-1540331544 .
I have previously mentioned a side-benefit to our magnitude tracking from rectifying this: https://github.com/bitcoin-core/secp256k1/issues/1060#issuecomment-1009647883 ("Secondly...")
Okay, I'm more than happy to do the work and let the core team here review, but I'd need some guidance in a few corners, because I'm not 100% sure what we want. Currently, I see two options:
- as in the `only-asm` branch (maybe it needs a more verbose error message.)

@dderjoel
I think eventually we want a runtime check and proper dispatch to the right version, having both compiled in. But at this stage, I think it's more interesting to just have a configure flag that enables either the asm or the C code (defaulting to the C code), plus a selfcheck.
This allows experimentation, benchmarking and testing, but avoids the need for whatever restructuring autodetection requires. Moreover, it makes it easier for someone other than you to make the autodetection happen, as all the actual code would already be in.
Just created a PR for exactly this. Can I trigger CI without removing the draft status? EDIT: never mind, it just took a minute to trigger itself.
I've re-run the sec_bench suite on the exact fork on my 10 test machines, and the results are still very similar to the ones before.
implementation | default_asm | default_c | default_c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 15.8059 | 14.9232 | 15.7527 | 15.0524 | 14.3984 |
ecmult_const | 30.0159 | 28.1553 | 30.2437 | 28.3729 | 26.7443 |
ecmult_1p | 23.7137 | 21.99 | 23.5329 | 22.3725 | 20.9531 |
ecmult_0p_g | 16.851 | 15.7264 | 16.8364 | 15.926 | 14.9667 |
ecmult_1p_g | 13.8695 | 12.8423 | 13.7058 | 12.9949 | 12.2609 |
field_half | 0.00246964 | 0.00245409 | 0.0024321 | 0.00240483 | 0.00248599 |
field_normalize | 0.00738854 | 0.00741554 | 0.0073913 | 0.00737073 | 0.00733137 |
field_normalize_weak | 0.00294516 | 0.00294647 | 0.00294284 | 0.0029337 | 0.00292048 |
field_sqr | 0.0141365 | 0.0125201 | 0.0137808 | 0.0123229 | 0.0118652 |
field_mul | 0.0170084 | 0.01443 | 0.0156856 | 0.014615 | 0.0145313 |
field_inverse | 1.40032 | 1.39717 | 1.39222 | 1.39458 | 1.61234 |
field_inverse_var | 0.912255 | 0.913256 | 0.908407 | 0.91439 | 0.909508 |
field_is_square_var | 1.194 | 1.20539 | 1.20361 | 1.20455 | 1.19346 |
field_sqrt | 3.87147 | 3.55153 | 3.83592 | 3.44752 | 3.34927 |
@dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don't think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.
An update on this: as Joel mentioned, using the same template we used for the 64-bit code did not work to generate the 32-bit code. There are two reasons that it didn't work.
I am curious whether there is a simple explanation (other than "it makes the bounds work out right") for why we need to start the 32-bit mul with this reduction from p17, but we don't need to start the 64-bit mul with a reduction from p7. Ideally, I'd like to be able to answer questions like "Hypothetically, if we were on a 16-bit machine (with a 20-limb implementation of this algorithm), then which partial product is our first reduction from? p36 (the third-highest partial product), perhaps?" Of course it isn't vital to answer questions like this, but it would be nice to view both the 32-bit mul and the 64-bit mul as different instances of the same general algorithm.
I overcame issue (2) above simply by creating a Coq implementation for the 32-bit mul which is separate from the implementation of the 64-bit mul. I followed along with the C implementation from the first commit 9388b77 of #815, which @peterdettman mentioned above.
Here's the C code I generated for the 32-bit mul and sqr: secp256k1_dettman_32.c. Would you be interested in including this code in the library as well? I assume we'd want some benchmarks before considering that too seriously, and I gather from the discussion in #815 that not many people have 32-bit machines on which to benchmark things like this.
Note that the generated code does not include the Karatsuba optimization from the third commit of #815. I think it would be interesting to add the Karatsuba multiplication to fiat-crypto, so I would be willing to add that optimization as well.
And the 32-bit code I generated does incorporate the optimization from #810. The handwritten 32-bit code from #815 also has this optimization.
I am curious whether there is a simple explanation (other than "it makes the bounds work out right") for why we need to start the 32-bit mul with this reduction from p17, but we don't need to start the 64-bit mul with a reduction from p7. Ideally, I'd like to be able to answer questions like "Hypothetically, if we were on a 16-bit machine (with a 20-limb implementation of this algorithm), then which partial product is our first reduction from? p36 (the third-highest partial product), perhaps?" Of course it isn't vital to answer questions like this, but it would be nice to view both the 32-bit mul and the 64-bit mul as different instances of the same general algorithm.
Well, it pretty much is just "it makes the bounds work out right", but the crux is which of the high partial products we can do last so that our residual carry lands neatly in the most-significant limb. We then start one higher than that.
For 10x26: p18 produces a 91 bit reduced result that doesn't fit in r8, r9 i.e. 49 bits. p17 produces a 95 bit reduced result that doesn't fit in r7, r8, r9 i.e. 75 bits. p16 produces a 98 bit reduced result that does fit in r6, r7, r8, r9 i.e. 101 bits.
So our best option for the last partial products step is p6/p16, and a final carry chain landing in r9 as desired.
For 5x52: p8 produces a 143 bit reduced result that doesn't fit in r3, r4 i.e. 101 bits. p7 produces a 147 bit reduced result that does fit in r2, r3, r4 i.e. 153 bits.
So our best option for the last partial products step is p2/p7, and a final carry chain landing in r4 as desired.
For our field there's (conservatively) an extra bit coming from the lower partial product in each pair and the carry-in (that I ignored above), but for a field with a prime very (very) close to a power of 2, the lower partial product would need to be included in the analysis properly.
Regarding 16-bit, I doubt a practical algorithm would try to follow the pattern of these implementations. If a reduced radix is even tenable I would guess it would have to use Karatsuba, perhaps with 20 limbs in four groups of 64 bits: (13 13 13 13 12). (Note that this is already incompatible with the current assumption of 4 extra bits for carry-free linear field ops).
Also, +1 in principle to including also the 32-bit code.
fiat-crypto can generate verified field code for multiple targets, e.g., C and x86_64 asm. It has algorithm templates for our mul and sqr algorithms (under the name "Dettman") for secp256k1's field. But the current implementation misses at least one optimization that we have in our C code (but not in asm), namely the one from #810.
CryptOpt can optimize fiat-crypto's output, producing significantly faster asm while retaining the formal guarantees.
A plausible route towards using these projects in this library is: