real-or-random commented 1 year ago

fiat-crypto can generate verified field code for multiple targets, e.g., C and x86_64 asm. It has algorithm templates for our mul and sqr algorithm (under the name "Dettman") for secp256k1's field. But the current implementation misses at least one optimization that we have in our C code (but not asm), namely the one #810.

CryptOpt can optimize fiat-crypto's output, producing significantly faster asm while retaining the formal guarantees.

A plausible route towards using these projects in this library is:

[x] Optimize the algorithm template in fiat-crypto code further by implementing the optimization from #810 and perhaps further refinements. This is tracked in https://github.com/mit-plv/fiat-crypto/issues/1582. This would ideally be done by the fiat-crypto folks (in particular @OwenConoly who contributed the initial implementation).
[ ] If fiat-crypto's C output is fast enough (ideally on par with our current C code), replace the current C code by fiat-crypto's C code.
[ ] If CryptoOpt provides a significant enough further speedup, replace the current x86_64 asm by fiat-crypto + CryptOpt x86_64 asm. Since the current output of CryptOpt is not fully generic x86_64 asm, we need either of these two:
- [ ] Add CPU id on our side
- [ ] Modify CryptOpt so that it uses only instructions available on all x86_64 CPUs

dderjoel commented 1 year ago

Hi everyone, It took me a bit but I now have some numbers.

I've used those nine machines	CPU	𝜇-architecture
Intel Core i7-6770HQ	Skylake-H
Intel Core i7-10710U	Comet Lake-U
Intel Core i9-10900K	Comet Lake-S
Intel Core i7-11700KF	Rocket Lake-S
Intel Core i9-13900KF	Raptor Lake-S
AMD Ryzen Theadr. 1900X	Zen 1
AMD Ryzen 7 5800X	Zen 3
AMD Ryzen 9 5950X	Zen 3
AMD Ryzen 9 7950X	Zen 4

I've created a little project here, which takes this library as the base. For default_asm, it compiles with ./configure --with-asm=x86_64 using the current asm implementation. For default_c, it compiles with ./configure --with-asm=no using the current c implementation. For default_c52, it replaces the c-implementation with the version before the PR #810 , then compiles with ./configure --with-asm=no. I've included that to compare that with fiat_c. For fiat_c, it replaces the c implementation with the Fiat Cryptography implementation (which does not feature the #810 optimization.) For fiat_cryptopt, it uses Fiat as a starting point, then optimizes mul and square on all those machines, and I've included the 'on average best' one. So on average, we see that for the ecmult_gen the fiat_c is more performant than the default_asm and fiat_cryptopt is even faster than that and the current default_c.

Numbers for the secp256k1 benchmarks ./bench_internal simple and ./bench_ecmult (I've commented out the multi_benches to succeed in a timely manner.)

Geometric mean over nine machines (per-machine tables at the end of the post)

implementation	default_asm	default_c	default_c52	fiat_c	fiat_cryptopt
ecmult_gen	24.6435	23.2049	24.7301	24.1802	22.8905
ecmult_const	46.798	43.8816	47.4836	45.7434	42.6387
ecmult_1p	37.0311	34.1909	37.06	35.7762	33.1715
ecmult_0p_g	25.3775	23.7174	26.1944	25.0963	23.3779
ecmult_1p_g	21.6629	19.9417	21.5685	20.8703	19.412
field_half	0.00333343	0.00333149	0.00334944	0.00342343	0.0033374
field_normalize	0.0112975	0.0113496	0.0113459	0.0112799	0.0112925
field_normalize_weak	0.00447881	0.00451408	0.00449407	0.00450251	0.00453795
field_sqr	0.0215233	0.0194716	0.0213214	0.0189657	0.0190589
field_mul	0.0258206	0.0219956	0.0244374	0.0240141	0.0226145
field_inverse	2.13491	2.13158	2.13192	2.13569	2.14261
field_inverse_var	1.39323	1.39281	1.3899	1.39515	1.39883
field_is_square_var	1.82198	1.83353	1.84074	1.83197	1.82702
field_sqrt	5.94338	5.4591	5.88598	5.29186	5.28916

In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I've included implementations with the same mul/square.. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below) I've included one particular implementation which I ran on all machines. The numbers are cycles for four scalar multiplications (two base point, two variable point). We see that on GM, the fiat_cryptopt is again the most performant one.

Implementation	Lang.	1900X	5800X	5950X	7950X	i7 6G	i7 10G	i9 10G	i7 11G	i9 13G	G.M.
(`default_asm`) libsecp256k1 [Bitcoin Core 2021]	asm	720k (1.07x)	567k (1.05x)	567k (1.05x)	560k (1.06x)	565k (1.06x)	564k (1.06x)	564k (1.06x)	539k (1.08x)	439k (1.06x)	561k (1.06x)
libsecp256k1+Best (Opt)	asm	712k (1.06x)	545k (1.01x)	545k (1.01x)	543k (1.02x)	541k (1.01x)	541k (1.02x)	541k (1.02x)	512k (1.03x)	448k (1.08x)	544k (1.02x)
(`default_c52`) libsecp256k1 [Bitcoin Core 2021] (w/o 810)	C	699k (1.04x)	561k (1.04x)	559k (1.04x)	546k (1.03x)	575k (1.08x)	574k (1.08x)	572k (1.08x)	538k (1.08x)	454k (1.09x)	561k (1.06x)
(`default_c`) libsecp256k1 [Bitcoin Core 2021] (w/ 810)	C	671k (1.00x)	538k (1.00x)	538k (1.00x)	530k (1.00x)	561k (1.05x)	563k (1.06x)	562k (1.06x)	508k (1.02x)	435k (1.05x)	542k (1.02x)
(`fiat_c`) libsecp256k1 + Fiat-C	C	671k (1.00x)	541k (1.00x)	540k (1.00x)	537k (1.01x)	551k (1.03x)	551k (1.04x)	550k (1.04x)	508k (1.02x)	425k (1.02x)	538k (1.01x)
(`fiat_cryptopt`) libsecp256k1+Best (Fiat)	asm	703k (1.05x)	539k (1.00x)	539k (1.00x)	536k (1.01x)	533k (1.00x)	530k (1.00x)	531k (1.00x)	498k (1.00x)	415k (1.00x)	532k (1.00x)

Intel Core i7-6770HQ

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	29.3	29.1	30.2	30.1	27.1
ecmult_const	54.9	55.0	57.6	57.0	49.6
ecmult_1p	42.8	42.7	44.8	44.3	38.3
ecmult_0p_g	29.3	29.6	31.7	31.3	27.0
ecmult_1p_g	25.1	24.9	26.1	25.9	22.5
field_half	0.00425	0.00425	0.00429	0.00429	0.00424
field_normalize	0.0143	0.0144	0.0145	0.0143	0.0144
field_normalize_weak	0.00566	0.00585	0.00566	0.00566	0.00584
field_sqr	0.0243	0.0237	0.0242	0.0218	0.0214
field_mul	0.0302	0.0266	0.0285	0.0305	0.0253
field_inverse	2.70	2.69	2.69	2.70	2.68
field_inverse_var	1.82	1.80	1.80	1.80	1.80
field_is_square_var	2.31	2.34	2.36	2.30	2.32
field_sqrt	6.79	6.50	6.58	6.10	5.87

Intel Core i7-10710U

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	70.0	69.3	72.1	71.7	64.7
ecmult_const	131.0	131.0	137.0	136.0	118.0
ecmult_1p	102.0	102.0	107.0	106.0	91.5
ecmult_0p_g	70.0	70.9	74.7	73.6	63.4
ecmult_1p_g	59.9	59.6	62.4	61.8	53.8
field_half	0.00997	0.00993	0.0101	0.0105	0.00994
field_normalize	0.0344	0.0344	0.0344	0.0344	0.0345
field_normalize_weak	0.0132	0.0137	0.0132	0.0132	0.0137
field_sqr	0.0576	0.0559	0.0572	0.0515	0.0507
field_mul	0.0714	0.0629	0.0722	0.0676	0.0599
field_inverse	6.16	6.16	6.11	6.15	6.15
field_inverse_var	4.33	4.32	4.32	4.36	4.32
field_is_square_var	5.65	5.59	5.63	5.68	5.59
field_sqrt	16.2	15.5	15.7	14.5	14.0

Intel Core i9-10900K

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	20.5	20.4	21.1	21.1	19.0
ecmult_const	38.4	38.5	40.3	39.8	34.7
ecmult_1p	30.0	29.9	31.3	31.0	26.8
ecmult_0p_g	20.6	20.7	21.8	21.5	18.5
ecmult_1p_g	17.6	17.4	18.3	18.1	15.7
field_half	0.00296	0.00296	0.00297	0.00293	0.00298
field_normalize	0.0101	0.0101	0.0101	0.0101	0.0101
field_normalize_weak	0.00393	0.00393	0.00392	0.00393	0.00392
field_sqr	0.0171	0.0166	0.0171	0.0153	0.0151
field_mul	0.0212	0.0187	0.0201	0.0201	0.0178
field_inverse	1.80	1.80	1.80	1.80	1.80
field_inverse_var	1.27	1.27	1.27	1.27	1.26
field_is_square_var	1.62	1.65	1.64	1.62	1.62
field_sqrt	4.75	4.54	4.62	4.26	4.10

Intel Core i7-11700KF

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	19.9	17.8	19.3	18.8	17.8
ecmult_const	38.3	34.2	37.1	35.4	33.4
ecmult_1p	29.9	26.4	28.9	27.3	26.0
ecmult_0p_g	20.9	18.4	20.1	18.8	18.1
ecmult_1p_g	17.5	15.5	16.7	15.9	15.3
field_half	0.00304	0.00303	0.00303	0.00303	0.00305
field_normalize	0.0102	0.0105	0.0104	0.0101	0.0101
field_normalize_weak	0.00394	0.00394	0.00410	0.00415	0.00416
field_sqr	0.0195	0.0171	0.0181	0.0170	0.0164
field_mul	0.0225	0.0182	0.0191	0.0196	0.0203
field_inverse	1.89	1.89	1.89	1.89	1.97
field_inverse_var	1.22	1.22	1.21	1.22	1.25
field_is_square_var	1.67	1.67	1.68	1.67	1.68
field_sqrt	5.35	4.69	5.12	4.51	4.65

Intel Core i9-13900KF

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	19.5	17.8	18.6	17.9	17.6
ecmult_const	38.4	34.3	36.2	34.6	33.8
ecmult_1p	30.7	27.3	28.8	27.5	26.8
ecmult_0p_g	21.4	18.8	20.0	19.1	18.6
ecmult_1p_g	18.0	16.0	16.8	16.1	15.7
field_half	0.00330	0.00332	0.00337	0.00335	0.00333
field_normalize	0.0109	0.0110	0.0110	0.0109	0.0109
field_normalize_weak	0.00452	0.00451	0.00449	0.00450	0.00451
field_sqr	0.0191	0.0169	0.0193	0.0167	0.0177
field_mul	0.0256	0.0202	0.0210	0.0196	0.0191
field_inverse	2.13	2.12	2.13	2.13	2.13
field_inverse_var	1.37	1.37	1.37	1.36	1.38
field_is_square_var	1.88	1.87	1.87	1.88	1.88
field_sqrt	5.43	4.97	5.56	4.82	5.04

AMD Ryzen Theadripper 1900X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	25.7	23.5	26.0	24.4	25.1
ecmult_const	50.9	45.9	51.4	47.4	48.5
ecmult_1p	39.6	35.5	40.0	37.1	37.1
ecmult_0p_g	27.5	24.7	28.5	26.3	26.6
ecmult_1p_g	23.0	20.6	23.2	21.6	21.5
field_half	0.00270	0.00270	0.00271	0.00293	0.00269
field_normalize	0.0104	0.0104	0.0104	0.0104	0.0104
field_normalize_weak	0.00365	0.00365	0.00365	0.00365	0.00365
field_sqr	0.0216	0.0202	0.0224	0.0198	0.0202
field_mul	0.0267	0.0235	0.0263	0.0246	0.0244
field_inverse	2.04	2.04	2.05	2.05	2.04
field_inverse_var	1.21	1.22	1.21	1.22	1.21
field_is_square_var	1.51	1.50	1.53	1.51	1.56
field_sqrt	5.96	5.60	6.10	5.50	5.54

AMD Ryzen 5800X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	20.6	19.2	20.7	20.3	19.3
ecmult_const	38.2	35.2	39.3	37.6	35.5
ecmult_1p	31.4	27.5	30.6	29.5	27.7
ecmult_0p_g	20.9	19.1	21.6	20.5	19.4
ecmult_1p_g	18.4	16.0	17.8	17.2	16.2
field_half	0.00252	0.00251	0.00251	0.00264	0.00251
field_normalize	0.00818	0.00818	0.00818	0.00819	0.00818
field_normalize_weak	0.00343	0.00343	0.00343	0.00343	0.00343
field_sqr	0.0180	0.0147	0.0177	0.0152	0.0157
field_mul	0.0199	0.0167	0.0194	0.0195	0.0191
field_inverse	1.58	1.58	1.58	1.58	1.58
field_inverse_var	1.03	1.03	1.03	1.03	1.03
field_is_square_var	1.37	1.38	1.38	1.38	1.36
field_sqrt	4.72	4.23	4.83	4.30	4.35

AMD Ryzen 5950X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	23.1	21.4	23.2	22.7	21.6
ecmult_const	42.6	39.3	43.9	42.0	39.6
ecmult_1p	34.9	30.7	34.1	32.9	30.9
ecmult_0p_g	23.1	21.2	24.7	23.4	22.3
ecmult_1p_g	20.4	17.9	19.9	19.2	18.1
field_half	0.00280	0.00280	0.00280	0.00294	0.00280
field_normalize	0.00914	0.00912	0.00913	0.00913	0.00913
field_normalize_weak	0.00382	0.00383	0.00382	0.00382	0.00382
field_sqr	0.0200	0.0164	0.0197	0.0169	0.0176
field_mul	0.0221	0.0188	0.0216	0.0216	0.0213
field_inverse	1.77	1.76	1.76	1.77	1.77
field_inverse_var	1.15	1.15	1.15	1.15	1.18
field_is_square_var	1.52	1.54	1.54	1.54	1.52
field_sqrt	5.26	4.71	5.39	4.80	4.85

AMD Ryzen 7950X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	16.8	15.5	16.8	16.4	15.8
ecmult_const	32.0	29.2	32.4	31.0	29.8
ecmult_1p	25.1	22.7	25.3	24.4	23.4
ecmult_0p_g	17.4	15.8	18.4	17.6	17.0
ecmult_1p_g	14.6	13.2	14.7	14.2	13.6
field_half	0.00212	0.00212	0.00212	0.00212	0.00213
field_normalize	0.00698	0.00697	0.00696	0.00695	0.00696
field_normalize_weak	0.00290	0.00290	0.00290	0.00290	0.00290
field_sqr	0.0143	0.0130	0.0141	0.0128	0.0125
field_mul	0.0165	0.0142	0.0170	0.0161	0.0149
field_inverse	1.34	1.34	1.34	1.34	1.34
field_inverse_var	0.825	0.827	0.825	0.838	0.825
field_is_square_var	1.06	1.09	1.09	1.09	1.06
field_sqrt	4.12	3.62	3.93	3.50	3.51

dderjoel commented 1 year ago

In regards to merging this, In the I've included the implementations as .c files. And the methods are not static. In the case for the libsecp-benchmarks I've added the implementation as a c-source to the bench{,_internal,_ecmult}_SOURCES targets (see here).

Side note, I wanted to include the implementations as .asm files first (as those are formally verified) but then I think we would be in trouble once we compile for non System V- ABI's. So I've used nasm to assemble the asm files and objdump to dump the code in at&t syntax (afaik, otherwise clang complains sometimes) and include them into the c-source. With that, I hope the compiler is able (if needed) swap registers of the arguments if one compiles for non System V. And when I tried to include the methods as inline static in .h files, any optimization-level of the compiler resulted in segfaults, because the registers were not set correctly. (I'm too unfamiliar with this inline-asm notation).

In regards to "tell cryptopt to emit general code (i.e. no mulx, adox, adcx) That is not too trivial. I'll see what I can do, but I don't think we should expect anything here too soon.

real-or-random commented 1 year ago

Thanks, you have a nice benchmark suite. :) Interestingly, fiat_c beats default_c on SUPERCOP, but not in the other benchmarks. Any idea why this is the case?

Anyway, I don't think that any of this should change the plan sketched above. Since we're not in a hurry, it would be awesome to have https://github.com/mit-plv/fiat-crypto/issues/1582 solved first.

In regards to "tell cryptopt to emit general code (i.e. no mulx, adox, adcx) That is not too trivial. I'll see what I can do, but I don't think we should expect anything here too soon.

Oh okay, we assumed it's a simple change. Anyway, should we decide that we want asm, which is anyway CPU-specific, then I think it makes sense to go 100% for it and use CPUID.

dderjoel commented 1 year ago

fiat_c beats default_c on SUPERCOP, but not in the other benchmarks. Any idea why this is the case?

SUPERCOP tries many different compiler / compiler settings (see below) , whereas for the ./bench_{} I've just used the default flags (see below output of ./configure). I'm happy to do that with other compilers, too, if you think that's useful.

Oh okay, we assumed it's a simple change. Anyway, should we decide that we want asm, which is anyway CPU-specific, then I think it makes sense to go 100% for it and use CPUID.

I'll keep it in mind. I'm a bit divided on this though. CPUs since Oct '14 (Broadwell) support BMI2 / ADX. So it feel's wrong to not use them. Would be interesting to see what the usage of this library is distributed... On the other hand I do understand to be as flexible as possible.

I'm using Clang 15 and GCC 12.

Supercop compiler settings

gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE
clang -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -Os -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -O2 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -march=native -O -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE

output of `./configure` on i7-11700K, `./fiat_cryptopt`

Build Options:
  with external callbacks = no
  with benchmarks         = yes
  with tests              = yes
  with ctime tests        = yes
  with coverage           = no
  with examples           = no
  module ecdh             = yes
  module recovery         = no
  module extrakeys        = yes
  module schnorrsig       = yes

  asm                     = x86_64
  ecmult window size      = 15
  ecmult gen prec. bits   = 4

  valgrind                = yes
  CC                      = gcc
  CPPFLAGS                =  
  SECP_CFLAGS             = -O2  -std=c89 -pedantic -Wno-long-long -Wnested-externs -Wshadow -Wstrict-prototypes -Wundef -Wno-overlength-strings -Wall -Wno-unused-function -Wextra -Wcast-align -Wcast-align=strict -fvisibility=hidden 
  CFLAGS                  = -g -O2
  LDFLAGS                 =

real-or-random commented 1 year ago

SUPERCOP tries many different compiler / compiler settings (see below) , whereas for the ./bench_{} I've just used the default flags (see below output of ./configure).

Makes sense.

I'm happy to do that with other compilers, too, if you think that's useful.

I'm not convinced that it will provide much more insight right now.

CPUs since Oct '14 (Broadwell) support BMI2 / ADX. So it feel's wrong to not use them. Would be interesting to see what the usage of this library is distributed...

Yep, exactly my thoughts. If you really have an older CPU, the C code will serve you well.

jonasnick commented 1 year ago

Concept ACK on investigating fiat-crypto and (longer term) cryptopt. Would be great to have a verified implementation that's only slightly slower or even faster.

Thanks @dderjoel for the benchmarks.

dderjoel commented 1 year ago

Owen implemented #810 in the Fiat repo (see PR https://github.com/mit-plv/fiat-crypto/issues/1582), it's not yet fully merged, but I could not wait to benchmark it.

I've updated secp_bench and supercop with the new C and CryptOpt-optimized versions.

I've also added our 12th Gen machine back into the benchmark suite, here the updated numbers (copied post from above, indicating changes):

I've used those ~nine~ ten machines	CPU	𝜇-architecture
Intel Core i7-6770HQ	Skylake-H
Intel Core i7-10710U	Comet Lake-U
Intel Core i9-10900K	Comet Lake-S
Intel Core i7-11700KF	Rocket Lake-S
Intel Core i9-12900KF	Alder Lake-S
Intel Core i9-13900KF	Raptor Lake-S
AMD Ryzen Theadr. 1900X	Zen 1
AMD Ryzen 7 5800X	Zen 3
AMD Ryzen 9 5950X	Zen 3
AMD Ryzen 9 7950X	Zen 4

I've created a little project here, which takes this library as the base. For default_asm, it compiles with ./configure --with-asm=x86_64 using the current asm implementation. For default_c, it compiles with ./configure --with-asm=no using the current c implementation. For default_c52, it replaces the c-implementation with the version before the PR #810 , then compiles with ./configure --with-asm=no. I've included that to compare that with fiat_c. For fiat_c, it replaces the c implementation with the ~Fiat Cryptography implementation (which does not feature the #810 optimization.)~ the Merge-in-progess Fiat Cryptography implementation, now featuring the #810 optimization. For fiat_cryptopt, it uses Fiat as a starting point, then optimizes mul and square on all those machines, and I've included the 'on average best' one. (implicitly also now using the optimization) So on average, we see that for the ecmult_gen the fiat_c is more performant than the default_asm and slightly more performant than default_c (plus giving formal assurance) and fiat_cryptopt is ~even faster than that and the current default_c.~ the fastest (plus formal assurance) .

Numbers for the secp256k1 benchmarks ./bench_internal simple and ./bench_ecmult (I've commented out the multi_benches to succeed in a timely manner.)

Geometric mean over ~nine~ ten machines (per-machine tables at the end of the post)

implementation	default_asm	default_c	default_c52	fiat_c	fiat_cryptopt
ecmult_gen	16.1257	14.9874	15.8764	14.9714	14.1071
ecmult_const	30.3317	28.2041	30.3634	28.1846	26.1778
ecmult_1p	23.9575	22.0809	23.641	22.1608	20.5615
ecmult_0p_g	16.9942	15.8604	16.9429	15.9223	14.7684
ecmult_1p_g	14.0591	12.8636	13.7521	12.9906	12.0012
field_half	0.0024964	0.0024537	0.00237626	0.0023697	0.00237259
field_normalize	0.00737954	0.00744617	0.00736519	0.00732155	0.00733252
field_normalize_weak	0.00294988	0.00295974	0.00293822	0.00295422	0.00294591
field_sqr	0.014059	0.0126383	0.0137789	0.0123009	0.0119273
field_mul	0.0170241	0.0144868	0.0156579	0.014689	0.0144257
field_inverse	1.40048	1.40896	1.39608	1.38928	1.3986
field_inverse_var	0.915942	0.919022	0.912739	0.907769	0.910547
field_is_square_var	1.20313	1.21123	1.20451	1.19417	1.19385
field_sqrt	3.87547	3.56868	3.82354	3.44445	3.35331

In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I've included implementations with the same mul/square.. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below) I've included one particular implementation which I ran on all machines. The numbers are cycles for four scalar multiplications (two base point, two variable point). We see that on GM, the fiat_cryptopt is again the most performant one. (In fact everywhere except on 1900X, where the Fiat-C is the most performant one)

Implementation	Lang.	1900X	5800X	5950X	7950X	i7 6G	i7 10G	i9 10G	i7 11G	i9 12G	i9 13G	G.M.
(`default_asm`) libsecp256k1 [Bitcoin Core 2021]	asm	719k (1.11x)	571k (1.11x)	567k (1.10x)	560k (1.11x)	565k (1.07x)	565k (1.08x)	564k (1.08x)	540k (1.13x)	438k (1.09x)	439k (1.11x)	548k (1.09x)
libsecp256k1+Best (Opt)	asm	712k (1.10x)	543k (1.06x)	545k (1.06x)	544k (1.08x)	541k (1.03x)	541k (1.03x)	542k (1.04x)	512k (1.07x)	449k (1.12x)	447k (1.13x)	533k (1.07x)
(`default_c52`) libsecp256k1 [Bitcoin Core 2021] (w/o 810)	C	699k (1.08x)	560k (1.09x)	561k (1.09x)	547k (1.09x)	574k (1.09x)	574k (1.10x)	573k (1.10x)	536k (1.12x)	459k (1.14x)	456k (1.15x)	550k (1.10x)
(`default_c`) libsecp256k1 [Bitcoin Core 2021] (w/ 810)	C	671k (1.04x)	538k (1.05x)	537k (1.04x)	531k (1.06x)	567k (1.08x)	563k (1.07x)	562k (1.08x)	504k (1.06x)	440k (1.09x)	439k (1.11x)	531k (1.06x)
(`fiat_c`) libsecp256k1 + Fiat-C	C	648k (1.00x)	537k (1.05x)	536k (1.04x)	536k (1.07x)	549k (1.04x)	546k (1.04x)	548k (1.05x)	492k (1.03x)	418k (1.04x)	418k (1.05x)	519k (1.04x)
(`fiat_cryptopt`) libsecp256k1+Best (Fiat)	asm	675k (1.04x)	514k (1.00x)	515k (1.00x)	503k (1.00x)	526k (1.00x)	524k (1.00x)	523k (1.00x)	477k (1.00x)	402k (1.00x)	397k (1.00x)	500k (1.00x)

Intel Core i7-6770HQ

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	21.9	21.6	22.4	22.2	20.0
ecmult_const	40.7	40.8	42.7	41.7	36.4
ecmult_1p	31.8	31.7	33.2	32.4	28.1
ecmult_0p_g	22.7	22.9	24.0	23.4	20.2
ecmult_1p_g	18.6	18.5	19.4	18.9	16.5
field_half	0.00470	0.00441	0.00364	0.00436	0.00378
field_normalize	0.0106	0.0108	0.0107	0.0107	0.0107
field_normalize_weak	0.00420	0.00420	0.00421	0.00422	0.00421
field_sqr	0.0184	0.0175	0.0180	0.0169	0.0162
field_mul	0.0224	0.0198	0.0213	0.0221	0.0181
field_inverse	1.99	2.00	2.01	1.99	2.00
field_inverse_var	1.36	1.34	1.34	1.34	1.34
field_is_square_var	1.73	1.74	1.74	1.72	1.73
field_sqrt	5.04	4.83	4.90	4.66	4.44

Intel Core i7-10710U

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	19.8	19.3	20.1	19.7	17.8
ecmult_const	35.8	35.7	37.4	36.4	31.9
ecmult_1p	28.1	28.2	29.4	28.6	24.9
ecmult_0p_g	20.3	20.6	21.4	21.0	18.2
ecmult_1p_g	16.5	16.4	17.2	16.8	14.6
field_half	0.00327	0.00370	0.00356	0.00280	0.00368
field_normalize	0.00956	0.00956	0.00932	0.00954	0.00932
field_normalize_weak	0.00373	0.00373	0.00364	0.00387	0.00364
field_sqr	0.0167	0.0157	0.0157	0.0151	0.0144
field_mul	0.0201	0.0177	0.0188	0.0186	0.0163
field_inverse	1.71	1.72	1.72	1.71	1.72
field_inverse_var	1.21	1.21	1.21	1.22	1.22
field_is_square_var	1.60	1.54	1.56	1.57	1.56
field_sqrt	4.50	4.22	4.30	4.09	3.90

Intel Core i9-10900K

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	14.9	14.8	15.3	15.1	13.6
ecmult_const	27.9	27.9	29.5	28.5	24.9
ecmult_1p	21.8	21.7	22.7	22.2	19.3
ecmult_0p_g	15.0	15.2	15.8	15.4	13.3
ecmult_1p_g	12.8	12.6	13.3	13.0	11.3
field_half	0.00222	0.00218	0.00218	0.00215	0.00218
field_normalize	0.00739	0.00731	0.00731	0.00733	0.00731
field_normalize_weak	0.00285	0.00285	0.00286	0.00285	0.00285
field_sqr	0.0125	0.0120	0.0123	0.0115	0.0111
field_mul	0.0154	0.0136	0.0145	0.0142	0.0125
field_inverse	1.31	1.31	1.31	1.31	1.31
field_inverse_var	0.920	0.920	0.919	0.920	0.917
field_is_square_var	1.18	1.19	1.19	1.17	1.17
field_sqrt	3.44	3.30	3.35	3.18	3.04

Intel Core i7-11700KF

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	17.1	14.4	15.1	13.1	12.7
ecmult_const	31.5	27.3	28.0	24.5	23.5
ecmult_1p	23.7	21.2	20.6	19.5	19.5
ecmult_0p_g	16.6	14.7	14.4	14.3	13.6
ecmult_1p_g	14.0	12.2	12.1	11.6	11.4
field_half	0.00239	0.00237	0.00240	0.00223	0.00227
field_normalize	0.00779	0.00837	0.00805	0.00730	0.00740
field_normalize_weak	0.00320	0.00327	0.00325	0.00296	0.00313
field_sqr	0.0155	0.0136	0.0142	0.0125	0.0114
field_mul	0.0175	0.0145	0.0150	0.0128	0.0144
field_inverse	1.48	1.50	1.48	1.36	1.36
field_inverse_var	0.951	0.969	0.961	0.877	0.877
field_is_square_var	1.32	1.33	1.32	1.20	1.20
field_sqrt	4.18	3.74	3.92	3.37	3.15

Intel Core i9-12900KF

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	11.6	10.6	11.0	10.4	9.96
ecmult_const	22.6	20.2	21.3	19.9	18.9
ecmult_1p	18.1	16.0	17.0	15.8	15.0
ecmult_0p_g	13.5	11.8	12.5	11.6	11.9
ecmult_1p_g	10.6	9.37	9.92	9.23	8.80
field_half	0.00197	0.00198	0.00196	0.00197	0.00199
field_normalize	0.00638	0.00626	0.00639	0.00641	0.00637
field_normalize_weak	0.00266	0.00261	0.00264	0.00266	0.00266
field_sqr	0.0114	0.00948	0.0114	0.00946	0.00978
field_mul	0.0152	0.0115	0.0122	0.0113	0.0122
field_inverse	1.25	1.23	1.25	1.25	1.25
field_inverse_var	0.807	0.794	0.814	0.807	0.814
field_is_square_var	1.10	1.08	1.10	1.10	1.10
field_sqrt	3.26	2.91	3.25	2.82	2.84

Intel Core i9-13900KF

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	10.7	9.23	9.61	9.10	8.70
ecmult_const	20.1	17.7	18.7	17.5	16.6
ecmult_1p	15.9	14.1	14.9	13.9	13.2
ecmult_0p_g	11.4	9.95	10.5	9.66	9.18
ecmult_1p_g	9.30	8.23	8.72	8.12	7.74
field_half	0.00183	0.00183	0.00171	0.00178	0.00183
field_normalize	0.00562	0.00602	0.00571	0.00568	0.00595
field_normalize_weak	0.00235	0.00247	0.00233	0.00233	0.00247
field_sqr	0.00978	0.00932	0.00986	0.00837	0.00905
field_mul	0.0132	0.0113	0.0107	0.0102	0.0113
field_inverse	1.10	1.16	1.10	1.10	1.16
field_inverse_var	0.705	0.747	0.709	0.707	0.745
field_is_square_var	0.970	1.02	0.967	0.972	1.02
field_sqrt	2.80	2.71	2.88	2.44	2.63

AMD Ryzen Theadripper 1900X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	23.6	21.7	23.9	22.1	22.5
ecmult_const	46.6	42.1	47.1	42.7	43.4
ecmult_1p	36.1	32.4	36.4	33.0	32.9
ecmult_0p_g	25.8	23.4	26.2	23.6	23.2
ecmult_1p_g	21.0	18.9	21.2	19.2	19.1
field_half	0.00264	0.00273	0.00270	0.00270	0.00275
field_normalize	0.00960	0.00941	0.00962	0.00942	0.00942
field_normalize_weak	0.00330	0.00332	0.00333	0.00355	0.00332
field_sqr	0.0196	0.0184	0.0204	0.0175	0.0172
field_mul	0.0241	0.0214	0.0240	0.0219	0.0218
field_inverse	1.82	1.86	1.84	1.84	1.86
field_inverse_var	1.11	1.11	1.11	1.10	1.10
field_is_square_var	1.36	1.39	1.40	1.39	1.41
field_sqrt	5.38	5.08	5.56	4.82	4.77

AMD Ryzen 5800X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	16.3	15.4	16.8	15.6	14.5
ecmult_const	30.1	28.1	31.9	28.8	26.6
ecmult_1p	24.7	22.0	24.8	22.8	20.8
ecmult_0p_g	16.5	15.5	17.5	15.7	14.5
ecmult_1p_g	14.6	12.9	14.4	13.2	12.1
field_half	0.00255	0.00219	0.00221	0.00238	0.00199
field_normalize	0.00661	0.00646	0.00647	0.00655	0.00647
field_normalize_weak	0.00275	0.00270	0.00270	0.00272	0.00270
field_sqr	0.0142	0.0117	0.0140	0.0124	0.0115
field_mul	0.0161	0.0136	0.0155	0.0144	0.0142
field_inverse	1.28	1.26	1.25	1.27	1.26
field_inverse_var	0.833	0.822	0.815	0.826	0.816
field_is_square_var	1.10	1.09	1.09	1.10	1.07
field_sqrt	3.81	3.35	3.84	3.47	3.30

AMD Ryzen 5950X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	16.1	15.5	16.7	15.3	14.6
ecmult_const	30.6	28.4	31.6	28.1	26.7
ecmult_1p	25.4	22.4	25.0	22.1	20.8
ecmult_0p_g	17.7	16.5	18.0	16.3	15.3
ecmult_1p_g	15.0	13.1	14.1	13.2	12.2
field_half	0.00223	0.00221	0.00224	0.00224	0.00218
field_normalize	0.00649	0.00642	0.00642	0.00643	0.00645
field_normalize_weak	0.00270	0.00267	0.00267	0.00267	0.00267
field_sqr	0.0142	0.0115	0.0139	0.0121	0.0114
field_mul	0.0158	0.0132	0.0154	0.0140	0.0140
field_inverse	1.26	1.24	1.24	1.25	1.25
field_inverse_var	0.819	0.808	0.817	0.816	0.811
field_is_square_var	1.08	1.08	1.08	1.08	1.07
field_sqrt	3.75	3.33	3.81	3.39	3.27

AMD Ryzen 7950X

bench	asm	c	c52	fiat_c	fiat_cryptopt
ecmult_gen	14.0	12.7	13.8	13.1	12.3
ecmult_const	26.4	23.9	26.7	24.8	23.0
ecmult_1p	20.7	18.6	20.9	19.5	18.2
ecmult_0p_g	15.1	13.6	15.4	14.3	13.2
ecmult_1p_g	12.1	10.8	12.1	11.4	10.4
field_half	0.00213	0.00196	0.00189	0.00189	0.00189
field_normalize	0.00556	0.00564	0.00549	0.00561	0.00559
field_normalize_weak	0.00228	0.00229	0.00228	0.00230	0.00229
field_sqr	0.0115	0.0106	0.0112	0.0104	0.00989
field_mul	0.0137	0.0118	0.0136	0.0123	0.0123
field_inverse	1.09	1.10	1.07	1.10	1.10
field_inverse_var	0.670	0.680	0.657	0.681	0.675
field_is_square_var	0.855	0.889	0.860	0.881	0.857
field_sqrt	3.34	2.98	3.17	2.95	2.82

dderjoel commented 1 year ago

Hi everyone, I've integrated the new implementations (CryptOpt Assembly and Fiat-C) into my fork, and it works with ./configure --with-asm=no and ./configure --with-asm=x86_64. I am unsure how to run the ci, to see if it works (especially on Windows).

Also, the current commits don't check if the CPU features flags BMI2 nor ADX. Is the idea to check that at runtime or compile time or both? Also, I can't figure out how I would make the field arithmetic static and inline again (if that is needed).

I'm happy to do the bulk work but I'd need some guidance on the following

[ ] CPU flags check on compile / run time
[ ] Where to put implementations (I've currently put them into src/third_party) and whether to force them to be static
[ ] If not static, then we need CMake integration (to compile ./third_party/*.c)

dderjoel commented 1 year ago

Hi everyone,

I've updated my fork in three ways:

I've included the up to date version from Fiat, i.e. without the unnecessary AND's
I've included the CryptOpt-generated versions into the header files, (as opposed to earlier where I've put them into separate C files). One problem was that the inline assembly did not specify that caller-save registers will be overwritten. Another problem was that rdx and rdi were also overwritten. Now, I've added two instructions to the end, which restore rdx and rdi. (Does anyone know hot to tell CC that those are clobbered AND read at the same time)?
I've included CPU flag detection macros into ./build-aux/m4/bitcoin_secp.m4. I've found existing macros to check common cpu features and to check via builtin already, but I was unable to reliably detect adx (especially with clang, which does not support those in __builtin_cpu_supports. c.f. the list of supported flags

I believe the last thing for a PR would be to check during the CMAKE build process, if BMI2 and ADX are available. Does anyone have other ideas on what we need here?

sipa commented 1 year ago

@dderjoel Thanks for the updates, I'm very glad to see the progress here.

My thinking is that we'll actually want to split this into two separate efforts, which we can discuss and consider with different timelines:

Replacing the pure-C code with Fiat-Crypto generated C code.
Introducing CryptoOpt-generated assembly code.

The first is straightforward, I think: your benchmarks show that the Fiat-Crypto generated C code (with the #810 optimization) is competitive with the C code we already have, so I think there is little reason not to just outright replace that code. Since there is no reason to keep the old code too, we don't need any decision logic or build system support to choose one over the other. I have some comments on the implementation, but that can wait for a PR. Overall, this feels very close to usable.

Introducing CryptoOpt is a more complex matter, because of the use of ADCX/ADOX/MULX it cannot be a drop-in replacement for the assembly code we already have. While I have no problem with optimizing exclusively for architectures which do have these, we can't just drop support for those who don't (but falling back to the C implementation, or another less-optimized asm implementation, on those is ok, IMO). There are a number of options:

Modify CryptoOpt to not use the new instructions, allowing us to use it as a drop-in replacement for the existing asm code. There are of course alternatives to this, like using the output of a well-known, stable, C compiler on the Fiat-Crypto generated C code, and include that output directly as asm code. Either of these options would essentially give us the same level of assurance in the asm code as fiat-crypto-output-fed-to-c-compiler, which is more than what we currently have for the asm code.
No autodetection, and only enable it when explicitly requested at configure time, which someone can do if they know it'll be run on a machine that supports the new instructions. This could be a first step to introduce the CryptoOpt code to the codebase, but it won't see widespread use in this mode.
Have runtime autodetection, to determine whether the machine the code is running on (not the compiler host machine as I have the impression your branch is trying to do) has support for the relevant instructions, and use that to determine which code to use. The CPU feature detection code isn't hard to write, but it may involve some changes to the library to have any kind of runtime-switchable features (e.g. when does the autodetect code run, does it need to go into the context for thread-safety, ...) that we'd need to have a discussion about.

sipa commented 1 year ago

Separate comment, as it feels unrelated to the rest.

In inline asm blocks, all variable templates fall in one of these categories:

Input variables
Early-clobber output variable (=&)
Output variables (=)
Input-output variables (+)
(not actually variables) explicitly clobbered registers (which won't be assigned to input or output variables)

Normal output variables may be assigned the same register as an input variable, but early-clobber output variables won't.

If there is a PR, I'm happy to look over it.

real-or-random commented 1 year ago

I agree that we should add Fiat-Crypto first separately. I wonder what would be necessary for reviewers to be convinced that the code is correct and what exact guarantees the Fiat-Crypto proofs provide.

Introducing CryptoOpt is a more complex matter [...]

I think a reasonable path forward is to first a compile-time option (maybe with a simple abort at run time, e.g., just call ADX in the self-check), and then later add runtime autodetection. Modifying CryptoOpt to be generic x86_64 sounds more complicated, and we'd probably leave performance on the table.

sipa commented 1 year ago

I wonder what would be necessary for reviewers to be convinced that the code is correct and what exact guarantees the Fiat-Crypto proofs provide.

Indeed. I wonder if @roconnor-blockstream would be interested in having a look at that too...

(Thinking out loud) would it be possible to use Klee to prove the C code that comes out is always correct, independent from the Fiat-crypto proofs? Actually, if that's feasible, perhaps we could try to do that right now with the current C code too.

I think a reasonable path forward is to first a compile-time option (maybe with a simple abort at run time, e.g., just call ADX in the self-check), and then later add runtime autodetection.

That seems reasonable.

Modifying CryptoOpt to be generic x86_64 sounds more complicated, and we'd probably leave performance on the table.

I certainly wouldn't have non-ADCX-cryptopt-asm as the only x86_64 asm code in the library longer term; that seems like a waste. But if we don't, what do we do for non-ADCX systems in the runtime autodetection world?

Fall back to the C code implementation (and deleting the existing ASM code)?
Fall back to the existing non-proven ASM code?
Replace the existing ASM code with a new one that incorporates #810 (e.g. by taking compiler output from the fiat-crypto C code)?

sipa commented 1 year ago

@dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don't think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.

dderjoel commented 1 year ago

Replacing the pure-C code with Fiat-Crypto generated C code.

I agree that we should add Fiat-Crypto first separately.

Will create a PR later which only replaces the 64bit C code. We can then continue to discuss there.

maybe with a simple abort at run time, e.g., just call ADX in the self-check

That seems reasonable.

Would that be in src/selftest.h:29? something like

static int secp256k1_selftest_passes(void) {
    int ret = 1;
#if defined(USE_ASM_X86_64)     
  __asm__ __volatile__("cpuid\n"
                       "mov %%rbx, %%rax\n"
                       "shr $19, %%rbx\n"
                       "shr $8, %%rax\n"
                       "and %%rbx, %%rax\n"
                       "and $1, %%rax\n"
                       : "=a"(ret)
                       : "a"(7), "c"(0)
                       : "rdx", "rbx");
#fi
    return secp256k1_selftest_sha256() && ret;
}

(but falling back to the C implementation, or another less-optimized asm implementation, on those is ok, IMO)

Yes. According to my benchmarks, the C code is faster than the current asm implementation so performance wise, it does make sense to fallback to that.

Modify CryptoOpt to not use the new instructions,

As @real-or-random already pointed out, that is a bit more complex, see also Issue 143 in the CryptOpt repo, where I explain why.

No autodetection [...], but it won't see widespread use in this mode.

Yes, I agree, which would be sad. As far as I understood, the current autodetection assumes that the code is compiled for the native system, as it checks if the local machine is x86_64. Hence, I thought it might as well also check if the flags are available. The only situation in which this (semantically) fails is if I compile on a (new ish) x86 machine and run it on an older x86 machine. This would, currently still work, but with the flag detection only in the build pipeline would not. I wonder how likely that is.

Have runtime autodetection

Yes, this is quite common in other libraries, too, like openSSL. If we have that, we could even go one step further and include one version for AMD and one for Intel, or even one for each Generation (at some point I believe it's a bit too much to maintain, but as CryptOpt makes it very easy to generate such implementations, it becomes feasible at least :))

@dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don't think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.

That is a question for Owen and the Fiat-Crypto team. So if I look at the current 32-bit implementation, and look at the limbs I think its 10 limbs and the last limb has a width of 22 bits.

When I call ./src/ExtractionOCaml/dettman_multiplication dettman 32 10 22 '2^256 - 4294968273' It fails with

Computed bounds (Some [Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], ome [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x800401ea7]]) are not tight enough expected bounds not looser than (Some [Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x5fffff]])).
The bounds [0x0 ~> 0x800401ea7] are looser than the expected bounds [0x0 ~> 0x5fffff]

and a lot of intermediate code.

On the CryptOpt side, I have not tried incorporating 32-bit code yet. This is probably another engineering task, probably at the order of "get CryptOpt to generate code without ADX / BMI2", as everything is tailored to 64-bit registers and carry-bits. But as it currently does not seem that urgent here either, this is not priority 1 on my list.

OwenConoly commented 1 year ago

When I call ./src/ExtractionOCaml/dettman_multiplication dettman 32 10 22 '2^256 - 4294968273' It fails with
Computed bounds (Some [Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], ome [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x800401ea7]]) are not tight enough expected bounds not looser than (Some [Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x5fffff]])).
The bounds [0x0 ~> 0x800401ea7] are looser than the expected bounds [0x0 ~> 0x5fffff]
Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.

dderjoel commented 1 year ago

Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.

Thanks for looking into that, Owen. FWIW, I'm not 100% sure if my invocation is correct, it was just an (somewhat educated) guess.

dderjoel commented 1 year ago

In inline asm blocks, all variable templates fall in one of these categories:

Input variables

Early-clobber output variable (=&)

Output variables (=)

Input-output variables (+)

(not actually variables) explicitly clobbered registers (which won't be assigned to input or output variables)

Normal output variables may be assigned the same register as an input variable, but early-clobber output variables won't.

If there is a PR, I'm happy to look over it.

So I've created a branch only-asm which features the asm implementation here. lines 202 and 203 are the instructions to restore r and b into rdi and rdx respectively. That is not always necessary, and also did not run through the equivalence checker. As it it sometimes dead operations, depending on when/how the code is used, it would be more elegant to tell CC that those are simply clobbered. I'm also unsure with all the memory read/write, if they are needed.

Now I'm having a hard time to sort "D"(r) and "d"(b) into your description above.

sipa commented 1 year ago

@dderjoel

Would that be in src/selftest.h:29? something like ...

I think @real-or-random meant something even simpler (just some asm code that uses adc, which presumbly crashes on non-supporting systems), but your suggestion is better (and already introduces detection logic which can be later moved from selftesting to autodetection once that is feasible).

Yes. According to my benchmarks, the C code is faster than the current asm implementation so performance wise, it does make sense to fallback to that.

Great.

As @real-or-random already pointed out, that is a bit more complex, see also https://github.com/0xADE1A1DE/CryptOpt/issues/143, where I explain why.

Ah, I see now why it's nontrivial to avoid those instructions. I assumed you already had support for mul/imul, and it was just a matter of disabling mulx... but I get why mul/imul are actually a lot harder to deal with than mulx.

In either case, we still do have the option of replacing the asm code with fiat-crypto-c-fed-through-c-compiler, if we find a particularly fast combination of compiler/version/options.

Yes, I agree, which would be sad. As far as I understood, the current autodetection assumes that the code is compiled for the native system, as it checks if the local machine is x86_64. Hence, I thought it might as well also check if the flags are available. The only situation in which this (semantically) fails is if I compile on a (new ish) x86 machine and run it on an older x86 machine. This would, currently still work, but with the flag detection only in the build pipeline would not. I wonder how likely that is.

I see where you're coming from now, but no, there is absolutely no assumption that the compilation is for the native system. You tell the libsecp256k1 build system what compiler suite to use, and it targets whatever that compiler is for. By default, that's your standard system compiler, which is most likely for the same architecture (x86/x86_64/arm32/aarch64/...) as the system you're running on, but there is no requirement for that. And even if it's the same architecture, it's not necessarily for your own system. In fact, I believe that there is a significant fraction of libsecp256k1 use that is built through cross-compiling (as compilation for another system than your own is called), as Bitcoin Core's release binaries (which include libsecp256k1) are created on a Linux x86_64 environment, cross-compiled for everything else (including arm32, aarch64, windows x86_64, mac osx, ...).

We absolutely can't do autodetection at compile or configure time. The options are manual configuration, or runtime autodetection.

If we have that, we could even go one step further and include one version for AMD and one for Intel, or even one for each Generation (at some point I believe it's a bit too much to maintain, but as CryptOpt makes it very easy to generate such implementations, it becomes feasible at least :))

Indeed. That's the advantage of using formal methods, that they lower the review barrier significantly for having multiple options like this.

It'd also open the door for things like #967, which adds a field with a more standard 4x64 limb representation, but which is only faster with asm optimization (because it much more crucially relies on fast carries, which are hard to get the compiler to emit for pure C code).

This is probably another engineering task, probably at the order of "get CryptOpt to generate code without ADX / BMI2", as everything is tailored to 64-bit registers and carry-bits. But as it currently does not seem that urgent here either, this is not priority 1 on my list.c

32-bit x86 is all but dead, I absolutely wouldn't suggest you spend any time on asm optimizing that.

The 32-bit C code is there for weak hardware, from hardware wallet to (old) Raspberry Pi like devices. But none of those are x86. If you're looking for more work, 32-bit ARM and (increasingly) 64-bit ARM are far more interesting targets than 32-bit x86.

@OwenConoly

Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.

Thank you for looking into it!

dderjoel commented 1 year ago

which presumbly crashes on non-supporting systems

Yes, should trigger an SIGILL in those cases.

re, build:

Yes, I forgot about the cross compilation and releases. in the only-asm branch, I've hooked the detection into the case where the build script is checking anyway, and is conservative to not use the asm then. In other words, where it used to check if it could use x86_64, it now additionally checks if the flags are set. However, now I see that this is not enough. For the case, one wants to build for generic x64, one would set --with-asm=x86-64 and this would result in code which needs the flags. So as an alternative, we could use additional switches --has-adx and --has-bmi2 to ./configure, and then if (regardless of auto-detected or explicitly set) also --with-asm=x86-64 is set, then the new asm can be used. Question then is, what the default for this should be (also auto detect or conservative to false?)

If we want to use runtime detection-and-selection, then I believe either the functions must be non-inlined and in e.g. here we could check a global bit which one to jump to, either the adx-asm version or to the C-compiled fallback.

peterdettman commented 1 year ago

Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.

Possibly relevant: the current 10x26 field mul/sqr leave their residual carry in limbs[2] (thus exceeding 26 bits), while the most-significant limb will be fully reduced (22 bits). The 5x52 (C code, not asm) leaves theirs in the most-significant limb (limbs[4]).

See this PR: https://github.com/bitcoin-core/secp256k1/pull/815 , in particular the first commit: https://github.com/bitcoin-core/secp256k1/pull/815/commits/9388b77b03c6e0c09a950e84ba34e9a2d8e0d433 . In that PR @sipa also anticipates that this may be better for fiat-crypto integration: https://github.com/bitcoin-core/secp256k1/pull/815#issuecomment-1540331544 .

I have previously mentioned a side-benefit to our magnitude tracking from rectifying this: https://github.com/bitcoin-core/secp256k1/issues/1060#issuecomment-1009647883 ("Secondly...")

dderjoel commented 1 year ago

Okay, I'm more than happy to do the work and let the core team here review, but I'd need some corners with some guidance, because I'm not 100% sure of what we want. Currently, I see two options:

At compile time include both, the asm version and the C version, and at runtime check the necessary bits and decide which one to use.
At compile time decide (with switches would we want, what would defaults be?) which one to compile and then have the check in the self-test (sort of what we have now in my fork's only-asm branch, maybe needs some more verbose error message.)

sipa commented 1 year ago

@dderjoel

I think eventually we want runtime check and proper dispatch to the right version, having both compiled in. But at this stage, I think it's more interesting to just have a configure flag that enables either the asm or C (defaulting to C code), plus a selfcheck.

This allows experimentation and benchmarking and testing, but avoids the need to make whatever restructuring is needed for autodetection. Moreover, it makes it easier for someone other than you do make the autodetection happen, as all the actual code would already be in.

dderjoel commented 1 year ago

Just created a PR for exactly this, can I trigger CI without removing the Draft status? EDIT: never mind, just a minute to trigger itself

dderjoel commented 1 year ago

I've re-run the sec_bench suite on the exact fork on my 10 test machines, and the results is still very similar to the one before.

implementation	default_asm	default_c	default_c52	fiat_c	fiat_cryptopt
ecmult_gen	15.8059	14.9232	15.7527	15.0524	14.3984
ecmult_const	30.0159	28.1553	30.2437	28.3729	26.7443
ecmult_1p	23.7137	21.99	23.5329	22.3725	20.9531
ecmult_0p_g	16.851	15.7264	16.8364	15.926	14.9667
ecmult_1p_g	13.8695	12.8423	13.7058	12.9949	12.2609
field_half	0.00246964	0.00245409	0.0024321	0.00240483	0.00248599
field_normalize	0.00738854	0.00741554	0.0073913	0.00737073	0.00733137
field_normalize_weak	0.00294516	0.00294647	0.00294284	0.0029337	0.00292048
field_sqr	0.0141365	0.0125201	0.0137808	0.0123229	0.0118652
field_mul	0.0170084	0.01443	0.0156856	0.014615	0.0145313
field_inverse	1.40032	1.39717	1.39222	1.39458	1.61234
field_inverse_var	0.912255	0.913256	0.908407	0.91439	0.909508
field_is_square_var	1.194	1.20539	1.20361	1.20455	1.19346
field_sqrt	3.87147	3.55153	3.83592	3.44752	3.34927

OwenConoly commented 1 year ago

@dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don't think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.

An update on this: as Joel mentioned, using the same template we used for the 64-bit code did not work to generate the 32-bit code. There are two reasons that it didn't work.

When writing the 64-bit template, I hadn't considered the fact that we might need to split modular reductions into pieces, like this. Note that we have both R1 and R0, instead of just a single constant R like in the 64-bit code. It was simple enough to make this generalization, so this isn't a problem anymore.
Conceptually, the 32-bit and 64-bit algorithms are almost identical---but there's one difference. In the 64-bit mul, the first reduction is from p8 (the highest partial product). So, naively extrapolating, I would guess that in the 32-bit mul, the first reduction would be from p18 (the highest partial product). If we try to generate the 32-bit mul and square using Joel's command-line invocation above, the generated code makes this "naive extrapolation" and starts out with a reduction from p18. This doesn't work nicely at all: in the end, the value in the most significant limb ends up (potentially) being too big to even fit in a 32-bit register. The handwritten 32-bit code avoids this issue by starting out with a reduction from p17, before moving on to the reduction from p18. The main difference between the generated code and the handwritten code is just that, in the handwritten code, this reduction from p17 appears at the beginning instead of (i.e., not in addition to) later in the function, as in the generated code. This is basically the only conceptual difference.

I am curious whether there is a simple explanation (other than "it makes the bounds work out right") for why we need to start the 32-bit mul with this reduction from p17, but we don't need to start the 64-bit mul with a reduction from p7. Ideally, I'd like to be able to answer questions like "Hypothetically, if we were on a 16-bit machine (with a 20-limb implementation of this algorithm), then which partial product is our first reduction from? p36 (the third-highest partial product), perhaps?" Of course it isn't vital to answer questions like this, but it would be nice to view both the 32-bit mul and the 64-bit mul as different instances of the same general algorithm.

OwenConoly commented 1 year ago

I overcame issue (2) above simply by creating a Coq implementation for the 32-bit mul which is separate from the implementation of the 64-bit mul. I followed along the C implementation from the first commit 9388b77 of #815, which @peterdettman mentioned above.

Here's the C code I generated for the 32-bit mul and sqr: secp256k1_dettman_32.c. Would you be interested in including this code in the library as well? I assume we'd want some benchmarks before considering that too seriously, and I gather from the discussion in #815 that not many people have 32-bit machines on which to benchmark things like this.

Note that the generated code does not include the Karatsuba optimization from the third commit of #815. I think it would be interesting to add the Karatsuba multiplication to fiat-crypto, so I would be willing to add that optimization as well.

And the 32-bit code I generated does incorporate the optimization from #810. The handwritten 32-bit code from #815 also has this optimization.

peterdettman commented 1 year ago

I am curious whether there is a simple explanation (other than "it makes the bounds work out right") for why we need to start the 32-bit mul with this reduction from p17, but we don't need to start the 64-bit mul with a reduction from p7. Ideally, I'd like to be able to answer questions like "Hypothetically, if we were on a 16-bit machine (with a 20-limb implementation of this algorithm), then which partial product is our first reduction from? p36 (the third-highest partial product), perhaps?" Of course it isn't vital to answer questions like this, but it would be nice to view both the 32-bit mul and the 64-bit mul as different instances of the same general algorithm.

Well it pretty much is just "it makes the bounds work out right", but the crux is which of the high partial products can we do last so that our residual carry lands neatly in the most-significant-limb. We then start one higher than that.

For 10x26: p18 produces a 91 bit reduced result that doesn't fit in r8, r9 i.e. 49 bits. p17 produces a 95 bit reduced result that doesn't fit in r7, r8, r9 i.e. 75 bits. p16 produces a 98 bit reduced result that does fit in r6, r7, r8, r9 i.e. 101 bits.

So our best option for the last partial products step is p6/p16, and a final carry chain landing in r9 as desired.

For 5x52: p8 produces a 143 bit reduced result that doesn't fit in r3, r4 i.e. 101 bits. p7 produces a 147 bit reduced result that does fit in r2, r3, r4 i.e. 153 bits.

So our best option for the last partial products step is p2/p7, and a final carry chain landing in r4 as desired.

For our field there's (conservatively) an extra bit coming from the lower partial product in each pair and the carry-in (that I ignored above), but for a field with a prime very (very) close to a power of 2, the lower partial product would need to be included in the analysis properly.

Regarding 16-bit, I doubt a practical algorithm would try to follow the pattern of these implementations. If a reduced radix is even tenable I would guess it would have to use Karatsuba, perhaps with 20 limbs in four groups of 64 bits: (13 13 13 13 12). (Note that this is already incompatible with the current assumption of 4 extra bits for carry-free linear field ops).

peterdettman commented 1 year ago

Also, +1 in principle to including also the 32-bit code.

bitcoin-core / secp256k1

fiat-crypto + CryptOpt tracking issue #1261

Geometric mean over nine machines (per-machine tables at the end of the post)

Intel Core i7-6770HQ

Intel Core i7-10710U

Intel Core i9-10900K

Intel Core i7-11700KF

Intel Core i9-13900KF

AMD Ryzen Theadripper 1900X

AMD Ryzen 5800X

AMD Ryzen 5950X

AMD Ryzen 7950X

Supercop compiler settings

output of `./configure` on i7-11700K, `./fiat_cryptopt`

Geometric mean over ~nine~ ten machines (per-machine tables at the end of the post)

Intel Core i7-6770HQ

Intel Core i7-10710U

Intel Core i9-10900K

Intel Core i7-11700KF

Intel Core i9-12900KF

Intel Core i9-13900KF

AMD Ryzen Theadripper 1900X

AMD Ryzen 5800X

AMD Ryzen 5950X

AMD Ryzen 7950X

bitcoin-core / secp256k1

fiat-crypto + CryptOpt tracking issue #1261

Geometric mean over nine machines (per-machine tables at the end of the post)

Intel Core i7-6770HQ

Intel Core i7-10710U

Intel Core i9-10900K

Intel Core i7-11700KF

Intel Core i9-13900KF

AMD Ryzen Theadripper 1900X

AMD Ryzen 5800X

AMD Ryzen 5950X

AMD Ryzen 7950X

Supercop compiler settings

output of ./configure on i7-11700K, ./fiat_cryptopt

Geometric mean over ~nine~ ten machines (per-machine tables at the end of the post)

Intel Core i7-6770HQ

Intel Core i7-10710U

Intel Core i9-10900K

Intel Core i7-11700KF

Intel Core i9-12900KF

Intel Core i9-13900KF

AMD Ryzen Theadripper 1900X

AMD Ryzen 5800X

AMD Ryzen 5950X

AMD Ryzen 7950X

output of `./configure` on i7-11700K, `./fiat_cryptopt`