greg7mdp commented 2 years ago

@chfast, let me know what you find out please :-)

chfast commented 2 years ago

Preliminary benchmarks:

ecmod<addmod_public>_mean                                            +0.0584         +0.0584             7             7             7             7                                                                              
ecmod<addmod_simple>_mean                                            -0.0017         -0.0017            22            22            22            22                                                                              
ecmod<addmod_prenormalize>_mean                                      -0.0999         -0.0999             5             5             5             5                                                                              
ecmod<addmod_daosvik_v1>_mean                                        +0.0210         +0.0210            12            12            12            12                                                                              
ecmod<addmod_daosvik_v2>_mean                                        +0.0103         +0.0103             6             6             6             6                                                                              
ecmod<mulmod>_mean                                                   +0.0051         +0.0051            68            68            68            68                                                                              
shift<uint256, uint256, shl_public>/-1_mean                          -0.0789         -0.0789             3             2             3             2
shift<uint256, uint256, shl_public>/0_mean                           -0.0716         -0.0716             3             3             3             3
shift<uint256, uint256, shl_public>/1_mean                           -0.0020         -0.0020             3             3             3             3
shift<uint256, uint256, shl_public>/2_mean                           -0.0885         -0.0885             3             2             3             2
shift<uint256, uint256, shl_public>/3_mean                           -0.1080         -0.1080             2             2             2             2
shift<uint256, uint256, shl_halves>/-1_mean                          +0.0045         +0.0045             2             2             2             2
shift<uint256, uint256, shl_halves>/0_mean                           +0.0012         +0.0012             3             3             3             3
shift<uint256, uint256, shl_halves>/1_mean                           +0.0016         +0.0016             3             3             3             3
shift<uint256, uint256, shl_halves>/2_mean                           +0.0004         +0.0004             2             2             2             2
shift<uint256, uint256, shl_halves>/3_mean                           -0.0006         -0.0006             2             2             2             2
shift<uint256, uint64_t, shl_public>/-1_mean                         -0.0052         -0.0052             2             2             2             2
shift<uint256, uint64_t, shl_public>/0_mean                          +0.0021         +0.0021             3             3             3             3
shift<uint256, uint64_t, shl_public>/1_mean                          +0.0002         +0.0002             2             2             2             2
shift<uint256, uint64_t, shl_public>/2_mean                          -0.0959         -0.0959             2             2             2             2
shift<uint256, uint64_t, shl_public>/3_mean                          +0.0070         +0.0070             2             2             2             2
shift<uint256, uint64_t, shl_halves>/-1_mean                         -0.0030         -0.0030             2             2             2             2
shift<uint256, uint64_t, shl_halves>/0_mean                          -0.0022         -0.0022             3             3             3             3
shift<uint256, uint64_t, shl_halves>/1_mean                          -0.0017         -0.0017             2             2             2             2
shift<uint256, uint64_t, shl_halves>/2_mean                          -0.0785         -0.0785             2             2             2             2
shift<uint256, uint64_t, shl_halves>/3_mean                          +0.0006         +0.0006             2             2             2             2
shift<uint256, uint64_t, shl_llvm>/-1_mean                           -0.0057         -0.0057             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/0_mean                            -0.0049         -0.0049             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/1_mean                            -0.0056         -0.0056             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/2_mean                            -0.0073         -0.0073             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/3_mean                            -0.0053         -0.0053             6             6             6             6
shift<uint512, uint512, shl_public>/-1_mean                          +0.0005         +0.0005             9             9             9             9
shift<uint512, uint512, shl_public>/0_mean                           +0.0008         +0.0008            10            10            10            10
shift<uint512, uint512, shl_public>/1_mean                           -0.0046         -0.0046             9             9             9             9
shift<uint512, uint512, shl_public>/2_mean                           -0.0028         -0.0028             9             9             9             9
shift<uint512, uint512, shl_public>/3_mean                           +0.0006         +0.0006             8             8             8             8
shift<uint512, uint64_t, shl_public>/-1_mean                         -0.0081         -0.0081             8             8             8             8
shift<uint512, uint64_t, shl_public>/0_mean                          +0.0538         +0.0538             8             9             8             9
shift<uint512, uint64_t, shl_public>/1_mean                          -0.0263         -0.0263             8             8             8             8
shift<uint512, uint64_t, shl_public>/2_mean                          -0.0286         -0.0286             8             8             8             8
shift<uint512, uint64_t, shl_public>/3_mean                          -0.0297         -0.0297             8             8             8             8
compare<lt_public>/64_mean                                           -0.0021         -0.0021             2             2             2             2
compare<lt_public>/128_mean                                          -0.1423         -0.1423             2             2             2             2
compare<lt_public>/192_mean                                          +0.0196         +0.0196             2             2             2             2
compare<lt_public>/256_mean                                          -0.1432         -0.1432             2             1             2             1
compare<lt_sub>/64_mean                                              +0.0092         +0.0092             2             2             2             2
compare<lt_sub>/128_mean                                             +0.0044         +0.0044             2             2             2             2
compare<lt_sub>/192_mean                                             +0.0062         +0.0062             2             2             2             2
compare<lt_sub>/256_mean                                             +0.0078         +0.0078             2             2             2             2
compare<lt_wordcmp>/64_mean                                          +0.0009         +0.0009             2             2             2             2
compare<lt_wordcmp>/128_mean                                         +0.0024         +0.0024             2             2             2             2
compare<lt_wordcmp>/192_mean                                         -0.0349         -0.0349             4             4             4             4
compare<lt_wordcmp>/256_mean                                         +0.0308         +0.0308             3             4             3             4
compare<lt_halves>/64_mean                                           -0.0025         -0.0025             3             3             3             3
compare<lt_halves>/128_mean                                          -0.0021         -0.0021             3             3             3             3
compare<lt_halves>/192_mean                                          -0.0018         -0.0018             3             3             3             3
compare<lt_halves>/256_mean                                          -0.0031         -0.0031             3             3             3             3
compare<lt_llvm>/64_mean                                             -0.0127         -0.0127             2             2             2             2
compare<lt_llvm>/128_mean                                            -0.0117         -0.0117             2             2             2             2
compare<lt_llvm>/192_mean                                            -0.0109         -0.0109             2             2             2             2
compare<lt_llvm>/256_mean                                            -0.0115         -0.0115             2             2             2             2

So this looks solid for compare operators directly and seems to help the shifts as well, although this may be code layout effect.

greg7mdp commented 2 years ago

Thanks, glad to hear :-)

chfast commented 2 years ago

Confirmed, this does not affect assembly of shift operators. Proof: https://godbolt.org/z/4b6cT4rje.

greg7mdp commented 2 years ago

Yes, makes sense. When I played with this repo a while back, I noticed that the benchmarks had variability from one run to the next, so I'm not sure how reliable the comparison truly is. I do believe that my other changes in add, sub were also beneficial.

codecov-commenter commented 2 years ago

Codecov Report

Merging #269 (f977136) into master (e243bfb) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #269   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines         1941      1937    -4     
=========================================
- Hits          1941      1937    -4

Flag	Coverage Δ
32bit	`100.00% <100.00%> (ø)`
gcc	`99.47% <100.00%> (-0.01%)`	:arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
include/intx/intx.hpp	`100.00% <100.00%> (ø)`

axic commented 2 years ago

@chfast should this be merged?

chfast commented 2 years ago

On the latest compilers I don't see much difference so I'm not very much in favor because this implementation has 4 branches while the original one is branchless. But people are welcome to drop their own benchmarking results (use "compare" filter).

greg7mdp commented 2 years ago

Why does it matter that it is branchless if it is slower?

chfast commented 2 years ago

I will try to clean up the situation. We are discussing 3 "less than" implementations:

"sub" - subtract two values and return the borrow flag,
"ne" - find the top pair of not-equal words, return comparison of these; proposed here by @greg7mdp,
"split" - the current one which is the mixture of the two above: split into two 128-bit values and subtract them.

Here you can see the assembly output for each: https://godbolt.org/z/WY3qqzrET Compilers are doing good job.

Analysis

In the worst case every variant need to load all 8 words. Additionally, "sub" executes 4 subtract instructions (independent of input). The "ne" executes 4 cmp+jmp instead. These 2 instructions are fused to single micro-instruction. So it should perform the same as "sub" in worst case but may perform better if it can "exit" earlier.

Benchmarks

I added all variants of implementations in #277 and benchmarked them on two CPUs. There are 5 benchmark cases:

64, 128, 192, 256 mean the value size in bits. E.g. 64 means we are comparing two 64 bit numbers (i.e. 3 top words are zeros).
0 equal combination of above cases, shuffled. There are also some internal changes to benchmarks to improve stability of results.

Haswell

compare<[lt_split vs. lt_ne]>/0_mean                    +0.1471         +0.1471             1             2             1             2
compare<[lt_split vs. lt_ne]>/64_mean                   -0.0009         -0.0009             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  -0.0009         -0.0009             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  +0.0013         +0.0013             1             1             1             1
compare<[lt_split vs. lt_ne]>/256_mean                  +0.2908         +0.2908             1             2             1             2

compare<[lt_split vs. lt_sub]>/0_mean                    -0.0345         -0.0345             1             1             1             1
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0827         -0.0827             2             1             2             1
compare<[lt_split vs. lt_sub]>/128_mean                  -0.0818         -0.0818             2             1             2             1
compare<[lt_split vs. lt_sub]>/192_mean                  +0.0992         +0.0992             1             1             1             1
compare<[lt_split vs. lt_sub]>/256_mean                  +0.0978         +0.0978             1             1             1             1

compare<[lt_sub vs. lt_ne]>/0_mean                    +0.1882         +0.1882             1             2             1             2
compare<[lt_sub vs. lt_ne]>/64_mean                   +0.0892         +0.0892             1             2             1             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.0881         +0.0881             1             2             1             2
compare<[lt_sub vs. lt_ne]>/192_mean                  -0.0890         -0.0890             1             1             1             1
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.1758         +0.1758             1             2             1             2

Skylake

compare<[lt_split vs. lt_ne]>/0_mean                    +0.0854         +0.0854             2             2             2             2
compare<[lt_split vs. lt_ne]>/64_mean                   +0.0036         +0.0036             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  +0.1674         +0.1674             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  -0.0026         -0.0027             2             2             2             2
compare<[lt_split vs. lt_ne]>/256_mean                  +0.0057         +0.0057             2             2             2             2

compare<[lt_split vs. lt_sub]>/0_mean                    -0.1562         -0.1562             2             2             2             2
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0004         -0.0004             2             2             2             2
compare<[lt_split vs. lt_sub]>/128_mean                  +0.0004         +0.0004             2             2             2             2
compare<[lt_split vs. lt_sub]>/192_mean                  -0.2086         -0.2086             2             2             2             2
compare<[lt_split vs. lt_sub]>/256_mean                  -0.2122         -0.2122             2             2             2             2

compare<[lt_sub vs. lt_ne]>/0_mean                    +0.2863         +0.2863             2             2             2             2
compare<[lt_sub vs. lt_ne]>/64_mean                   +0.0039         +0.0039             2             2             2             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.1670         +0.1670             2             2             2             2
compare<[lt_sub vs. lt_ne]>/192_mean                  +0.2602         +0.2602             2             2             2             2
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.2766         +0.2766             2             2             2             2

Zen3

This is noisy machine with AMD CPU. For this case I used lt_ne2 for comparison because it was 15% better than the lt_ne.

compare<[lt_split vs. lt_ne2]>/0_mean                    -0.1398         -0.1397             4             3             4             3
compare<[lt_split vs. lt_ne2]>/64_mean                   -0.2668         -0.2668             3             2             3             2
compare<[lt_split vs. lt_ne2]>/128_mean                  -0.1391         -0.1392             3             3             3             3
compare<[lt_split vs. lt_ne2]>/192_mean                  -0.1564         -0.1564             3             3             3             3
compare<[lt_split vs. lt_ne2]>/256_mean                  -0.1487         -0.1487             3             3             3             3

compare<[lt_split vs. lt_sub]>/0_mean                    -0.2012         -0.2010             4             3             4             3
compare<[lt_split vs. lt_sub]>/64_mean                   -0.1363         -0.1366             3             3             3             3
compare<[lt_split vs. lt_sub]>/128_mean                  -0.1508         -0.1507             3             3             3             3
compare<[lt_split vs. lt_sub]>/192_mean                  -0.1555         -0.1555             3             3             3             3
compare<[lt_split vs. lt_sub]>/256_mean                  -0.1456         -0.1456             3             3             3             3

compare<[lt_sub vs. lt_ne2]>/0_mean                    +0.0769         +0.0768             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/64_mean                   -0.1511         -0.1508             3             2             3             2
compare<[lt_sub vs. lt_ne2]>/128_mean                  +0.0137         +0.0136             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/192_mean                  -0.0010         -0.0010             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/256_mean                  -0.0036         -0.0036             3             3             3             3

Xeon Platinum 8272CL

compare<[lt_split vs. lt_ne]>/0_mean                    +0.2017         +0.2016             2             2             2             2
compare<[lt_split vs. lt_ne]>/64_mean                   -0.0078         -0.0073             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  +0.3924         +0.3924             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  -0.0017         -0.0016             2             2             2             2
compare<[lt_split vs. lt_ne]>/256_mean                  -0.0009         -0.0009             2             2             2             2

compare<[lt_split vs. lt_sub]>/0_mean                    -0.0803         -0.0803             2             2             2             2
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0051         -0.0046             2             2             2             2
compare<[lt_split vs. lt_sub]>/128_mean                  -0.0021         -0.0020             2             2             2             2
compare<[lt_split vs. lt_sub]>/192_mean                  -0.0025         -0.0025             2             2             2             2
compare<[lt_split vs. lt_sub]>/256_mean                  -0.0023         -0.0023             2             2             2             2

compare<[lt_sub vs. lt_ne]>/0_mean                    +0.3065         +0.3065             2             2             2             2
compare<[lt_sub vs. lt_ne]>/64_mean                   -0.0027         -0.0027             2             2             2             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.3953         +0.3953             2             2             2             2
compare<[lt_sub vs. lt_ne]>/192_mean                  +0.0009         +0.0009             2             2             2             2
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.0015         +0.0015             2             2             2             2

From the results we should rather consider switching back to "sub" implementation. I may do some other benchmarks if I have time.