greg7mdp opened 2 years ago
Preliminary benchmarks (each row: relative time Δ, relative CPU Δ, then old/new time and old/new CPU):
```
ecmod<addmod_public>_mean +0.0584 +0.0584 7 7 7 7
ecmod<addmod_simple>_mean -0.0017 -0.0017 22 22 22 22
ecmod<addmod_prenormalize>_mean -0.0999 -0.0999 5 5 5 5
ecmod<addmod_daosvik_v1>_mean +0.0210 +0.0210 12 12 12 12
ecmod<addmod_daosvik_v2>_mean +0.0103 +0.0103 6 6 6 6
ecmod<mulmod>_mean +0.0051 +0.0051 68 68 68 68
shift<uint256, uint256, shl_public>/-1_mean -0.0789 -0.0789 3 2 3 2
shift<uint256, uint256, shl_public>/0_mean -0.0716 -0.0716 3 3 3 3
shift<uint256, uint256, shl_public>/1_mean -0.0020 -0.0020 3 3 3 3
shift<uint256, uint256, shl_public>/2_mean -0.0885 -0.0885 3 2 3 2
shift<uint256, uint256, shl_public>/3_mean -0.1080 -0.1080 2 2 2 2
shift<uint256, uint256, shl_halves>/-1_mean +0.0045 +0.0045 2 2 2 2
shift<uint256, uint256, shl_halves>/0_mean +0.0012 +0.0012 3 3 3 3
shift<uint256, uint256, shl_halves>/1_mean +0.0016 +0.0016 3 3 3 3
shift<uint256, uint256, shl_halves>/2_mean +0.0004 +0.0004 2 2 2 2
shift<uint256, uint256, shl_halves>/3_mean -0.0006 -0.0006 2 2 2 2
shift<uint256, uint64_t, shl_public>/-1_mean -0.0052 -0.0052 2 2 2 2
shift<uint256, uint64_t, shl_public>/0_mean +0.0021 +0.0021 3 3 3 3
shift<uint256, uint64_t, shl_public>/1_mean +0.0002 +0.0002 2 2 2 2
shift<uint256, uint64_t, shl_public>/2_mean -0.0959 -0.0959 2 2 2 2
shift<uint256, uint64_t, shl_public>/3_mean +0.0070 +0.0070 2 2 2 2
shift<uint256, uint64_t, shl_halves>/-1_mean -0.0030 -0.0030 2 2 2 2
shift<uint256, uint64_t, shl_halves>/0_mean -0.0022 -0.0022 3 3 3 3
shift<uint256, uint64_t, shl_halves>/1_mean -0.0017 -0.0017 2 2 2 2
shift<uint256, uint64_t, shl_halves>/2_mean -0.0785 -0.0785 2 2 2 2
shift<uint256, uint64_t, shl_halves>/3_mean +0.0006 +0.0006 2 2 2 2
shift<uint256, uint64_t, shl_llvm>/-1_mean -0.0057 -0.0057 6 6 6 6
shift<uint256, uint64_t, shl_llvm>/0_mean -0.0049 -0.0049 6 6 6 6
shift<uint256, uint64_t, shl_llvm>/1_mean -0.0056 -0.0056 6 6 6 6
shift<uint256, uint64_t, shl_llvm>/2_mean -0.0073 -0.0073 6 6 6 6
shift<uint256, uint64_t, shl_llvm>/3_mean -0.0053 -0.0053 6 6 6 6
shift<uint512, uint512, shl_public>/-1_mean +0.0005 +0.0005 9 9 9 9
shift<uint512, uint512, shl_public>/0_mean +0.0008 +0.0008 10 10 10 10
shift<uint512, uint512, shl_public>/1_mean -0.0046 -0.0046 9 9 9 9
shift<uint512, uint512, shl_public>/2_mean -0.0028 -0.0028 9 9 9 9
shift<uint512, uint512, shl_public>/3_mean +0.0006 +0.0006 8 8 8 8
shift<uint512, uint64_t, shl_public>/-1_mean -0.0081 -0.0081 8 8 8 8
shift<uint512, uint64_t, shl_public>/0_mean +0.0538 +0.0538 8 9 8 9
shift<uint512, uint64_t, shl_public>/1_mean -0.0263 -0.0263 8 8 8 8
shift<uint512, uint64_t, shl_public>/2_mean -0.0286 -0.0286 8 8 8 8
shift<uint512, uint64_t, shl_public>/3_mean -0.0297 -0.0297 8 8 8 8
compare<lt_public>/64_mean -0.0021 -0.0021 2 2 2 2
compare<lt_public>/128_mean -0.1423 -0.1423 2 2 2 2
compare<lt_public>/192_mean +0.0196 +0.0196 2 2 2 2
compare<lt_public>/256_mean -0.1432 -0.1432 2 1 2 1
compare<lt_sub>/64_mean +0.0092 +0.0092 2 2 2 2
compare<lt_sub>/128_mean +0.0044 +0.0044 2 2 2 2
compare<lt_sub>/192_mean +0.0062 +0.0062 2 2 2 2
compare<lt_sub>/256_mean +0.0078 +0.0078 2 2 2 2
compare<lt_wordcmp>/64_mean +0.0009 +0.0009 2 2 2 2
compare<lt_wordcmp>/128_mean +0.0024 +0.0024 2 2 2 2
compare<lt_wordcmp>/192_mean -0.0349 -0.0349 4 4 4 4
compare<lt_wordcmp>/256_mean +0.0308 +0.0308 3 4 3 4
compare<lt_halves>/64_mean -0.0025 -0.0025 3 3 3 3
compare<lt_halves>/128_mean -0.0021 -0.0021 3 3 3 3
compare<lt_halves>/192_mean -0.0018 -0.0018 3 3 3 3
compare<lt_halves>/256_mean -0.0031 -0.0031 3 3 3 3
compare<lt_llvm>/64_mean -0.0127 -0.0127 2 2 2 2
compare<lt_llvm>/128_mean -0.0117 -0.0117 2 2 2 2
compare<lt_llvm>/192_mean -0.0109 -0.0109 2 2 2 2
compare<lt_llvm>/256_mean -0.0115 -0.0115 2 2 2 2
```
So this looks solid for the compare operators directly, and it seems to help the shifts as well, although that may be a code-layout effect.
Thanks, glad to hear :-)
Confirmed, this does not affect the assembly of the shift operators. Proof: https://godbolt.org/z/4b6cT4rje.
Yes, makes sense. When I played with this repo a while back, I noticed that the benchmarks had variability from one run to the next, so I'm not sure how reliable the comparison truly is.
I do believe that my other changes in `add` and `sub` were also beneficial.
Merging #269 (f977136) into master (e243bfb) will not change coverage. The diff coverage is 100.00%.

```
@@           Coverage Diff            @@
##            master      #269   +/- ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           10        10
  Lines         1941      1937        -4
=========================================
- Hits          1941      1937        -4
```

Flag | Coverage Δ | |
---|---|---|
32bit | 100.00% <100.00%> (ø) | |
gcc | 99.47% <100.00%> (-0.01%) | :arrow_down: |

Flags with carried forward coverage won't be shown.

Impacted Files | Coverage Δ | |
---|---|---|
include/intx/intx.hpp | 100.00% <100.00%> (ø) | |
@chfast should this be merged?
On the latest compilers I don't see much difference, so I'm not very much in favor: this implementation has 4 branches while the original one is branchless. But people are welcome to drop their own benchmarking results (use the "compare" filter).
Why does it matter that it is branchless if it is slower?
I will try to clean up the situation. We are discussing 3 "less than" implementations:
Here you can see the assembly output for each variant: https://godbolt.org/z/WY3qqzrET. The compilers are doing a good job.
In the worst case every variant needs to load all 8 words. Additionally, "sub" executes 4 subtract instructions regardless of the input, while "ne" executes 4 cmp+jmp pairs instead. Each cmp+jmp pair is fused into a single micro-op (macro-fusion), so in the worst case "ne" should perform the same as "sub", but it may perform better when it can exit early.
I added all the implementation variants in #277 and benchmarked them on two CPUs. There are 5 benchmark cases:
```
compare<[lt_split vs. lt_ne]>/0_mean +0.1471 +0.1471 1 2 1 2
compare<[lt_split vs. lt_ne]>/64_mean -0.0009 -0.0009 2 2 2 2
compare<[lt_split vs. lt_ne]>/128_mean -0.0009 -0.0009 2 2 2 2
compare<[lt_split vs. lt_ne]>/192_mean +0.0013 +0.0013 1 1 1 1
compare<[lt_split vs. lt_ne]>/256_mean +0.2908 +0.2908 1 2 1 2
compare<[lt_split vs. lt_sub]>/0_mean -0.0345 -0.0345 1 1 1 1
compare<[lt_split vs. lt_sub]>/64_mean -0.0827 -0.0827 2 1 2 1
compare<[lt_split vs. lt_sub]>/128_mean -0.0818 -0.0818 2 1 2 1
compare<[lt_split vs. lt_sub]>/192_mean +0.0992 +0.0992 1 1 1 1
compare<[lt_split vs. lt_sub]>/256_mean +0.0978 +0.0978 1 1 1 1
compare<[lt_sub vs. lt_ne]>/0_mean +0.1882 +0.1882 1 2 1 2
compare<[lt_sub vs. lt_ne]>/64_mean +0.0892 +0.0892 1 2 1 2
compare<[lt_sub vs. lt_ne]>/128_mean +0.0881 +0.0881 1 2 1 2
compare<[lt_sub vs. lt_ne]>/192_mean -0.0890 -0.0890 1 1 1 1
compare<[lt_sub vs. lt_ne]>/256_mean +0.1758 +0.1758 1 2 1 2
compare<[lt_split vs. lt_ne]>/0_mean +0.0854 +0.0854 2 2 2 2
compare<[lt_split vs. lt_ne]>/64_mean +0.0036 +0.0036 2 2 2 2
compare<[lt_split vs. lt_ne]>/128_mean +0.1674 +0.1674 2 2 2 2
compare<[lt_split vs. lt_ne]>/192_mean -0.0026 -0.0027 2 2 2 2
compare<[lt_split vs. lt_ne]>/256_mean +0.0057 +0.0057 2 2 2 2
compare<[lt_split vs. lt_sub]>/0_mean -0.1562 -0.1562 2 2 2 2
compare<[lt_split vs. lt_sub]>/64_mean -0.0004 -0.0004 2 2 2 2
compare<[lt_split vs. lt_sub]>/128_mean +0.0004 +0.0004 2 2 2 2
compare<[lt_split vs. lt_sub]>/192_mean -0.2086 -0.2086 2 2 2 2
compare<[lt_split vs. lt_sub]>/256_mean -0.2122 -0.2122 2 2 2 2
compare<[lt_sub vs. lt_ne]>/0_mean +0.2863 +0.2863 2 2 2 2
compare<[lt_sub vs. lt_ne]>/64_mean +0.0039 +0.0039 2 2 2 2
compare<[lt_sub vs. lt_ne]>/128_mean +0.1670 +0.1670 2 2 2 2
compare<[lt_sub vs. lt_ne]>/192_mean +0.2602 +0.2602 2 2 2 2
compare<[lt_sub vs. lt_ne]>/256_mean +0.2766 +0.2766 2 2 2 2
```
This is a noisy machine with an AMD CPU. For this case I used `lt_ne2` for the comparison because it was 15% better than `lt_ne`.
```
compare<[lt_split vs. lt_ne2]>/0_mean -0.1398 -0.1397 4 3 4 3
compare<[lt_split vs. lt_ne2]>/64_mean -0.2668 -0.2668 3 2 3 2
compare<[lt_split vs. lt_ne2]>/128_mean -0.1391 -0.1392 3 3 3 3
compare<[lt_split vs. lt_ne2]>/192_mean -0.1564 -0.1564 3 3 3 3
compare<[lt_split vs. lt_ne2]>/256_mean -0.1487 -0.1487 3 3 3 3
compare<[lt_split vs. lt_sub]>/0_mean -0.2012 -0.2010 4 3 4 3
compare<[lt_split vs. lt_sub]>/64_mean -0.1363 -0.1366 3 3 3 3
compare<[lt_split vs. lt_sub]>/128_mean -0.1508 -0.1507 3 3 3 3
compare<[lt_split vs. lt_sub]>/192_mean -0.1555 -0.1555 3 3 3 3
compare<[lt_split vs. lt_sub]>/256_mean -0.1456 -0.1456 3 3 3 3
compare<[lt_sub vs. lt_ne2]>/0_mean +0.0769 +0.0768 3 3 3 3
compare<[lt_sub vs. lt_ne2]>/64_mean -0.1511 -0.1508 3 2 3 2
compare<[lt_sub vs. lt_ne2]>/128_mean +0.0137 +0.0136 3 3 3 3
compare<[lt_sub vs. lt_ne2]>/192_mean -0.0010 -0.0010 3 3 3 3
compare<[lt_sub vs. lt_ne2]>/256_mean -0.0036 -0.0036 3 3 3 3
```
```
compare<[lt_split vs. lt_ne]>/0_mean +0.2017 +0.2016 2 2 2 2
compare<[lt_split vs. lt_ne]>/64_mean -0.0078 -0.0073 2 2 2 2
compare<[lt_split vs. lt_ne]>/128_mean +0.3924 +0.3924 2 2 2 2
compare<[lt_split vs. lt_ne]>/192_mean -0.0017 -0.0016 2 2 2 2
compare<[lt_split vs. lt_ne]>/256_mean -0.0009 -0.0009 2 2 2 2
compare<[lt_split vs. lt_sub]>/0_mean -0.0803 -0.0803 2 2 2 2
compare<[lt_split vs. lt_sub]>/64_mean -0.0051 -0.0046 2 2 2 2
compare<[lt_split vs. lt_sub]>/128_mean -0.0021 -0.0020 2 2 2 2
compare<[lt_split vs. lt_sub]>/192_mean -0.0025 -0.0025 2 2 2 2
compare<[lt_split vs. lt_sub]>/256_mean -0.0023 -0.0023 2 2 2 2
compare<[lt_sub vs. lt_ne]>/0_mean +0.3065 +0.3065 2 2 2 2
compare<[lt_sub vs. lt_ne]>/64_mean -0.0027 -0.0027 2 2 2 2
compare<[lt_sub vs. lt_ne]>/128_mean +0.3953 +0.3953 2 2 2 2
compare<[lt_sub vs. lt_ne]>/192_mean +0.0009 +0.0009 2 2 2 2
compare<[lt_sub vs. lt_ne]>/256_mean +0.0015 +0.0015 2 2 2 2
```
Given these results, we should rather consider switching back to the "sub" implementation. I may do some other benchmarks if I have time.
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities
0 Security Hotspots
11 Code Smells
No Coverage information
0.0% Duplication
@chfast, let me know what you find out please :-)