Open chfast opened 7 months ago
Hey @chfast, thanks for redirecting me from https://github.com/ethereum/evmone/issues/777.
Since I can't open a PR, let me present my work here.
Here is the mul method from the evmmax.hpp
file implementing the improvement of the CIOS variant of the Montgomery algorithm proposed in the following document: EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication
/// Performs a Montgomery modular multiplication.
///
/// Inputs must be in Montgomery form: x = aR, y = bR.
/// This computes Montgomery multiplication xyR⁻¹ % mod what gives aRbRR⁻¹ % mod = abR % mod.
/// The result (abR) is in Montgomery form.
constexpr UintT mul(const UintT& x, const UintT& y) const noexcept
{
// Coarsely Integrated Operand Scanning (CIOS) Method
// Based on 2.3.2 from
// High-Speed Algorithms & Architectures For Number-Theoretic Cryptosystems
// https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf
// and on 2.2 from
// EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication
// https://eprint.iacr.org/2022/1400.pdf
constexpr auto S = UintT::num_words; // TODO(C++23): Make it static
intx::uint<UintT::num_bits + 64> t;
if (mod[S - 1] < std::numeric_limits<uint64_t>::max() / 2)
{
for (size_t i = 0; i != S; ++i)
{
uint64_t c = 0;
for (size_t j = 0; j != S; ++j)
std::tie(c, t[j]) = addmul(t[j], x[j], y[i], c);
auto const c_2 = c;
const auto m = t[0] * m_mod_inv;
std::tie(c, std::ignore) = addmul(t[0], m, mod[0], 0);
for (size_t j = 1; j != S; ++j)
std::tie(c, t[j - 1]) = addmul(t[j], m, mod[j], c);
t[S - 1] = c_2 + c;
}
}
else
{
for (size_t i = 0; i != S; ++i)
{
uint64_t c = 0;
for (size_t j = 0; j != S; ++j)
std::tie(c, t[j]) = addmul(t[j], x[j], y[i], c);
auto tmp = intx::addc(t[S], c);
t[S] = tmp.value;
const auto d = tmp.carry; // TODO: Carry is 0 for sparse modulus.
const auto m = t[0] * m_mod_inv;
std::tie(c, std::ignore) = addmul(t[0], m, mod[0], 0);
for (size_t j = 1; j != S; ++j)
std::tie(c, t[j - 1]) = addmul(t[j], m, mod[j], c);
tmp = intx::addc(t[S], c);
t[S - 1] = tmp.value;
t[S] = d + tmp.carry; // TODO: Carry is 0 for sparse modulus.
}
}
if (t >= mod)
t -= mod;
return static_cast<UintT>(t);
}
Here are the benchmarks performed after applying these changes starting from commit 01eca779c58c19c8659fd5a6bf0cad85613c629b.
Build: cmake --build build --parallel
Benchmarks: taskset -c 0 ./build/bin/evmone-bench-internal --benchmark_filter=evmmax* --benchmark_repetitions=10 --benchmark_format=json --benchmark_out=cios_classic.json
Comparison:
user@user:~/sandbox/evmone$ python3 ../benchmark/tools/compare.py benchmarks cios_classic.json cios_improved.json
Comparing cios_classic.json to cios_improved.json
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
evmmax_add<uint256, bn254> +0.0247 +0.0245 3 3 3 3
evmmax_add<uint256, bn254> +0.0402 +0.0400 3 4 3 4
evmmax_add<uint256, bn254> -0.0072 -0.0073 4 4 4 4
evmmax_add<uint256, bn254> +0.0150 +0.0149 4 4 4 4
evmmax_add<uint256, bn254> +0.0491 +0.0489 3 4 3 4
evmmax_add<uint256, bn254> +0.0137 +0.0136 3 3 3 3
evmmax_add<uint256, bn254> -0.0018 -0.0021 3 3 3 3
evmmax_add<uint256, bn254> -0.0073 -0.0074 3 3 3 3
evmmax_add<uint256, bn254> -0.0039 -0.0040 3 3 3 3
evmmax_add<uint256, bn254> -0.0287 -0.0290 3 3 3 3
evmmax_add<uint256, bn254>_pvalue 0.6776 0.7337 U Test, Repetitions: 10 vs 10
evmmax_add<uint256, bn254>_mean +0.0092 +0.0091 3 3 3 3
evmmax_add<uint256, bn254>_median +0.0181 +0.0180 3 3 3 3
evmmax_add<uint256, bn254>_stddev +0.2974 +0.2984 0 0 0 0
evmmax_add<uint256, bn254>_cv +0.2855 +0.2868 0 0 0 0
evmmax_add<uint256, secp256k1> +0.0261 +0.0260 3 3 3 3
evmmax_add<uint256, secp256k1> -0.0089 -0.0090 3 3 3 3
evmmax_add<uint256, secp256k1> -0.0105 -0.0107 3 3 3 3
evmmax_add<uint256, secp256k1> -0.0223 -0.0224 3 3 3 3
evmmax_add<uint256, secp256k1> -0.0231 -0.0233 3 3 3 3
evmmax_add<uint256, secp256k1> -0.0136 -0.0138 3 3 3 3
evmmax_add<uint256, secp256k1> +0.0423 +0.0421 3 3 3 3
evmmax_add<uint256, secp256k1> +0.0394 +0.0393 3 3 3 3
evmmax_add<uint256, secp256k1> +0.0058 +0.0057 3 3 3 3
evmmax_add<uint256, secp256k1> +0.0089 +0.0087 3 3 3 3
evmmax_add<uint256, secp256k1>_pvalue 0.6776 0.6776 U Test, Repetitions: 10 vs 10
evmmax_add<uint256, secp256k1>_mean +0.0041 +0.0039 3 3 3 3
evmmax_add<uint256, secp256k1>_median +0.0013 +0.0012 3 3 3 3
evmmax_add<uint256, secp256k1>_stddev -0.2987 -0.2986 0 0 0 0
evmmax_add<uint256, secp256k1>_cv -0.3015 -0.3013 0 0 0 0
evmmax_sub<uint256, bn254> +0.0036 +0.0034 2 2 2 2
evmmax_sub<uint256, bn254> -0.0068 -0.0069 2 2 2 2
evmmax_sub<uint256, bn254> +0.0163 +0.0162 2 2 2 2
evmmax_sub<uint256, bn254> +0.0066 +0.0064 2 2 2 2
evmmax_sub<uint256, bn254> +0.0417 +0.0415 2 2 2 2
evmmax_sub<uint256, bn254> +0.0421 +0.0419 2 2 2 2
evmmax_sub<uint256, bn254> +0.0298 +0.0298 2 2 2 2
evmmax_sub<uint256, bn254> +0.0404 +0.0402 2 2 2 2
evmmax_sub<uint256, bn254> +0.0800 +0.0798 2 2 2 2
evmmax_sub<uint256, bn254> +0.0197 +0.0197 2 2 2 2
evmmax_sub<uint256, bn254>_pvalue 0.0376 0.0452 U Test, Repetitions: 10 vs 10
evmmax_sub<uint256, bn254>_mean +0.0274 +0.0273 2 2 2 2
evmmax_sub<uint256, bn254>_median +0.0273 +0.0272 2 2 2 2
evmmax_sub<uint256, bn254>_stddev +0.9074 +0.9082 0 0 0 0
evmmax_sub<uint256, bn254>_cv +0.8565 +0.8576 0 0 0 0
evmmax_sub<uint256, secp256k1> +0.0005 +0.0003 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0062 -0.0062 2 2 2 2
evmmax_sub<uint256, secp256k1> +0.0020 +0.0018 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0151 -0.0152 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0151 -0.0151 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0508 -0.0509 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0428 -0.0430 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0583 -0.0584 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0616 -0.0617 2 2 2 2
evmmax_sub<uint256, secp256k1> -0.0423 -0.0425 2 2 2 2
evmmax_sub<uint256, secp256k1>_pvalue 0.0091 0.0073 U Test, Repetitions: 10 vs 10
evmmax_sub<uint256, secp256k1>_mean -0.0295 -0.0296 2 2 2 2
evmmax_sub<uint256, secp256k1>_median -0.0305 -0.0306 2 2 2 2
evmmax_sub<uint256, secp256k1>_stddev -0.7974 -0.7983 0 0 0 0
evmmax_sub<uint256, secp256k1>_cv -0.7913 -0.7922 0 0 0 0
evmmax_mul<uint256, bn254> +7.9425 +7.9408 27 243 27 243
evmmax_mul<uint256, bn254> +8.0145 +8.0139 27 241 27 241
evmmax_mul<uint256, bn254> +7.7713 +7.7701 27 236 27 236
evmmax_mul<uint256, bn254> +7.8369 +7.8361 27 237 27 237
evmmax_mul<uint256, bn254> +7.8531 +7.8522 27 235 27 235
evmmax_mul<uint256, bn254> +7.8605 +7.8592 27 237 27 237
evmmax_mul<uint256, bn254> +7.7501 +7.7493 27 235 27 235
evmmax_mul<uint256, bn254> +7.9913 +7.9902 27 242 27 242
evmmax_mul<uint256, bn254> +7.8688 +7.8676 27 238 27 238
evmmax_mul<uint256, bn254> +7.9943 +7.9935 27 240 27 240
evmmax_mul<uint256, bn254>_pvalue 0.0002 0.0002 U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, bn254>_mean +7.8884 +7.8873 27 238 27 238
evmmax_mul<uint256, bn254>_median +7.8595 +7.8584 27 237 27 237
evmmax_mul<uint256, bn254>_stddev +17.5405 +17.5057 0 3 0 3
evmmax_mul<uint256, bn254>_cv +1.0859 +1.0823 0 0 0 0
evmmax_mul<uint256, secp256k1> +7.5847 +7.5839 30 256 30 256
evmmax_mul<uint256, secp256k1> +7.5657 +7.5643 30 253 30 253
evmmax_mul<uint256, secp256k1> +8.0037 +8.0028 28 256 28 256
evmmax_mul<uint256, secp256k1> +8.1311 +8.1302 28 259 28 259
evmmax_mul<uint256, secp256k1> +8.0206 +8.0189 28 255 28 255
evmmax_mul<uint256, secp256k1> +7.6288 +7.6287 28 246 28 246
evmmax_mul<uint256, secp256k1> +7.5113 +7.5106 28 242 28 242
evmmax_mul<uint256, secp256k1> +7.4145 +7.4133 29 243 29 243
evmmax_mul<uint256, secp256k1> +7.5676 +7.5671 28 243 28 243
evmmax_mul<uint256, secp256k1> +7.2458 +7.2454 29 243 29 243
evmmax_mul<uint256, secp256k1>_pvalue 0.0002 0.0002 U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, secp256k1>_mean +7.6644 +7.6636 29 249 29 249
evmmax_mul<uint256, secp256k1>_median +7.7730 +7.7721 28 249 28 249
evmmax_mul<uint256, secp256k1>_stddev +10.6559 +10.6522 1 7 1 7
evmmax_mul<uint256, secp256k1>_cv +0.3453 +0.3450 0 0 0 0
OVERALL_GEOMEAN +1.0661 +1.0658 0 0 0 0
As you can see, the results do not seem to indicate a performance improvement. Could you review the proposed code and make suggestions? If there are no suggestions, could you also run the benchmarks?
Since I can't open a PR, let me present my work here.
Why can't you?
Since I can't open a PR, let me present my work here.
Why can't you?
My bad; it's my first open source contribution, I am discovering the processes 😢
Hey @chfast, may I ask you to review the associated PR?
EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication
Here is an example project that has applied the algorithm from the paper above: https://github.com/privacy-scaling-explorations/halo2curves/pull/134.