Implement Fast CIOS for Montgomery modular multiplication

Hey @chfast, thanks for redirecting me from https://github.com/ethereum/evmone/issues/777.

Since I can't open a PR, let me present my work here.

Here is the mul method from the evmmax.hpp file implementing the improvement of the CIOS variant of the Montgomery algorithm proposed in the following document: EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication

    /// Performs a Montgomery modular multiplication.
    ///
    /// Inputs must be in Montgomery form: x = aR, y = bR.
    /// This computes Montgomery multiplication xyR⁻¹ % mod what gives aRbRR⁻¹ % mod = abR % mod.
    /// The result (abR) is in Montgomery form.
    constexpr UintT mul(const UintT& x, const UintT& y) const noexcept
    {
        // Coarsely Integrated Operand Scanning (CIOS) Method
        // Based on 2.3.2 from
        // High-Speed Algorithms & Architectures For Number-Theoretic Cryptosystems
        // https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf
        // and on 2.2 from
        // EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication
        // https://eprint.iacr.org/2022/1400.pdf

        constexpr auto S = UintT::num_words;  // TODO(C++23): Make it static

        intx::uint<UintT::num_bits + 64> t;
        if (mod[S - 1] < std::numeric_limits<uint64_t>::max() / 2)
        {
            for (size_t i = 0; i != S; ++i)
            {
                uint64_t c = 0;
                for (size_t j = 0; j != S; ++j)
                    std::tie(c, t[j]) = addmul(t[j], x[j], y[i], c);
                auto const c_2 = c;
                const auto m = t[0] * m_mod_inv;
                std::tie(c, std::ignore) = addmul(t[0], m, mod[0], 0);
                for (size_t j = 1; j != S; ++j)
                    std::tie(c, t[j - 1]) = addmul(t[j], m, mod[j], c);
                t[S - 1] = c_2 + c;
            }
        }
        else
        {
            for (size_t i = 0; i != S; ++i)
            {
                uint64_t c = 0;
                for (size_t j = 0; j != S; ++j)
                    std::tie(c, t[j]) = addmul(t[j], x[j], y[i], c);
                auto tmp = intx::addc(t[S], c);
                t[S] = tmp.value;
                const auto d = tmp.carry;  // TODO: Carry is 0 for sparse modulus.

                const auto m = t[0] * m_mod_inv;
                std::tie(c, std::ignore) = addmul(t[0], m, mod[0], 0);
                for (size_t j = 1; j != S; ++j)
                    std::tie(c, t[j - 1]) = addmul(t[j], m, mod[j], c);
                tmp = intx::addc(t[S], c);
                t[S - 1] = tmp.value;
                t[S] = d + tmp.carry;  // TODO: Carry is 0 for sparse modulus.
            }
        }

        if (t >= mod)
            t -= mod;

        return static_cast<UintT>(t);
    }

Here are the benchmarks performed after applying these changes starting from commit 01eca779c58c19c8659fd5a6bf0cad85613c629b.

Build: cmake --build build --parallel

Benchmarks: taskset -c 0 ./build/bin/evmone-bench-internal --benchmark_filter=evmmax* --benchmark_repetitions=10 --benchmark_format=json --benchmark_out=cios_classic.json

Comparison:

user@user:~/sandbox/evmone$ python3 ../benchmark/tools/compare.py benchmarks cios_classic.json cios_improved.json 
Comparing cios_classic.json to cios_improved.json
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
evmmax_add<uint256, bn254>                           +0.0247         +0.0245             3             3             3             3
evmmax_add<uint256, bn254>                           +0.0402         +0.0400             3             4             3             4
evmmax_add<uint256, bn254>                           -0.0072         -0.0073             4             4             4             4
evmmax_add<uint256, bn254>                           +0.0150         +0.0149             4             4             4             4
evmmax_add<uint256, bn254>                           +0.0491         +0.0489             3             4             3             4
evmmax_add<uint256, bn254>                           +0.0137         +0.0136             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0018         -0.0021             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0073         -0.0074             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0039         -0.0040             3             3             3             3
evmmax_add<uint256, bn254>                           -0.0287         -0.0290             3             3             3             3
evmmax_add<uint256, bn254>_pvalue                     0.6776          0.7337      U Test, Repetitions: 10 vs 10
evmmax_add<uint256, bn254>_mean                      +0.0092         +0.0091             3             3             3             3
evmmax_add<uint256, bn254>_median                    +0.0181         +0.0180             3             3             3             3
evmmax_add<uint256, bn254>_stddev                    +0.2974         +0.2984             0             0             0             0
evmmax_add<uint256, bn254>_cv                        +0.2855         +0.2868             0             0             0             0
evmmax_add<uint256, secp256k1>                       +0.0261         +0.0260             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0089         -0.0090             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0105         -0.0107             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0223         -0.0224             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0231         -0.0233             3             3             3             3
evmmax_add<uint256, secp256k1>                       -0.0136         -0.0138             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0423         +0.0421             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0394         +0.0393             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0058         +0.0057             3             3             3             3
evmmax_add<uint256, secp256k1>                       +0.0089         +0.0087             3             3             3             3
evmmax_add<uint256, secp256k1>_pvalue                 0.6776          0.6776      U Test, Repetitions: 10 vs 10
evmmax_add<uint256, secp256k1>_mean                  +0.0041         +0.0039             3             3             3             3
evmmax_add<uint256, secp256k1>_median                +0.0013         +0.0012             3             3             3             3
evmmax_add<uint256, secp256k1>_stddev                -0.2987         -0.2986             0             0             0             0
evmmax_add<uint256, secp256k1>_cv                    -0.3015         -0.3013             0             0             0             0
evmmax_sub<uint256, bn254>                           +0.0036         +0.0034             2             2             2             2
evmmax_sub<uint256, bn254>                           -0.0068         -0.0069             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0163         +0.0162             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0066         +0.0064             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0417         +0.0415             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0421         +0.0419             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0298         +0.0298             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0404         +0.0402             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0800         +0.0798             2             2             2             2
evmmax_sub<uint256, bn254>                           +0.0197         +0.0197             2             2             2             2
evmmax_sub<uint256, bn254>_pvalue                     0.0376          0.0452      U Test, Repetitions: 10 vs 10
evmmax_sub<uint256, bn254>_mean                      +0.0274         +0.0273             2             2             2             2
evmmax_sub<uint256, bn254>_median                    +0.0273         +0.0272             2             2             2             2
evmmax_sub<uint256, bn254>_stddev                    +0.9074         +0.9082             0             0             0             0
evmmax_sub<uint256, bn254>_cv                        +0.8565         +0.8576             0             0             0             0
evmmax_sub<uint256, secp256k1>                       +0.0005         +0.0003             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0062         -0.0062             2             2             2             2
evmmax_sub<uint256, secp256k1>                       +0.0020         +0.0018             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0151         -0.0152             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0151         -0.0151             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0508         -0.0509             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0428         -0.0430             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0583         -0.0584             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0616         -0.0617             2             2             2             2
evmmax_sub<uint256, secp256k1>                       -0.0423         -0.0425             2             2             2             2
evmmax_sub<uint256, secp256k1>_pvalue                 0.0091          0.0073      U Test, Repetitions: 10 vs 10
evmmax_sub<uint256, secp256k1>_mean                  -0.0295         -0.0296             2             2             2             2
evmmax_sub<uint256, secp256k1>_median                -0.0305         -0.0306             2             2             2             2
evmmax_sub<uint256, secp256k1>_stddev                -0.7974         -0.7983             0             0             0             0
evmmax_sub<uint256, secp256k1>_cv                    -0.7913         -0.7922             0             0             0             0
evmmax_mul<uint256, bn254>                           +7.9425         +7.9408            27           243            27           243
evmmax_mul<uint256, bn254>                           +8.0145         +8.0139            27           241            27           241
evmmax_mul<uint256, bn254>                           +7.7713         +7.7701            27           236            27           236
evmmax_mul<uint256, bn254>                           +7.8369         +7.8361            27           237            27           237
evmmax_mul<uint256, bn254>                           +7.8531         +7.8522            27           235            27           235
evmmax_mul<uint256, bn254>                           +7.8605         +7.8592            27           237            27           237
evmmax_mul<uint256, bn254>                           +7.7501         +7.7493            27           235            27           235
evmmax_mul<uint256, bn254>                           +7.9913         +7.9902            27           242            27           242
evmmax_mul<uint256, bn254>                           +7.8688         +7.8676            27           238            27           238
evmmax_mul<uint256, bn254>                           +7.9943         +7.9935            27           240            27           240
evmmax_mul<uint256, bn254>_pvalue                     0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, bn254>_mean                      +7.8884         +7.8873            27           238            27           238
evmmax_mul<uint256, bn254>_median                    +7.8595         +7.8584            27           237            27           237
evmmax_mul<uint256, bn254>_stddev                   +17.5405        +17.5057             0             3             0             3
evmmax_mul<uint256, bn254>_cv                        +1.0859         +1.0823             0             0             0             0
evmmax_mul<uint256, secp256k1>                       +7.5847         +7.5839            30           256            30           256
evmmax_mul<uint256, secp256k1>                       +7.5657         +7.5643            30           253            30           253
evmmax_mul<uint256, secp256k1>                       +8.0037         +8.0028            28           256            28           256
evmmax_mul<uint256, secp256k1>                       +8.1311         +8.1302            28           259            28           259
evmmax_mul<uint256, secp256k1>                       +8.0206         +8.0189            28           255            28           255
evmmax_mul<uint256, secp256k1>                       +7.6288         +7.6287            28           246            28           246
evmmax_mul<uint256, secp256k1>                       +7.5113         +7.5106            28           242            28           242
evmmax_mul<uint256, secp256k1>                       +7.4145         +7.4133            29           243            29           243
evmmax_mul<uint256, secp256k1>                       +7.5676         +7.5671            28           243            28           243
evmmax_mul<uint256, secp256k1>                       +7.2458         +7.2454            29           243            29           243
evmmax_mul<uint256, secp256k1>_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, secp256k1>_mean                  +7.6644         +7.6636            29           249            29           249
evmmax_mul<uint256, secp256k1>_median                +7.7730         +7.7721            28           249            28           249
evmmax_mul<uint256, secp256k1>_stddev               +10.6559        +10.6522             1             7             1             7
evmmax_mul<uint256, secp256k1>_cv                    +0.3453         +0.3450             0             0             0             0
OVERALL_GEOMEAN                                      +1.0661         +1.0658             0             0             0             0

As you can see, the results do not seem to indicate a performance improvement. Could you review the proposed code and make suggestions? If there are no suggestions, could you also run the benchmarks?

ethereum / evmone

Implement Fast CIOS for Montgomery modular multiplication #869