ethereum / evmone

Fast Ethereum Virtual Machine implementation
Apache License 2.0
864 stars 286 forks source link

Implement Fast CIOS for Montgomery modular multiplication #1009

Open clementjuventin opened 2 months ago

clementjuventin commented 2 months ago

This PR implements the improvement of the CIOS variant of the Montgomery algorithm showcased in the following document: EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication and proposed in https://github.com/ethereum/evmone/issues/869. The optimization occurs when most_significant_p_word < (word_size / 2), p is represented by the variable mod in the current implementation.

Here are the benchmarks performed after applying these changes starting from commit 01eca779c58c19c8659fd5a6bf0cad85613c629b.

Build: cmake --build build --parallel

Benchmarks: taskset -c 0 ./build/bin/evmone-bench-internal --benchmark_filter=evmmax* --benchmark_repetitions=10 --benchmark_format=json --benchmark_out=cios_classic.json

Comparison:

user@user:~/sandbox/evmone$ python3 ../benchmark/tools/compare.py benchmarks cios_classic.json cios_improved.json 
Comparing cios_classic.json to cios_improved.json
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
evmmax_mul<uint256, bn254>_pvalue                     0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, bn254>_mean                      +7.8884         +7.8873            27           238            27           238
evmmax_mul<uint256, bn254>_median                    +7.8595         +7.8584            27           237            27           237
evmmax_mul<uint256, bn254>_stddev                   +17.5405        +17.5057             0             3             0             3
evmmax_mul<uint256, bn254>_cv                        +1.0859         +1.0823             0             0             0             0
evmmax_mul<uint256, secp256k1>_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, secp256k1>_mean                  +7.6644         +7.6636            29           249            29           249
evmmax_mul<uint256, secp256k1>_median                +7.7730         +7.7721            28           249            28           249
evmmax_mul<uint256, secp256k1>_stddev               +10.6559        +10.6522             1             7             1             7
evmmax_mul<uint256, secp256k1>_cv                    +0.3453         +0.3450             0             0             0             0
OVERALL_GEOMEAN                                      +1.0661         +1.0658             0             0             0             0

As you can see, the results do not indicate a performance improvement.

clementjuventin commented 2 months ago

After further investigation, I found another way of ordering branches (second commit) and obtained what we were looking for (~15% gain on evmmax_mul<uint256, bn254>)!

evmmax_mul<uint256, bn254>_median                    -0.1529         -0.1529            27            23            27            23
evmmax_mul<uint256, secp256k1>_median                +0.0059         +0.0058            28            28            28            28

I still don't get why this implementation is better, even assuming branch prediction makes the checks insignificant. I would also like to compare the assembly code in case there is important optimization under the hood but I never did so let's see if I manage to do it