ethereum / evmone

Fast Ethereum Virtual Machine implementation
Apache License 2.0
827 stars 273 forks source link

Optimize EVMMAX montgomery multiplication #777

Open chfast opened 8 months ago

chfast commented 8 months ago

Now that we have benchmarks for EVMMAX precompiles we can experiment with different Montgomery multiplication algorithms. Currently we use CIOS.

The reference paper we used have a summary of the complexity of each algorithm. However, these may not be directly applicable to small numbers like 256-bit. Check also what other projects are doing.

triggeredcode commented 8 months ago

@chfast Could I help here? I am good with algorithms, also it will help if I can get a bit more context. Thanks

chfast commented 8 months ago

Hi @triggeredcode,

The two important references are already in the description. You should check out our existing code implementing Montgomery multiplication using CIOS algorithm. https://github.com/ethereum/evmone/blob/v0.11.0/lib/evmmax/evmmax.cpp#L55-L83.

We want to know if swapping this algorithm with any other would bring performance improvements. The good overview of the algorithms is https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf. But you can also do your research.

This code is well tested and we have benchmarks so it should be relatively easy to get feedback.

chfast commented 7 months ago

Possibly generate the algorithms with https://github.com/mit-plv/fiat-crypto.

clementjuventin commented 2 days ago

Hey @chfast, I just saw this issue and I would like to give it a try.

However, I am a beginner and this is my first time working with a benchmark, if I understand correctly, the goal is to improve the performance of the benchmark.

Running this command ./build/bin/evmone-bench test/internal_benchmarks twice, I got these results:

// Execution 1
advanced/total/synth/loop_v1                 5738 ns         5738 ns       121862 gas_rate=1.02273G/s gas_used=5.868k
// Execution 2
advanced/total/synth/loop_v1                 5763 ns         5762 ns       121192 gas_rate=1.01833G/s gas_used=5.868k

So if I understand well, the purpose is to try different implementations of the CIOS version of the Montgomery algorithm such as the FIPS version, Barrett Reduction, or Karatsuba Multiplication and see which one is the most efficient according to the benchmark.

The point I’m wondering about is whether the existing unit tests will be sufficient to ensure the quality of the tested algorithm

Thank you in advance for your response. I am not committing to completing this issue, but I find the problem very interesting and I would like to be able to contribute.

clementjuventin commented 2 hours ago

Hello @chfast,

I have made progress since the message and implemented the Separated Operand Scanning (SOS) version of the Montgomery algorithm to compare it with the current version. I’m leaving the benchmark traces below, but could you give me some guidance on how to achieve a quality benchmark?

CIOS Implementation

user@user:~/sandbox/evmone$ ./build/bin/evmone-bench-internal --benchmark_filter=evmmax*
2024-09-07T14:06:36+02:00
Running ./build/bin/evmone-bench-internal
Run on (16 X 4463 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.41, 1.11, 1.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
evmmax_add<uint256, bn254>           3.48 ns         3.48 ns    202969266
evmmax_add<uint256, secp256k1>       2.79 ns         2.79 ns    250441458
evmmax_sub<uint256, bn254>           1.78 ns         1.78 ns    393497888
evmmax_sub<uint256, secp256k1>       1.78 ns         1.78 ns    393838716
evmmax_mul<uint256, bn254>           27.5 ns         27.5 ns     25418032
evmmax_mul<uint256, secp256k1>       29.1 ns         29.1 ns     23971772

SOS Implementation

user@user:~/sandbox/evmone$ ./build/bin/evmone-bench-internal --benchmark_filter=evmmax*
2024-09-07T14:08:29+02:00
Running ./build/bin/evmone-bench-internal
Run on (16 X 4463 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.83, 1.38, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
evmmax_add<uint256, bn254>           3.43 ns         3.43 ns    205808992
evmmax_add<uint256, secp256k1>       2.85 ns         2.85 ns    245437966
evmmax_sub<uint256, bn254>           1.79 ns         1.79 ns    392664952
evmmax_sub<uint256, secp256k1>       1.77 ns         1.77 ns    392497508
evmmax_mul<uint256, bn254>           26.7 ns         26.7 ns     26228834
evmmax_mul<uint256, secp256k1>       19.1 ns         19.1 ns     36523676

There are differences that seem to suggest that the SOS version could be more efficient on certain benchmarks, but how can I be sure of the accuracy of the results? Plus, the code compiles, and the unit tests pass, but is that enough to prove the correct implementation of my algorithm?

Could you share the methodology you used or some resources that will help me move forward in line with the project expectations?