0xPolygonZero / zk_evm

Apache License 2.0
85 stars 37 forks source link

Refactor Cpu and Memory use #800

Open 4l0n50 opened 3 hours ago

4l0n50 commented 3 hours ago

After running block 20240052 batching 50 transactions, in the first segment of size 2^{22}, we observe that the memory table is by far the longest. It contains approximately 14M rows, followed by the CPU with ~4M and Keccak with ~0.5M, as shown in the table.

Table length
arithmetic_len 976080
byte_packing_len 91024
cpu_len 4194304
keccak_len 618528
keccak_sponge_len 25772
logic_len 191582
memory_len 14367487
mem_before_len 64546
mem_after_len 105792
Memory usage disaggregated by context is as follows: Key Value
Code 7635326
Stack 5516437
Calldata 136414
KernelGeneral 180674
MainMemory 449786
... ...

90% of the memory usage comes from the Code and Stack segments, contributing ~50% and ~40% of the rows, respectively. Notably, ~4M of the memory in the Code segment is due to the CPU reading the next opcode during execution. The remaining ~3M rows arise primarily from reading the code and RLP-encoded tries for hashing, as detailed in the following table, which disaggregates memory rows by (Segment, Operation) pairs.

(Segment, Operation) Value
(Code, ReadCode) 4194303
(Code, Some(KeccakGeneral)) 2916407
(Stack, Some(Dup(1))) 635436
(Stack, TopOfTheStack) 587655
(Stack, Some(Dup(0))) 408426
(Stack, Some(Push(3))) 406736
(Stack, Some(BinaryArithmetic(Add)) 397852
(Stack, Some(Push(1))) 378083
(Stack, Some(Swap(1))) 310266
(Code, Some(Mload32Bytes)) 237514
(MainMemory, Some(Mstore32Bytes(32))) 199104
(Stack, Some(Push(5))) 181984
(Stack, Some(Dup(2))) 190726
(Stack, Some(BinaryArithmetic(Eq))) 72024
(Stack, Some(Swap(2))) 75624
(Stack, Some(BinaryArithmetic(Gt))) 74026
(Calldata, Some(Mload32Bytes)) 75822
(Stack, Some(GetContext)) 64686
(Stack, Some(TernaryArithmetic(MulMod))) 63970
(Stack, Some(MstoreGeneral)) 62851

(showing 3855698 rows, which represents ~90% from a total of 4194304)

The (0, Some(KeccakGeneral)) operations represent memory reads for hashing the RLP-encoded tries, kernel code, and other hashes.

Refactor proposal

The proposed idea is twofold:

  1. Combine certain opcodes or eliminate others to reduce the number of cycles (and hence code reads).
  2. Simultaneously, embed the known-at-compile time values within the opcodes in order to reduce Stack memory operations.

We will do so for different opcodes based on their frequencies, as shown in the table below:

Operation Value Percentage (%)
Push(3) 406739 9.70
BinaryArithmetic(Add) 397852 9.48
Push(1) 378712 9.03
Dup(1) 317718 7.58
Swap(0) 228534 5.45
Jumpi 248754 5.93
Dup(0) 204213 4.87
MloadGeneral 172510 4.11
Jump 155863 3.72
Swap(1) 155133 3.70
Pop 105914 2.53
Dup(2) 95363 2.27
BinaryArithmetic(Gt) 74026 1.76
Push(32) 66708 1.59
GetContext 64686 1.54
Iszero 58091 1.38
BinaryArithmetic(Shr) 56061 1.34
BinaryLogic(And) 50990 1.22
BinaryArithmetic(Mul) 48145 1.15

PUSH

The primary purpose of a PUSH operation is to provide a value for a subsequent opcode. Typically, this corresponds to one of the "consuming" opcodes in the table, such as BinaryArithmetic(Add), Jumpi, MloadGeneral, Jump, etc. Since PUSH arguments are constants known at compile time, a straightforward optimization involves introducing new constant-specific opcodes, such as BinaryArithmetic(AddConst(const_addend)), JumpiConst(const_jumpdest), or MloadGeneralConst(const_addr). This approach would also reduce stack manipulation opcodes like SWAP and DUP by handling constants directly.

DUP(0), DUP(1), DUP(2)

We could potentially eliminate many DUP operations by introducing "non-consuming" variants of consuming opcodes. The need for DUP arises because most opcodes consume one or two topmost stack elements, and some values may need to be used multiple times.

SWAP(0), SWAP(1), SWAP(2) and

Without DUP operations, the remaining "pure" stack manipulation opcodes would be SWAP(0), SWAP(1), SWAP(2). These don't directly interact with memory but modify stack element addresses. An optimization could involve introducing a new opcode SWAP(i, j)_XXX for small values of i, j, where XXX is a frequently used opcode following a SWAP. The idea is for SWAP(i, j)_XXX to execute XXX using inputs offset by i and j from the stack top (e.g., XXX would become SWAP(0,1)_XXX).

Jump and Jumpi

Most jump destinations are known at compile time. Frequently, we observe calls like %jump(dest) with a constant dest or JUMP operations where the return destination is a statically-known-value at the stack top. Additionally, all opcodes essentially incorporate a JUMP to pc+1. This could be generalized into opcodes of the form JUMP(b, dest)_XXX, where (b \in {0, 1}) and the jump destination is (b \cdot (pc + 1) + (1-b) \cdot \text{dest}). For example, XXX would correspond to JUMP(1, 0)_XXX.

Next steps

This note analyzes the performance of a few transactions in block 20240052. The next steps would involve extending this analysis to other blocks to validate or disprove the observations. Additionally, a preliminary optimization approach for Push, Dup, Swap, and Jump operations is presented, which collectively account for ~80% of executed opcodes and ~60% of Stack memory operations.
If these optimizations lead to a ~50% reduction in CPU cycles and Stack memory operations, overall memory usage could decrease by approximately 50%. However, these ideas remain preliminary and may be either incorrect or suboptimal. Further exploration is needed to refine or identify better strategies.

4l0n50 commented 3 hours ago

I'm getting the benchmarks using this branch. I'm running the trace_decoder tests and (in not-so-elegant way) collecting and printing the data.