Refactor Cpu and Memory use

After running block 20240052 batching 50 transactions, in the first segment of size 2^{22}, we observe that the memory table is by far the longest. It contains approximately 14M rows, followed by the CPU with ~4M and Keccak with ~0.5M, as shown in the table.

Table	length
arithmetic_len	976080
byte_packing_len	91024
cpu_len	4194304
keccak_len	618528
keccak_sponge_len	25772
logic_len	191582
memory_len	14367487
mem_before_len	64546
mem_after_len	105792

Memory usage disaggregated by context is as follows:	Key	Value
Code	7635326
Stack	5516437
Calldata	136414
KernelGeneral	180674
MainMemory	449786
...	...

90% of the memory usage comes from the Code and Stack segments, contributing ~50% and ~40% of the rows, respectively. Notably, ~4M of the memory in the Code segment is due to the CPU reading the next opcode during execution. The remaining ~3M rows arise primarily from reading the code and RLP-encoded tries for hashing, as detailed in the following table, which disaggregates memory rows by (Segment, Operation) pairs.

(Segment, Operation)	Value
(Code, ReadCode)	4194303
(Code, Some(KeccakGeneral))	2916407
(Stack, Some(Dup(1)))	635436
(Stack, TopOfTheStack)	587655
(Stack, Some(Dup(0)))	408426
(Stack, Some(Push(3)))	406736
(Stack, Some(BinaryArithmetic(Add))	397852
(Stack, Some(Push(1)))	378083
(Stack, Some(Swap(1)))	310266
(Code, Some(Mload32Bytes))	237514
(MainMemory, Some(Mstore32Bytes(32)))	199104
(Stack, Some(Push(5)))	181984
(Stack, Some(Dup(2)))	190726
(Stack, Some(BinaryArithmetic(Eq)))	72024
(Stack, Some(Swap(2)))	75624
(Stack, Some(BinaryArithmetic(Gt)))	74026
(Calldata, Some(Mload32Bytes))	75822
(Stack, Some(GetContext))	64686
(Stack, Some(TernaryArithmetic(MulMod)))	63970
(Stack, Some(MstoreGeneral))	62851

(showing 3855698 rows, which represents ~90% from a total of 4194304)

The (0, Some(KeccakGeneral)) operations represent memory reads for hashing the RLP-encoded tries, kernel code, and other hashes.

Refactor proposal

The proposed idea is twofold:

Combine certain opcodes or eliminate others to reduce the number of cycles (and hence code reads).
Simultaneously, embed the known-at-compile time values within the opcodes in order to reduce Stack memory operations.

We will do so for different opcodes based on their frequencies, as shown in the table below:

Operation	Value	Percentage (%)
Push(3)	406739	9.70
BinaryArithmetic(Add)	397852	9.48
Push(1)	378712	9.03
Dup(1)	317718	7.58
Swap(0)	228534	5.45
Jumpi	248754	5.93
Dup(0)	204213	4.87
MloadGeneral	172510	4.11
Jump	155863	3.72
Swap(1)	155133	3.70
Pop	105914	2.53
Dup(2)	95363	2.27
BinaryArithmetic(Gt)	74026	1.76
Push(32)	66708	1.59
GetContext	64686	1.54
Iszero	58091	1.38
BinaryArithmetic(Shr)	56061	1.34
BinaryLogic(And)	50990	1.22
BinaryArithmetic(Mul)	48145	1.15

`PUSH`

The primary purpose of a PUSH operation is to provide a value for a subsequent opcode. Typically, this corresponds to one of the "consuming" opcodes in the table, such as BinaryArithmetic(Add), Jumpi, MloadGeneral, Jump, etc. Since PUSH arguments are constants known at compile time, a straightforward optimization involves introducing new constant-specific opcodes, such as BinaryArithmetic(AddConst(const_addend)), JumpiConst(const_jumpdest), or MloadGeneralConst(const_addr). This approach would also reduce stack manipulation opcodes like SWAP and DUP by handling constants directly.

`DUP(0), DUP(1), DUP(2)`

We could potentially eliminate many DUP operations by introducing "non-consuming" variants of consuming opcodes. The need for DUP arises because most opcodes consume one or two topmost stack elements, and some values may need to be used multiple times.

`SWAP(0), SWAP(1), SWAP(2)` and

Without DUP operations, the remaining "pure" stack manipulation opcodes would be SWAP(0), SWAP(1), SWAP(2). These don't directly interact with memory but modify stack element addresses. An optimization could involve introducing a new opcode SWAP(i, j)_XXX for small values of i, j, where XXX is a frequently used opcode following a SWAP. The idea is for SWAP(i, j)_XXX to execute XXX using inputs offset by i and j from the stack top (e.g., XXX would become SWAP(0,1)_XXX).

`Jump` and `Jumpi`

Most jump destinations are known at compile time. Frequently, we observe calls like %jump(dest) with a constant dest or JUMP operations where the return destination is a statically-known-value at the stack top. Additionally, all opcodes essentially incorporate a JUMP to pc+1. This could be generalized into opcodes of the form JUMP(b, dest)_XXX, where (b \in {0, 1}) and the jump destination is (b \cdot (pc + 1) + (1-b) \cdot \text{dest}). For example, XXX would correspond to JUMP(1, 0)_XXX.

Next steps

This note analyzes the performance of a few transactions in block 20240052. The next steps would involve extending this analysis to other blocks to validate or disprove the observations. Additionally, a preliminary optimization approach for Push, Dup, Swap, and Jump operations is presented, which collectively account for ~80% of executed opcodes and ~60% of Stack memory operations.
If these optimizations lead to a ~50% reduction in CPU cycles and Stack memory operations, overall memory usage could decrease by approximately 50%. However, these ideas remain preliminary and may be either incorrect or suboptimal. Further exploration is needed to refine or identify better strategies.

0xPolygonZero / zk_evm