After running block 20240052 with a batch of 50 transactions, in the first segment of size $2^{22}$, we observe that the memory table is by far the longest. It contains approximately 14M rows, followed by the CPU with ~4M and Keccak with ~0.6M, as shown in the table.
| Table | Length |
|---|---|
| arithmetic_len | 976080 |
| byte_packing_len | 91024 |
| cpu_len | 4194304 |
| keccak_len | 618528 |
| keccak_sponge_len | 25772 |
| logic_len | 191582 |
| memory_len | 14367487 |
| mem_before_len | 64546 |
| mem_after_len | 105792 |
Memory usage disaggregated by segment is as follows:
| Segment | Rows |
|---|---|
| Code | 7635326 |
| Stack | 5516437 |
| Calldata | 136414 |
| KernelGeneral | 180674 |
| MainMemory | 449786 |
| ... | ... |
90% of the memory usage comes from the Code and Stack segments, contributing ~50% and ~40% of the rows, respectively. Notably, ~4M of the memory in the Code segment is due to the CPU reading the next opcode during execution. The remaining ~3M rows arise primarily from reading the code and RLP-encoded tries for hashing, as detailed in the following table, which disaggregates memory rows by (Segment, Operation) pairs.
| (Segment, Operation) | Rows |
|---|---|
| (Code, ReadCode) | 4194303 |
| (Code, Some(KeccakGeneral)) | 2916407 |
| (Stack, Some(Dup(1))) | 635436 |
| (Stack, TopOfTheStack) | 587655 |
| (Stack, Some(Dup(0))) | 408426 |
| (Stack, Some(Push(3))) | 406736 |
| (Stack, Some(BinaryArithmetic(Add))) | 397852 |
| (Stack, Some(Push(1))) | 378083 |
| (Stack, Some(Swap(1))) | 310266 |
| (Code, Some(Mload32Bytes)) | 237514 |
| (MainMemory, Some(Mstore32Bytes(32))) | 199104 |
| (Stack, Some(Push(5))) | 181984 |
| (Stack, Some(Dup(2))) | 190726 |
| (Stack, Some(BinaryArithmetic(Eq))) | 72024 |
| (Stack, Some(Swap(2))) | 75624 |
| (Stack, Some(BinaryArithmetic(Gt))) | 74026 |
| (Calldata, Some(Mload32Bytes)) | 75822 |
| (Stack, Some(GetContext)) | 64686 |
| (Stack, Some(TernaryArithmetic(MulMod))) | 63970 |
| (Stack, Some(MstoreGeneral)) | 62851 |
(showing 3855698 rows, which represent ~90% of a total of 4194304)
The (Code, Some(KeccakGeneral)) operations represent memory reads for hashing the RLP-encoded tries, kernel code, and other hashes.
Refactor proposal
The proposed idea is twofold:
1. Combine certain opcodes or eliminate others to reduce the number of cycles (and hence code reads).
2. Simultaneously, embed values known at compile time within the opcodes in order to reduce Stack memory operations.
We will do so for different opcodes based on their frequencies, as shown in the table below:
| Operation | Count | Percentage (%) |
|---|---|---|
| Push(3) | 406739 | 9.70 |
| BinaryArithmetic(Add) | 397852 | 9.48 |
| Push(1) | 378712 | 9.03 |
| Dup(1) | 317718 | 7.58 |
| Swap(0) | 228534 | 5.45 |
| Jumpi | 248754 | 5.93 |
| Dup(0) | 204213 | 4.87 |
| MloadGeneral | 172510 | 4.11 |
| Jump | 155863 | 3.72 |
| Swap(1) | 155133 | 3.70 |
| Pop | 105914 | 2.53 |
| Dup(2) | 95363 | 2.27 |
| BinaryArithmetic(Gt) | 74026 | 1.76 |
| Push(32) | 66708 | 1.59 |
| GetContext | 64686 | 1.54 |
| Iszero | 58091 | 1.38 |
| BinaryArithmetic(Shr) | 56061 | 1.34 |
| BinaryLogic(And) | 50990 | 1.22 |
| BinaryArithmetic(Mul) | 48145 | 1.15 |
PUSH
The primary purpose of a PUSH operation is to provide a value for a subsequent opcode. Typically, this corresponds to one of the "consuming" opcodes in the table, such as BinaryArithmetic(Add), Jumpi, MloadGeneral, Jump, etc. Since PUSH arguments are constants known at compile time, a straightforward optimization involves introducing new constant-specific opcodes, such as BinaryArithmetic(AddConst(const_addend)), JumpiConst(const_jumpdest), or MloadGeneralConst(const_addr). This approach would also reduce stack manipulation opcodes like SWAP and DUP by handling constants directly.
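The PUSH fusion described above can be sketched as a small peephole pass. This is only an illustration under stated assumptions: the instruction encoding, the mnemonics, and the `CONST_VARIANTS` mapping are hypothetical, and the pass ignores the fact that fusing changes code offsets (so jump destinations would need to be re-resolved afterwards).

```python
from typing import List, Optional, Tuple

# An instruction is (mnemonic, immediate or None). The encoding is illustrative only.
Instr = Tuple[str, Optional[int]]

# Hypothetical mapping from a consuming opcode to its constant-operand variant.
CONST_VARIANTS = {
    "ADD": "ADD_CONST",
    "JUMPI": "JUMPI_CONST",
    "JUMP": "JUMP_CONST",
    "MLOAD_GENERAL": "MLOAD_GENERAL_CONST",
}


def fuse_push(program: List[Instr]) -> List[Instr]:
    """Replace each `PUSH c` that is immediately followed by a consuming
    opcode with a single constant-specific opcode carrying `c`."""
    fused: List[Instr] = []
    i = 0
    while i < len(program):
        op, imm = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if op == "PUSH" and nxt is not None and nxt[0] in CONST_VARIANTS:
            # The pushed constant becomes the immediate of the fused opcode.
            fused.append((CONST_VARIANTS[nxt[0]], imm))
            i += 2
        else:
            fused.append((op, imm))
            i += 1
    return fused
```

Each fused pair saves one CPU cycle (one code read) and the Stack write and read that the PUSH and its consumer would otherwise perform.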
DUP(0), DUP(1), DUP(2)
We could potentially eliminate many DUP operations by introducing "non-consuming" variants of consuming opcodes. The need for DUP arises because most opcodes consume one or two topmost stack elements, and some values may need to be used multiple times.
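As a rough illustration of a non-consuming variant, the toy interpreter below compares `DUP(0)` followed by a consuming `ISZERO` with a single hypothetical `ISZERO_KEEP` opcode that reads the top of the stack without popping it. The opcode name and the interpreter are assumptions made for this sketch, not part of the existing kernel.

```python
from typing import List, Optional, Tuple

Instr = Tuple[str, Optional[int]]


def run(program: List[Instr], stack: List[int]) -> Tuple[List[int], int]:
    """Execute a toy program; the top of the stack is the last list element.
    Returns the final stack and the number of cycles (one per opcode)."""
    stack = list(stack)
    cycles = 0
    for op, arg in program:
        cycles += 1
        if op == "DUP":
            # DUP(n): copy the element n positions below the top (0 = top).
            stack.append(stack[-1 - arg])
        elif op == "ISZERO":
            # Consuming variant: pops its operand, pushes the result.
            stack.append(int(stack.pop() == 0))
        elif op == "ISZERO_KEEP":
            # Hypothetical non-consuming variant: operand stays on the stack.
            stack.append(int(stack[-1] == 0))
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack, cycles
```

Both sequences leave the same stack, but the non-consuming form needs one cycle instead of two.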
SWAP(0), SWAP(1), SWAP(2)
Without DUP operations, the remaining "pure" stack manipulation opcodes would be SWAP(0), SWAP(1), SWAP(2). These don't directly interact with memory but modify stack element addresses. An optimization could involve introducing a new opcode SWAP(i, j)_XXX for small values of i, j, where XXX is a frequently used opcode following a SWAP. The idea is for SWAP(i, j)_XXX to execute XXX using inputs offset by i and j from the stack top (e.g., XXX would become SWAP(0,1)_XXX).
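One possible reading of `SWAP(i, j)_XXX` is sketched below for binary opcodes. The note does not say what happens to the stack after the operands are read, so this sketch assumes that both operands are removed and the result is pushed on top. Under that assumption the fused form matches `SWAP(0)` or `SWAP(1)` followed by the plain opcode; deeper swaps would need a different convention. Values are plain Python integers, not 256-bit words.

```python
import operator
from typing import List

BINARY_OPS = {"ADD": operator.add, "SUB": operator.sub}


def swap(stack: List[int], n: int) -> List[int]:
    """SWAP(n): exchange the top element with the one n + 1 positions below."""
    s = list(stack)
    s[-1], s[-2 - n] = s[-2 - n], s[-1]
    return s


def binary(stack: List[int], name: str) -> List[int]:
    """Plain binary opcode: first operand is the top, second the next element."""
    s = list(stack)
    a = s.pop()
    b = s.pop()
    s.append(BINARY_OPS[name](a, b))
    return s


def swap_fused(stack: List[int], i: int, j: int, name: str) -> List[int]:
    """Hypothetical SWAP(i, j)_XXX: operands are read at offsets i and j from
    the top, both are removed, and the result is pushed (assumed semantics)."""
    if i == j:
        raise ValueError("operand offsets must differ")
    s = list(stack)
    a, b = s[-1 - i], s[-1 - j]
    for off in sorted((i, j), reverse=True):  # remove the deeper operand first
        del s[-1 - off]
    s.append(BINARY_OPS[name](a, b))
    return s
```

With this convention a plain `SUB` is `swap_fused(stack, 0, 1, "SUB")`, and `SWAP(0)` followed by `SUB` is `swap_fused(stack, 1, 0, "SUB")`, saving one cycle.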
Jump and Jumpi
Most jump destinations are known at compile time. Frequently, we observe calls like %jump(dest) with a constant dest, or JUMP operations where the return destination is a statically known value at the top of the stack. Additionally, every opcode implicitly incorporates a JUMP to pc+1. This could be generalized into opcodes of the form JUMP(b, dest)_XXX, where $b \in \{0, 1\}$ and the jump destination is $b \cdot (pc + 1) + (1-b) \cdot \text{dest}$. For example, XXX would correspond to JUMP(1, 0)_XXX.
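The destination rule for `JUMP(b, dest)_XXX` can be written out directly. The function name `next_pc` is hypothetical, and the sketch follows the note in treating the fall-through address as `pc + 1`, which ignores opcodes that carry immediate bytes.

```python
def next_pc(pc: int, b: int, dest: int) -> int:
    """Program counter after JUMP(b, dest)_XXX executes at `pc`.

    b = 1 falls through to pc + 1 (dest is ignored);
    b = 0 jumps to the compile-time constant dest.
    """
    if b not in (0, 1):
        raise ValueError("b must be 0 or 1")
    return b * (pc + 1) + (1 - b) * dest
```

For instance, `next_pc(41, 1, 0)` is the ordinary fall-through case, while `next_pc(41, 0, 100)` transfers control to address 100 without a separate JUMP cycle.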
Next steps
This note analyzes the performance of a few transactions in block 20240052. The next steps would involve extending this analysis to other blocks to validate or disprove the observations. Additionally, a preliminary optimization approach for Push, Dup, Swap, and Jump operations is presented, which collectively account for ~80% of executed opcodes and ~60% of Stack memory operations.
If these optimizations lead to a ~50% reduction in CPU cycles and Stack memory operations, overall memory usage could decrease by approximately 50%. However, these ideas remain preliminary and may be either incorrect or suboptimal. Further exploration is needed to refine or identify better strategies.
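As a sanity check on this estimate, the sketch below recomputes the saving from the row counts in the tables above. It assumes, purely for illustration, that only the CPU opcode reads (Code, ReadCode) and the Stack rows shrink while every other memory row stays fixed. Under that narrow assumption, halving both removes about a third of the memory table, so approaching 50% overall would presumably also require shrinking other contributions such as the KeccakGeneral reads.

```python
# Row counts taken from the tables in this note.
MEMORY_ROWS = 14_367_487
CODE_READS_BY_CPU = 4_194_303   # (Code, ReadCode)
STACK_ROWS = 5_516_437          # Stack segment


def memory_saving(cycle_reduction: float, stack_reduction: float) -> float:
    """Fraction of memory rows removed if CPU code reads and Stack rows are
    reduced by the given fractions and all other rows are unchanged."""
    saved = cycle_reduction * CODE_READS_BY_CPU + stack_reduction * STACK_ROWS
    return saved / MEMORY_ROWS


if __name__ == "__main__":
    print(f"{memory_saving(0.5, 0.5):.1%}")
```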