ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0

Reducing CachedBeaconState size #5857

Open dapplion opened 1 year ago

dapplion commented 1 year ago

Problem description

Lodestar is relatively inefficient when representing sparse data, such as a persistent merkle tree:

| State | VC len | SSZ | TreeDU | CachedState |
|---|---|---|---|---|
| mainnet_1880255 | 216,541 | 32 MB | 229 MB | 292 MB |
| mainnet_2620447 | 263,370 | 39 MB | 271 MB | 356 MB |
| mainnet_6963647 | 852,585 | 121 MB | 837 MB | 1073 MB (x8.8 SSZ) |
| forecast | 1,400,000 | 200 MB | - | 1700 MB |
| forecast | 1,700,000 | 240 MB | - | 2100 MB |
| forecast | 2,300,000 | 330 MB | - | 2900 MB |

Summary of state size by index count

Why is the CachedState so heavy?

The memory cost of basic building blocks of CachedBeaconState using @chainsafe/persistent-merkle-tree is

Validator                   - 625.8 bytes / instance
BranchNodeStruct(Validator) - 765.1 bytes / instance
LeafNode                    - 103.1 bytes / instance
BranchNode                  - 131.4 bytes / instance
blst.PublicKey(affine)      - 82.4 bytes / instance

While these numbers are not great, a lot of work has already been done to reduce them from a naive initial design, e.g. using a pseudo fixed-size Uint8Array and the HashObject representation.
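For context, the HashObject idea stores a 32-byte hash as eight 32-bit integer fields on a plain object, rather than as a separate heap-allocated Uint8Array per node. The field names below follow the published HashObject interface; the pack/unpack helpers are an illustrative sketch, not the library's code:

```typescript
// Sketch of the HashObject representation used by the persistent-merkle-tree
// stack: eight signed 32-bit fields on a plain object instead of one
// heap-allocated Uint8Array per 32-byte hash (helpers are illustrative).
interface HashObject {
  h0: number; h1: number; h2: number; h3: number;
  h4: number; h5: number; h6: number; h7: number;
}

function byteArrayToHashObject(bytes: Uint8Array): HashObject {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    h0: view.getInt32(0, true), h1: view.getInt32(4, true),
    h2: view.getInt32(8, true), h3: view.getInt32(12, true),
    h4: view.getInt32(16, true), h5: view.getInt32(20, true),
    h6: view.getInt32(24, true), h7: view.getInt32(28, true),
  };
}

function hashObjectToByteArray(ho: HashObject): Uint8Array {
  const bytes = new Uint8Array(32);
  const view = new DataView(bytes.buffer);
  view.setInt32(0, ho.h0, true); view.setInt32(4, ho.h1, true);
  view.setInt32(8, ho.h2, true); view.setInt32(12, ho.h3, true);
  view.setInt32(16, ho.h4, true); view.setInt32(20, ho.h5, true);
  view.setInt32(24, ho.h6, true); view.setInt32(28, ho.h7, true);
  return bytes;
}
```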

At large index counts, those numbers add up fast. Using mainnet_6963647 with 852,585 indexes (note that the branch node count of a tree with N leaves is ~N):

validators data:  765 * 852585     = 652 MB
validators cache: 131 * 852585     = 111 MB
balances data:    103 * 852585 / 4 = 21  MB
balances cache:   131 * 852585 / 4 = 28  MB
inactivity data:  103 * 852585 / 4 = 21  MB
inactivity cache: 131 * 852585 / 4 = 28  MB
TOTAL ------------------------------ 861 MB (837 MB measured)
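The arithmetic above can be checked directly from the per-instance costs listed earlier. MB here is 10^6 bytes, and the /4 factors come from each 32-byte LeafNode packing four uint64 values; the quoted 861 MB total sums the pre-rounded per-row figures, hence the small discrepancy:

```typescript
// Reproduce the BeaconState tree-memory estimate for mainnet_6963647
// from the per-instance byte costs measured above.
const N = 852_585; // validator count
const BYTES = {
  branchNodeStruct: 765, // BranchNodeStruct(Validator), one per validator
  branchNode: 131,       // internal BranchNode of the hashing cache
  leafNode: 103,         // one LeafNode packs four uint64s (32 bytes)
};
const MB = 1e6;

const validatorsData = BYTES.branchNodeStruct * N;
const validatorsCache = BYTES.branchNode * N;
// balances and inactivity_scores are uint64 lists: 4 values per 32-byte leaf
const packedData = (BYTES.leafNode * N) / 4;
const packedCache = (BYTES.branchNode * N) / 4;

const total = validatorsData + validatorsCache + 2 * (packedData + packedCache);
console.log((total / MB).toFixed(0), "MB"); // ≈ 864 MB (837 MB measured)
```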

Breakdown of main contributors to BeaconState Tree memory

blst.Pubkeys:      82 * 852585     = 70  MB
PubkeyMap:        157 * 852585     = 133 MB
TOTAL ------------------------------ 203 MB (236 MB measured, 1073 - 837)

Breakdown of main contributors to BeaconStateCache memory

Note that the BeaconStateCache memory cost is paid once per application, and most of the validators array is shared between all states in memory.

Can we do better?

Our main problem is how to represent short binary data (hashes, pubkeys) efficiently in JavaScript. The only way to approach a ~1x ratio of data represented to memory used is to pack multiple items into a large ArrayBuffer, so that the per-instance overhead is amortized.
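As a sketch of the amortization idea (class and method names here are illustrative, not Lodestar code): instead of one heap-allocated Uint8Array per 32-byte hash, all items live in one backing buffer and reads return zero-copy views:

```typescript
// Minimal sketch of packing fixed-size 32-byte items into one backing buffer
// so the per-item JS object overhead is amortized (illustrative names).
class FlatHashStore {
  private readonly data: Uint8Array;
  constructor(readonly capacity: number) {
    this.data = new Uint8Array(capacity * 32);
  }
  set(index: number, hash: Uint8Array): void {
    if (hash.length !== 32) throw new Error("hash must be 32 bytes");
    this.data.set(hash, index * 32);
  }
  /** Returns a view into the backing buffer: zero-copy, ~1x memory. */
  get(index: number): Uint8Array {
    return this.data.subarray(index * 32, (index + 1) * 32);
  }
}
```

Note that `get` returns a view into the shared buffer, so callers must copy before mutating if they need isolation.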

The BeaconState has the nice property where not much data is changed at once.

Properties

| Beacon state field | Type | bytes/idx | Per-block changes | Per-epoch changes |
|---|---|---|---|---|
| block_roots, state_roots | Vector[Root, 8192] | - | W 1 | - |
| eth1_data_votes | List[Eth1Data, 2048] | - | R all, W 1 | - |
| validators | List[Validator, 1M] | 121 | W (16:2000) | W (0:1M) |
| balances | List[Gwei, 1M] | 8 | W (512:~528) | W 1M |
| randao_mixes | Vector[Bytes32, 65536] | - | W 1 | W 1 |
| slashings | Vector[Gwei, 8192] | - | W (0:1) | R 1M, W 1 |
| epoch_participation | List[uint8, 1M] | 2 | W 31250 | R 1M, W 1M |
| inactivity_scores | List[uint64, 1M] | 8 | - | R 1M, W (0:1M) |
| historical_summaries | List[HistoricalSummary, ∞] | - | - | W 1 |

Block processing

With 1M indexes

processWithdrawals           R (avg 16 max 16384) validators, W 16 balances
processProposerSlashing      W (0:16) validators
processAttesterSlashing      W (0:~500) validators
processAttestations          W 31250 participation bytes
processDeposit               W (0:~700) validators
processVoluntaryExit         W (0:16) validators
processBlsToExecutionChange  W (0:16) validators
processSyncAggregate         W 512 balances
+ hash

Min expected size diff: 35 KB
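The 35 KB figure is consistent with the write counts above (a plausible accounting, assuming ~1M/32 attesters per slot):

```typescript
// Minimum expected per-block state diff with 1M indexes (illustrative).
const attesters = 1_000_000 / 32;       // validators attesting per slot
const participation = attesters * 1;    // processAttestations: 1 byte per flag
const balances = (16 + 512) * 8;        // withdrawals + sync aggregate, uint64 each
const totalKB = (participation + balances) / 1000;
console.log(totalKB.toFixed(1), "KB");  // ≈ 35.5 KB
```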

Epoch processing

processJustificationAndFinalization  R 1M participation, active_balance
processInactivityUpdates             R 1M participation + W (0:1M) inactivity 
processRewardsAndPenalties           R 1M participation, active_balance, inactivity + W 1M balances
processRegistryUpdates               R 1M validators + W (0:20) validators
processSlashings                     R 1M validators + W (0:1M) validators
processEth1DataReset                 -
processEffectiveBalanceUpdates       R 1M balances + W (0:1M) validators
processSlashingsReset                -
processRandaoMixesReset              -
processHistoricalSummariesUpdate     -
processParticipationFlagUpdates      -
processSyncCommitteeUpdates          shuffling
+ hash
+ shufflings

Min expected size diff: 9 MB
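One plausible breakdown of the 9 MB minimum: the full balances rewrite in processRewardsAndPenalties plus the 1-byte-per-index participation rotation:

```typescript
// Minimum expected per-epoch state diff with 1M indexes (illustrative).
const N = 1_000_000;
const balances = N * 8;       // processRewardsAndPenalties writes every uint64 balance
const participation = N * 1;  // epoch_participation rotated, 1 byte per index
const totalMB = (balances + participation) / 1e6;
console.log(totalMB, "MB");   // 9 MB
```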

State as flat memory

Both the state data and the hashing cache can be represented as flat memory. Prysm and Lighthouse have been using this strategy since genesis. However, a naive implementation duplicates all content between forks, limiting the total number of states that can be held in memory at once.

But states can be represented as base layer + diff layers

                    /<------[d4] <------[d6]
[b0] <-[d1] <-[d2] <-[d3] <-------[d5]    ^ state_b
                                   ^ state_a

Then states can be represented as the sum of:

state_a = compact([b0, d1, d2, d3, d5])
state_b = compact([b0, d1, d2, d4, d6])
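A minimal sketch of this base-plus-diff scheme, with the state as a flat byte array and each diff as a sparse offset-to-value map (layout and names are illustrative):

```typescript
// Sketch of state reconstruction from a base layer plus diff layers.
// A diff records only the byte offsets that changed (illustrative).
type Diff = Map<number, number>; // byteOffset -> newByteValue

function compact(base: Uint8Array, diffs: Diff[]): Uint8Array {
  const state = base.slice(); // copy; the base stays shared between forks
  for (const diff of diffs) {
    for (const [offset, value] of diff) state[offset] = value;
  }
  return state;
}

// Two fork branches share [b0, d1, d2] and diverge afterwards:
//   state_a = compact(b0, [d1, d2, d3, d5])
//   state_b = compact(b0, [d1, d2, d4, d6])
```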

Advantages

Drawbacks

philknows commented 1 week ago

This was determined to be a large time investment and maintenance overhead, and may only yield memory improvements, with unknown side effects on the CPU. Likely there will be some tradeoffs that may not be worth the risk.

twoeths commented 1 week ago

As stated here, Bun is >5x more efficient than NodeJS in terms of memory allocation; we should give it a try.

dapplion commented 5 days ago

47 bytes per 32-byte array, that's really good! Does Lodestar actually run with Bun?

twoeths commented 5 days ago

Bun does not support napi for now (https://github.com/oven-sh/bun/issues/158), so we cannot run the beacon node with it @dapplion

We can only try with persistent-merkle-tree and benchmark the result there.
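For reference, a rough probe of per-allocation overhead that runs under both `node` and `bun` (approximate, since heap deltas depend on GC timing; `process.memoryUsage` exists in both runtimes):

```typescript
// Rough bytes-per-instance probe for 32-byte Uint8Arrays, runnable under
// both node and bun. Figures are approximate: no GC pass is forced.
function bytesPerInstance(count: number): number {
  const keep: Uint8Array[] = new Array(count);
  const before = process.memoryUsage().heapUsed;
  for (let i = 0; i < count; i++) keep[i] = new Uint8Array(32);
  const after = process.memoryUsage().heapUsed;
  // keep[] stays referenced so allocations are not collected mid-measurement
  return (after - before) / keep.length;
}

console.log(bytesPerInstance(100_000).toFixed(1), "bytes / 32-byte array");
```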