- **Added Metrics**: Implemented metrics to gain visibility into memory utilization.
- **Optimized Allocations**: Optimized object and slice allocations, resulting in significant improvements in memory usage.
```
File: erigon
Type: inuse_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 25.92GB, 94.14% of 27.53GB total
Dropped 785 nodes (cum <= 0.14GB)
Showing top 10 nodes out of 70
      flat  flat%   sum%        cum   cum%
   15.37GB 55.82% 55.82%    15.37GB 55.82%  github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).AddLogs.func1
    8.15GB 29.60% 85.42%     8.15GB 29.60%  github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).AddPendingTxs.func1
    0.64GB  2.33% 87.75%        1GB  3.65%  github.com/ledgerwatch/erigon/turbo/rpchelper.(*LogsFilterAggregator).distributeLog
    0.50GB  1.82% 89.57%     0.65GB  2.35%  github.com/ledgerwatch/erigon/core/state.(*stateObject).GetCommittedState
    0.36GB  1.32% 90.89%     0.36GB  1.32%  github.com/ledgerwatch/erigon/turbo/rpchelper.(*LogsFilterAggregator).distributeLog.func1
    0.31GB  1.12% 92.01%     0.31GB  1.12%  bytes.growSlice
    0.30GB  1.08% 93.09%     0.30GB  1.08%  github.com/ledgerwatch/erigon/core/vm.(*Memory).Resize (inline)
    0.15GB  0.56% 93.65%     0.35GB  1.27%  github.com/ledgerwatch/erigon/core/state.(*IntraBlockState).AddSlotToAccessList
    0.07GB  0.24% 93.90%     0.23GB  0.84%  github.com/ledgerwatch/erigon/core/types.codecSelfer2.decLogs
    0.07GB  0.24% 94.14%     1.58GB  5.75%  github.com/ledgerwatch/erigon/core/vm.(*EVMInterpreter).Run
(pprof) list AddLogs.func1
Total: 27.53GB
ROUTINE ======================== github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).AddLogs.func1 in github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go
   15.37GB    15.37GB (flat, cum) 55.82% of Total
         .          .    644:	ff.logsStores.DoAndStore(id, func(st []*types.Log, ok bool) []*types.Log {
         .          .    645:		if !ok {
         .          .    646:			st = make([]*types.Log, 0)
         .          .    647:		}
   15.37GB    15.37GB    648:		st = append(st, logs)
         .          .    649:		return st
         .          .    650:	})
         .          .    651:}
         .          .    652:
         .          .    653:// ReadLogs reads logs from the store associated with the given subscription ID.
(pprof) list AddPendingTxs.func1
Total: 27.53GB
ROUTINE ======================== github.com/ledgerwatch/erigon/turbo/rpchelper.(*Filters).AddPendingTxs.func1 in github.com/ledgerwatch/erigon/turbo/rpchelper/filters.go
    8.15GB     8.15GB (flat, cum) 29.60% of Total
         .          .    686:	ff.pendingTxsStores.DoAndStore(id, func(st [][]types.Transaction, ok bool) [][]types.Transaction {
         .          .    687:		if !ok {
         .          .    688:			st = make([][]types.Transaction, 0)
         .          .    689:		}
    8.15GB     8.15GB    690:		st = append(st, txs)
         .          .    691:		return st
         .          .    692:	})
         .          .    693:}
         .          .    694:
         .          .    695:// ReadPendingTxs reads pending transactions from the store associated with the given subscription ID.
(pprof)
```
Identified unbounded slices that could grow indefinitely, leading to memory leaks. This occurs when subscribers do not request updates using their subscription ID, especially behind a load balancer that does not pin clients to specific RPC nodes.
Architecture Design Changes
Implement architectural changes to pin clients to RPC nodes, ensuring that subscribers requesting updates with their subscription IDs always hit the same node, which cleans up the objects on each request.
Reason for not choosing: This still relies on subscribed clients requesting updates; if they never do and never call the unsubscribe function, it remains an unbounded memory leak. The RPC node operator has no control over client behavior.
Implementing Timeouts
Introduce timeouts for subscriptions to automatically clean up unresponsive or inactive subscriptions.
Reason for not choosing: Due to the current design of the code, timeouts would inadvertently terminate WebSocket subscriptions, leading to potential data loss for active users. This solution would be good but requires substantially more work.
Configurable Limits (Chosen Solution)
Set configurable limits for various subscription parameters (e.g., logs, headers, transactions, addresses, topics) to manage memory utilization effectively.
Reason for choosing: This approach provides flexibility to RPC node operators to configure limits as per their requirements. The default behavior remains unchanged, making it a non-breaking change. Additionally, it ensures current data is always available by pruning the oldest data first.
Relates to https://github.com/erigontech/erigon/issues/11890. Cherry pick from E3: https://github.com/erigontech/erigon/commit/adf3f438d8aae4c749e5ddacb3fe08afd6e695e7

Added a `GetOrCreateGaugeVec` function to manage `PrometheusGaugeVec` metrics, introduced `metrics.FiltersConfig`, and added comprehensive tests for the new filter limits.

RPC nodes would frequently OOM. Based on metrics and pprof, I observed high memory utilization in the rpchelper, as shown in the profile above.
With these changes, the memory utilization has significantly improved, and the system is now more stable and predictable.
<img width="1567" alt="image" src="https://github.com/ledgerwatch/erigon/assets/787344/264c9757-93cf-4be8-9883-5ca3187acd73">
The blue line indicates a deployment; the sharp drop of the green line is an OOM.