erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io
GNU Lesser General Public License v3.0

Erigon OOM Killed - (currently trying 2.53.4) #8838

Open mdominoni opened 10 months ago

mdominoni commented 10 months ago

System information

Erigon version: 2.53.4

OS & Version: Linux / Ubuntu on AWS with 64 GB RAM

Commit hash: tag - v2.53.4

Erigon Service:

[Unit]
Description=Erigon Execution Layer Client service (Mainet)
Wants=network-online.target
After=network-online.target

[Service]
Environment="GOGC=50 GOMEMLIMIT=24GiB GOMAXPROCS=2"
MemoryLimit=24G
OOMScoreAdjust=-100
Type=simple
User=root
Restart=allways
RestartSec=5
KillSignal=SIGINT
TimeoutStopSec=300
ExecStart=/opt/erigon/build/bin/erigon \
  --datadir /opt/data/erigon \
  --chain mainnet \
  --port "30303" \
  --metrics \
  --pprof \
  --authrpc.jwtsecret "/opt/secrets/jwt.hex" \
  --http \
  --ws \
  --http.vhosts="" \
  --http.corsdomain="" \
  --http.addr="0.0.0.0" \
  --http.port "8545" \
  --http.api "eth,erigon,personal,db,admin,web3,net,trace,rpc,debug,txpool" \
  --txpool.api.addr "0.0.0.0:9094" \
  --private.api.addr "0.0.0.0:9090" \
  --batchSize=1G

[Install]
WantedBy=multi-user.target
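A side note on the unit above, offered as something to verify rather than a diagnosis: systemd's documentation for Environment= treats a double-quoted string as a single assignment, so the quoted Environment line may define only GOGC (with a value containing spaces) instead of three separate variables. The sketch below approximates that splitting with Python's shlex module, which is an assumption standing in for systemd's own parser; split_environment is a hypothetical helper, not part of any tool mentioned here.

```python
import shlex


def split_environment(value):
    """Approximate how an Environment= value is split into assignments.

    Assumption: shlex's shell-like quoting is close enough to systemd's
    rules for this simple case (double quotes group words together).
    """
    env = {}
    for token in shlex.split(value):
        # Everything before the first "=" is the variable name.
        name, _, val = token.partition("=")
        env[name] = val
    return env


# The value exactly as written in the unit file (one quoted string).
as_written = split_environment('"GOGC=50 GOMEMLIMIT=24GiB GOMAXPROCS=2"')

# The same assignments without the surrounding quotes.
separate = split_environment('GOGC=50 GOMEMLIMIT=24GiB GOMAXPROCS=2')

print(as_written)
print(separate)
```

If that reading is right, dropping the outer quotes (or quoting each assignment on its own) would yield three variables. Inspecting the running process's environment, for example through /proc/<pid>/environ, would confirm what Erigon actually receives.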

Consensus Layer: Lighthouse v4.5.0-441fc16

Consensus Service:

[Unit]
Description=Lighthouse Consensus Layer Client BN (Mainet)
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=root
Restart=allways
RestartSec=5
KillSignal=SIGINT
TimeoutStopSec=300
ExecStart=/usr/local/bin/lighthouse bn \
  --network mainnet \
  --datadir "/opt/data/lighthouse" \
  --execution-endpoint http://localhost:8551 \
  --execution-jwt "/opt/secrets/jwt.hex" \
  --checkpoint-sync-url https://mainnet.checkpoint.sigp.io \
  --disable-deposit-contract-sync \
  --reconstruct-historic-states \
  --metrics

[Install]
WantedBy=multi-user.target

Chain/Network: mainnet

Expected behaviour: Node syncs properly after the version upgrade.

Actual behaviour: After a couple of hours in sync, Erigon gets killed by the OOM killer.

Steps to reproduce the behaviour: Full sync on v2.51.0, then upgrade to v2.53.4.

Backtrace: N/A

Executed:

go tool pprof -inuse_space -png http://127.0.0.1:6060/debug/pprof/heap > mem.png

[Image: mem]

AskAlexSharov commented 10 months ago

This mem.png shows that everything is fine: the heap is using the expected 3 GB.

mdominoni commented 10 months ago

OK, but the OOM kill is still happening. Is there anything else I can do to stop it from happening all the time?

[Image: Screenshot from 2023-11-27 14-30-14]

dmesg shows:

[210146.815414] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=eth1.service,mems_allowed=0,oom_memcg=/system.slice/eth1.service,task_memcg=/system.slice/eth1.service,task=erigon,pid=7926,uid=0
[210146.815570] Memory cgroup out of memory: Killed process 7926 (erigon) total-vm:5312414528kB, anon-rss:20685544kB, file-rss:2650224kB, shmem-rss:0kB, UID:0 pgtables:4081092kB oom_score_adj:-100
[210148.956419] oom_reaper: reaped process 7926 (erigon), now anon-rss:0kB, file-rss:1958520kB, shmem-rss:0kB
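The kB figures in the kill message are easier to compare with the 24G MemoryLimit once converted to GiB. The following is a minimal sketch of that arithmetic, assuming the field layout shown in the dmesg output above; oom_fields_gib is a hypothetical helper written for illustration.

```python
import re

KIB_PER_GIB = 1024 * 1024  # the kernel reports these counters in kB (KiB)


def oom_fields_gib(line):
    """Extract the memory counters from an OOM-kill line and convert to GiB."""
    pairs = re.findall(
        r"(total-vm|anon-rss|file-rss|shmem-rss|pgtables):(\d+)kB", line
    )
    return {name: int(kb) / KIB_PER_GIB for name, kb in pairs}


# The "Killed process" line copied from the dmesg output in this thread.
kill_line = (
    "Memory cgroup out of memory: Killed process 7926 (erigon) "
    "total-vm:5312414528kB, anon-rss:20685544kB, file-rss:2650224kB, "
    "shmem-rss:0kB, UID:0 pgtables:4081092kB oom_score_adj:-100"
)

fields = oom_fields_gib(kill_line)
for name, gib in fields.items():
    print(f"{name}: {gib:.2f} GiB")
```

With the values from this report, anon-rss comes to about 19.7 GiB, file-rss to about 2.5 GiB, and pgtables to about 3.9 GiB, whereas the Go heap profile and the alloc log field in this thread report roughly 3 GB. The thread does not establish what accounts for the gap.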

AskAlexSharov commented 10 months ago

And what does alloc show in the logs before the kill?

AskAlexSharov commented 10 months ago

Try to capture a heap profile when alloc > 5 GB.

mdominoni commented 10 months ago

[txpool] stat pending=9964 baseFee=0 queued=5125 alloc=3.1GB sys=7.5GB
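The suggestion to profile only once alloc exceeds 5 GB can be checked mechanically against log lines like the one above. This is a rough sketch, assuming the alloc= and sys= fields always carry a KB/MB/GB/TB suffix as in that line; parse_mem_fields and should_capture_profile are hypothetical names, and the decimal unit table is likewise an assumption.

```python
import re

# Assumed decimal units; only "GB" actually appears in the thread.
UNIT_TO_GB = {"KB": 1e-6, "MB": 1e-3, "GB": 1.0, "TB": 1e3}


def parse_mem_fields(line):
    """Return the alloc and sys values of a stat log line, in GB."""
    out = {}
    for key, num, unit in re.findall(r"\b(alloc|sys)=([\d.]+)([KMGT]B)", line):
        out[key] = float(num) * UNIT_TO_GB[unit]
    return out


def should_capture_profile(line, threshold_gb=5.0):
    """True when the line reports alloc above the threshold."""
    alloc = parse_mem_fields(line).get("alloc")
    return alloc is not None and alloc > threshold_gb


# The log line quoted in this thread.
reported = "[txpool] stat pending=9964 baseFee=0 queued=5125 alloc=3.1GB sys=7.5GB"

print(parse_mem_fields(reported))
print(should_capture_profile(reported))
```

For the reported line, alloc is 3.1 GB, so the check stays false. A watcher built on this idea could run the go tool pprof command from the original report the first time the check turns true.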

[Image: mem]

AskAlexSharov commented 10 months ago

Unfortunately, this picture looks healthy.

luarx commented 7 months ago

Just to clarify, is it normal that 64 GB of RAM is not enough to run Erigon?