erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io
GNU Lesser General Public License v3.0

OOM crash in 16 GB during LogIndex stage #9526

Open battlmonstr opened 6 months ago

battlmonstr commented 6 months ago

System information

Erigon version: ./erigon --version

v2.57.1

OS & Version: Windows/Linux/OSX

Linux

Commit hash:

9f1cd651f0b1b443b4bd96eaed84502c149fdca2

Erigon Command (with flags/config):

--chain=mainnet
--prune=htrc
--batchSize=128M
--db.size.limit=1TB
--internalcl
--metrics
--pprof

Consensus Layer:

caplin

Consensus Layer Command (with flags/config):

--internalcl

Chain/Network:

mainnet

Expected behaviour

No crash.

Actual behaviour

Crash.

Steps to reproduce the behaviour

Sync from scratch until stage 10/12 LogIndex.

Backtrace

Latest DEBUG log lines before the crash:

[INFO] [02-26|15:56:07.925] [10/12 LogIndex] Progress                number=18779613 alloc=6.7GB sys=14.2GB
[INFO] [02-26|15:56:09.176] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-55986105
[INFO] [02-26|15:56:38.275] [10/12 LogIndex] Progress                number=18789219 alloc=9.0GB sys=14.2GB
[INFO] [02-26|15:57:08.765] [10/12 LogIndex] Progress                number=18799392 alloc=11.2GB sys=14.2GB
[INFO] [02-26|15:57:15.497] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-4208647017
[INFO] [02-26|15:57:17.390] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-3571994502
battlmonstr commented 6 months ago

@AskAlexSharov is there something like --batchSize for LogIndex?

battlmonstr commented 6 months ago

One more crash around block 12.5M:

[INFO] [02-27|15:22:31.915] [10/12 LogIndex] Progress                number=12568622 alloc=10.7GB sys=13.9GB
[INFO] [02-27|15:23:01.912] [10/12 LogIndex] Progress                number=12577585 alloc=10.3GB sys=13.9GB
battlmonstr commented 6 months ago

Heap dump before the crash: [heap profile screenshot]

battlmonstr commented 6 months ago

This is a dump taken 5 minutes before the crash, for comparison: [heap profile screenshot]

They look very similar. Maybe the problem is on the mdbx side, not in the Go heap?

AskAlexSharov commented 6 months ago

--internalcl - I see SpawnHistoryDownload on your picture: it seems to be happening in the background and eating ~1GB. I guess it could eat less, improve its mem-limit, or adapt to the total RAM on the machine.
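
For illustration only, not Erigon's actual backfill code: a minimal Go sketch of the kind of byte-budgeted batching suggested here, where the in-memory batch is flushed once it reaches a fixed budget instead of growing with whatever RAM the machine has. The names fetchNext, persist, and maxBatchBytes are hypothetical stand-ins.

// Sketch only: bound the memory of a background download by flushing
// the accumulated batch whenever it exceeds a fixed byte budget.
package history

type block struct{ raw []byte }

func backfillBounded(fetchNext func() (block, bool), persist func([]block) error, maxBatchBytes int) error {
	var batch []block
	batchBytes := 0
	for {
		b, ok := fetchNext()
		if !ok {
			break
		}
		batch = append(batch, b)
		batchBytes += len(b.raw)
		if batchBytes >= maxBatchBytes { // e.g. 256 MiB instead of ~1 GiB
			if err := persist(batch); err != nil {
				return err
			}
			batch, batchBytes = batch[:0], 0
		}
	}
	if len(batch) == 0 {
		return nil
	}
	return persist(batch)
}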

AskAlexSharov commented 6 months ago

you can prove it by running stage_log_index without the other erigon parts: integration stage_log_index
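
For reference, a rough invocation of the integration tool against the existing datadir (the path is a placeholder; adjust to your setup):

go run ./cmd/integration stage_log_index --datadir=<your-datadir>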

battlmonstr commented 6 months ago

@AskAlexSharov Yeah, at the time of the crash I saw something in the logs about the history downloading. I ran the integration stage offline successfully. After erigon restarted, it went to 12/12 Finish 🎉.

AskAlexSharov commented 6 months ago

@Giulio2002 hi, please take a look at whether it's possible to put a stricter RAM limit on the history download.

pngwerks commented 5 months ago

Also seeing an OOM kill during the LogIndex stage with 16 GB of memory and GOMEMLIMIT=13GiB.

From journalctl:

Mar 18 03:50:01 ethnode kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/supervisor.service,task=erigon,pid=952888,uid=1001
Mar 18 03:50:01 ethnode kernel: Out of memory: Killed process 952888 (erigon) total-vm:17215695868kB, anon-rss:11155356kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:3872040kB oom_score_adj:0
Mar 18 03:50:01 ethnode systemd[1]: supervisor.service: A process of this unit has been killed by the OOM killer.

@Giulio2002 hi, please take a look at whether it's possible to put a stricter RAM limit on the history download.

What is the command option for this? I couldn't find it in the manual.
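
As context rather than a fix: GOMEMLIMIT is a soft limit. It only covers memory the Go runtime manages and never refuses allocations, so with mdbx's mmap'd pages and other non-Go memory on a 16 GB host, a 13 GiB limit can still end in a kernel OOM kill. A minimal sketch of what the variable configures, using the standard runtime/debug API (the 13 GiB value just mirrors the setting above):

package main

import "runtime/debug"

func main() {
	const gib = int64(1) << 30
	// Equivalent to starting the process with GOMEMLIMIT=13GiB. This is a
	// soft target: the GC works harder as it is approached, but allocations
	// are never refused and non-Go memory is not counted, so leaving more
	// headroom (say 10-11 GiB on a 16 GiB host) reduces OOM-kill risk.
	debug.SetMemoryLimit(13 * gib)
}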

AskAlexSharov commented 5 months ago

This PR may help: https://github.com/ledgerwatch/erigon/pull/9814