Open MrFreezeDZ opened 1 month ago
This was our issue before with Erigon in version 2.59.3: https://github.com/erigontech/erigon/issues/10873
@MrFreezeDZ hey, to help us troubleshoot this can you please:
git checkout e2-polgyon-headers-stage-oom
make erigon
Then set SAVE_HEAP_PROFILE=true and start Erigon: export SAVE_HEAP_PROFILE=true && ./build/bin/erigon ...
This branch will try to capture a heap profile when Erigon is near OOM during the stage headers "Rejected header marked as bad" situation, and will save it in $TMPDIR/erigon-mem.prof. You can run echo $TMPDIR to find your OS tmpdir.
When this issue re-occurs, please send us that file. Alternatively, you can send us a PNG generated with go tool pprof -png $TMPDIR/erigon-mem.prof > erigon-mem-prof.png
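For readers unfamiliar with heap profiles, here is a minimal Go sketch of the general technique described above: check whether memory use is close to a limit and write a heap profile to the temp directory. It is not Erigon's actual implementation. The helper names (nearLimit, saveHeapProfile), the 90% threshold, the file name example-mem.prof and the use of the 208GiB container limit are all assumptions made for illustration; only runtime/pprof and the SAVE_HEAP_PROFILE variable name come from Go's standard library and this thread respectively.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
)

// limitBytes is a hypothetical memory limit (the 208GiB container limit from this issue).
const limitBytes uint64 = 208 << 30

// nearLimit reports whether used is at or above the given fraction of limit.
// A zero limit means "no limit known", so it never triggers.
func nearLimit(used, limit uint64, fraction float64) bool {
	if limit == 0 {
		return false
	}
	return float64(used) >= fraction*float64(limit)
}

// saveHeapProfile writes the current heap profile to path.
func saveHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // refresh allocation statistics before profiling
	return pprof.WriteHeapProfile(f)
}

func main() {
	// Assumption: the feature is opt-in via the environment variable named in this thread.
	if os.Getenv("SAVE_HEAP_PROFILE") != "true" {
		fmt.Println("SAVE_HEAP_PROFILE is not true; skipping heap profile")
		return
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("heap alloc: %d bytes, near limit: %v\n",
		m.HeapAlloc, nearLimit(m.HeapAlloc, limitBytes, 0.9))

	// For demonstration the profile is written unconditionally.
	path := filepath.Join(os.TempDir(), "example-mem.prof")
	if err := saveHeapProfile(path); err != nil {
		fmt.Fprintln(os.Stderr, "could not save heap profile:", err)
		os.Exit(1)
	}
	fmt.Println("heap profile written to", path)
}
```

The resulting file can be inspected with go tool pprof in the same way as the erigon-mem.prof file mentioned above.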
@MrFreezeDZ this change is now in v2.60.6, so you can update to that and won't need to use the branch I mentioned in my previous message.
Were you able to run Erigon with export SAVE_HEAP_PROFILE=true && ./build/bin/erigon and reproduce the OOM again?
Hi, I just came back from holidays :) Right now we were not able to reproduce the behavior with the enabled flag. Just to be clear, we are not able to actively reproduce the behavior. We just have the problem from time to time. We had it four weeks in a row on the weekends. Then we had two weeks without any problem and on the last weekend it happened again. I will use the environment variable SAVE_HEAP_PROFILE=true in our next deployment with version 2.60.6, but this will take some time.
Recently I saw:
[WARN] [08-30|02:08:46.608] [7/9 Execution] Execution failed block=61101954 txNum=4495312603 hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 err="invalid block, txnIdx=59, gas used by execution: 10771434, in header: 18160412, headerNum=61101954, 24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413"
[INFO] [08-30|02:08:46.608] [7/9 Execution] Done blk=61101953 blks=1 blk/s=7.0 txs=104 tx/s=729 gas/s=123.37M buf=114.8KB/2.0GB stepsInDB=0.00 step=2877.0 alloc=22.3GB sys=34.5GB
[EROR] [08-30|02:08:46.640] Staged Sync err="bad block unwinding"
[INFO] [08-30|02:08:47.141] [2/9 Headers] Waiting for headers... from=61101953 hash=0xab87f6a3a5a1c8c400a22c9fef0f0c6c54e99764f496b32768af78c7661f00bb
[WARN] [08-30|02:08:47.382] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.383] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.383] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.407] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.423] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.425] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.451] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.465] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.517] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.551] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.571] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.571] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.572] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
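To quantify how often the same header is rejected in an excerpt like the one above, a small Go sketch can count the "Rejected header marked as bad" lines per hash and height. The regular expression is an assumption based only on the log format shown in this thread, and countRejections is a hypothetical helper, not part of Erigon.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rejectedRe matches the warning lines shown in this issue and captures hash and height.
var rejectedRe = regexp.MustCompile(
	`Rejected header marked as bad\s+hash=(0x[0-9a-fA-F]+)\s+height=(\d+)`)

// countRejections returns, per "hash@height" key, how many rejection lines occur in logText.
func countRejections(logText string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(logText, "\n") {
		if m := rejectedRe.FindStringSubmatch(line); m != nil {
			counts[m[1]+"@"+m[2]]++
		}
	}
	return counts
}

func main() {
	logText := `[INFO] [08-30|02:08:47.141] [2/9 Headers] Waiting for headers... from=61101953 hash=0xab87f6a3a5a1c8c400a22c9fef0f0c6c54e99764f496b32768af78c7661f00bb
[WARN] [08-30|02:08:47.382] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.383] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954`

	for key, n := range countRejections(logText) {
		fmt.Printf("%s rejected %d times\n", key, n)
	}
}
```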
@taratorio so, to reproduce - you just need something like this: https://github.com/erigontech/erigon/pull/11799
System information
Erigon version from the logs:
OS & Version: Kubernetes, Image from here: https://hub.docker.com/r/thorax/erigon The Erigon container has resource limits of 48 CPUs and 208GiB Memory.
Erigon Command (with flags/config):
Consensus Layer: Heimdall 1.0.7
Consensus Layer Command (with flags/config): /usr/bin/heimdalld start --home=/heimdall-home
Chain/Network: bor-mainnet
Expected behaviour
When Erigon realizes that a header is rejected and that the same header keeps being presented again and again, it should unwind itself by some blocks, I guess.
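The expected behaviour could be sketched roughly as follows in Go: track consecutive rejections of the same header hash and signal once a threshold is reached, at which point the caller could unwind or back off instead of retrying. This is a hypothetical illustration only. The badHeaderTracker type, its observe method and the threshold of 3 are assumptions and do not reflect how Erigon handles bad headers.

```go
package main

import "fmt"

// badHeaderTracker counts consecutive rejections of the same header hash.
type badHeaderTracker struct {
	lastHash  string
	count     int
	threshold int
}

// observe records one rejection of hash and reports true once that hash
// has been rejected at least threshold times in a row.
func (b *badHeaderTracker) observe(hash string) bool {
	if hash != b.lastHash {
		// A different header was rejected: start counting from scratch.
		b.lastHash = hash
		b.count = 0
	}
	b.count++
	return b.count >= b.threshold
}

func main() {
	tracker := &badHeaderTracker{threshold: 3}
	bad := "0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413"
	for i := 1; i <= 4; i++ {
		if tracker.observe(bad) {
			fmt.Printf("rejection %d: threshold reached, caller could unwind or back off\n", i)
		} else {
			fmt.Printf("rejection %d: below threshold\n", i)
		}
	}
}
```

Resetting the counter when a different hash arrives keeps a single stray bad header from triggering the signal; only a header that is offered repeatedly does.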
Actual behaviour
Erigon starts spamming log messages and uses more memory than the container's memory limit allows. Kubernetes then OOMKills the container and restarts it. After Erigon's restart the same messages occur until the next OOMKill. Erigon logs lines similar to this with only a few milliseconds between each line until it is OOMKilled:
Steps to reproduce the behaviour
I do not know how to actively reproduce this behavior. It occurred on the last three weekends.
Backtrace
These are just the logs copied from Google's LogExplorer. Here one can see the restart and the beginning of the "rejected header" logs.