Open tibineacsu95 opened 3 months ago
We had the same problem too
Same issue with 1.2.1-d
The error message implies that your node cannot start because its database is corrupted. It probably crashed or was killed at some point, and systemd restarted it automatically. The message you describe comes from that subsequent run, which fails to start because the DB is already corrupted.
Can you provide logs from the original crash? They are necessary to understand what happened. Thanks!
It looks like an OOM kill.
Jul 07 18:28:09 sonic01 systemd[1]: sonic.service: A process of this unit has been killed by the OOM killer.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Main process exited, code=killed, status=9/KILL
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Failed with result 'oom-kill'.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Consumed 1d 23h 57min 8.615s CPU time.
This is from a freshly synced node started 2 days ago from scratch. This time I used Restart=no in the service file. If I try to start the service back up, I get the same message:
sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE
I was hoping the DB wouldn't get corrupted this time, since the service never actually got to restart.
Each parameter used for the service is tuned based on the server specs (128 GB RAM and 32 CPUs):
Any advice here? Should I set the values for the limit and the cache lower? And if so, what would be the suitable numbers for this spec? Thanks in advance!
I'm sorry, I'm running in a container and the logs got flushed. But the stop sequence is SIGTERM, and if the process didn't stop within 10 seconds, SIGKILL.
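For anyone who wants to rule out the 10-second window as the cause, here is a minimal Python sketch of the same SIGTERM-then-SIGKILL sequence with a configurable grace period. The helper name stop_gracefully and the grace value are made up for illustration, and this is not how Docker or sonicd implement it. For containers, the equivalent knob is the -t option of docker stop.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float) -> bool:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns True if the process exited on its own after SIGTERM,
    False if it had to be killed (hypothetical helper, for illustration).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Grace period expired: this is the point where data loss can happen.
        proc.kill()
        proc.wait()
        return False
```

A manual stop reportedly takes about 5 minutes on this node, so a grace period of 10 seconds would almost always end in the SIGKILL branch.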
Is there any way to fix the corruption? Unclean shutdowns happen even in production environments, and having to wait multiple days for the archive genesis to be processed is really painful...
Just for reference: I updated to 1.2.1-d and tried lowering GOMEMLIMIT to 70% of the machine's total RAM (docs say 90% should be fine) and --cache to 25% (docs say 40% should be fine), but the crashes still occur. Same outcome: OOM kill.
Link to docs - https://docs.fantom.foundation/node/tutorials/sonic-client/run-an-api-node
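To make those percentages concrete, here is a minimal Python sketch that turns total RAM into candidate values. It assumes the machine has 128 GiB and that --cache is expressed in MiB (worth double-checking against the docs); the helper name memory_settings is made up for illustration. It only does the arithmetic and says nothing about whether these values avoid the OOM kill.

```python
def memory_settings(total_ram_gib: int, memlimit_pct: int, cache_pct: int):
    """Return (GOMEMLIMIT string, cache size in MiB) for the given percentages.

    Hypothetical helper: integer math only, rounded down to whole MiB.
    """
    total_mib = total_ram_gib * 1024
    memlimit_mib = total_mib * memlimit_pct // 100
    cache_mib = total_mib * cache_pct // 100
    # The Go runtime accepts GOMEMLIMIT with a unit suffix such as MiB or GiB.
    return f"{memlimit_mib}MiB", cache_mib


if __name__ == "__main__":
    # Percentages tried in this thread (70% / 25%) and the ones
    # the docs are said to recommend (90% / 40%).
    for memlimit_pct, cache_pct in [(70, 25), (90, 40)]:
        gomemlimit, cache = memory_settings(128, memlimit_pct, cache_pct)
        print(f"GOMEMLIMIT={gomemlimit} --cache {cache}")
```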
This was on a freshly installed machine, and the systemd service is configured to not restart automatically in case of any crash, and has a timeout set to 600 seconds (which should be more than enough for the service to stop gracefully) - the DB still gets corrupted.
If I stop the service manually, using systemctl stop sonic.service, it shuts down correctly (takes about 5 minutes) and I am able to restart it normally afterwards.
We brought up 4 nodes, and they all crashed for the same reason, but at different points in time.
I did some testing and can confirm that both SIGTERM and SIGKILL leave the database corrupted. So the main question stands:
@janzhanal Is there a Docker image to run, or do you build your own? (I didn't find a Docker image for the Sonic chain, only for Opera.)
Building my own.
Hello all, today I got a corruption even though the app reported a proper shutdown:
INFO [07-20|08:02:47.143] New block index=86203336 id=294680:3321:af86b2 gas_used=1,832,237 txs=5/0 age=1.545s t=7.585ms
INFO [07-20|08:02:47.448] Got interrupt, shutting down...
INFO [07-20|08:02:47.449] IPC endpoint closed url=/data/opera.ipc
INFO [07-20|08:02:47.449] Stopping Fantom protocol
INFO [07-20|08:02:49.040] Fantom protocol stopped
INFO [07-20|08:02:49.133] Fantom service stopped
INFO [07-20|08:02:52.898] Closing State DB... module=evm-store
root@ovh-us-hi-10:~# docker logs fantom-mainnet-archive-sonicd -f --tail 10
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:24.236] Maximum peer count total=50
INFO [07-20|08:04:24.236] Smartcard socket not found, disabling err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:25.170] Maximum peer count total=50
INFO [07-20|08:04:25.170] Smartcard socket not found, disabling err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE
Would also really appreciate a way to recover a dirty state db.
@janzhanal The "Closing State DB..." log message needs to be followed by a "State DB closed" message, otherwise the app was not terminated properly. Are you sure the process was not killed, by the OOM killer for example?
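That check can be automated before restarting a node. Below is a minimal Python sketch that scans captured log lines for the two messages quoted above; the function name shutdown_was_clean is made up for illustration, and matching on these substrings is an assumption about the log format, not something sonicd provides.

```python
from typing import Iterable


def shutdown_was_clean(log_lines: Iterable[str]) -> bool:
    """Return True only if the last "Closing State DB" line is followed
    by a "State DB closed" line (hypothetical helper, substring match)."""
    closing_seen = False
    closed = False
    for line in log_lines:
        if "Closing State DB" in line:
            # A new shutdown attempt starts; forget earlier results.
            closing_seen = True
            closed = False
        elif "State DB closed" in line and closing_seen:
            closed = True
    return closing_seen and closed


if __name__ == "__main__":
    # Tail of the log posted earlier in this thread: the closing message
    # appears, but nothing confirms that the close finished.
    tail = [
        "INFO [07-20|08:02:49.133] Fantom service stopped",
        "INFO [07-20|08:02:52.898] Closing State DB... module=evm-store",
    ]
    print(shutdown_was_clean(tail))
```

Feeding it the output of docker logs or journalctl for the previous run would flag the log above as an unclean shutdown.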
Describe the bug
The Fantom Sonic Mainnet Archive node ends up with a corrupted DB.
To Reproduce
Steps to reproduce the behavior:
sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE
Expected behavior
The node is able to sync properly, without its DB getting corrupted.
Desktop (please complete the following information):
Additional context
Not quite sure how to mitigate this. It's the second time we're running into this issue, on two different nodes. We changed the machines as well, thinking it might be a local storage problem.
We are using systemd, here is the service file:
Any feedback is highly appreciated!