Fantom-foundation / Sonic

go-opera fork for Carmen and Tosca integration
GNU Lesser General Public License v3.0

Fantom Sonic Mainnet Archive node gets corrupted DB #167

Open tibineacsu95 opened 3 months ago

tibineacsu95 commented 3 months ago

Describe the bug: The Fantom Sonic Mainnet Archive node's DB gets corrupted.

To Reproduce: Steps to reproduce the behavior:

  1. Set up an Archive node using the recommended steps.
  2. Use a snapshot (https://files.fantom.network/mainnet-284692.tar.gz) to avoid syncing from scratch and reduce sync time.
  3. Node starts syncing from the point the snapshot is in.
  4. Database gets corrupted after a couple of hours: sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE

Expected behavior: The node is able to sync properly, without getting its DB corrupted.

Additional context: Not quite sure how to mitigate this. It's the second time we've run into such issues, on two different nodes - we changed the machines as well, thinking it might be a local storage problem.

We are using systemd, here is the service file:

(screenshot of the systemd service file, 2024-07-06)
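
The screenshot itself did not survive extraction. A hypothetical unit of the kind discussed in this thread might look like the sketch below; the paths, user, and values are placeholders, not the reporter's actual configuration:

```ini
[Unit]
Description=Sonic archive node (illustrative sketch, not the reporter's file)
After=network-online.target

[Service]
User=sonic
# Placeholder binary path and data directory.
ExecStart=/usr/local/bin/sonicd --datadir /var/lib/sonic
Environment=GOMEMLIMIT=115GiB

# Relevant to this thread: automatic restarts after a crash immediately
# re-open a dirty DB, and a short stop timeout risks SIGKILL mid-flush.
Restart=no
TimeoutStopSec=600
KillSignal=SIGTERM

[Install]
WantedBy=multi-user.target
```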

Any feedback is highly appreciated!

blockpi019 commented 3 months ago

We had the same problem too

janzhanal commented 3 months ago

Same issue with 1.2.1-d

thaarok commented 3 months ago

The error message implies your node cannot start because its database is corrupted. It probably crashed or was killed at some point, and systemd restarted it automatically. The message you quoted is produced by that subsequent run, which fails to start because the DB is already corrupted.

Can you provide logs from the original crash? They are necessary to understand what happened. Thanks!

tibineacsu95 commented 3 months ago

It looks like an OOM kill.

```
Jul 07 18:28:09 sonic01 systemd[1]: sonic.service: A process of this unit has been killed by the OOM killer.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Main process exited, code=killed, status=9/KILL
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Failed with result 'oom-kill'.
Jul 07 18:28:17 sonic01 systemd[1]: sonic.service: Consumed 1d 23h 57min 8.615s CPU time.
```

This is from a freshly synced node started from scratch 2 days ago. This time I used Restart=no in the service file. If I try to start the service back up, I get the same message (`sonicd[288420]: failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE`), although I was hoping the DB wouldn't get corrupted this time, since the service never actually restarted.

Each parameter used for the service is tuned based on the server specs (128 GB RAM and 32 CPUs):

Any advice here? Should I set the values for the limit and the cache lower? And if so, what would be the suitable numbers for this spec? Thanks in advance!
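
For reference, the percentages mentioned in this thread can be turned into concrete numbers for a 128 GB machine. This is a sketch based only on the figures quoted here (docs: GOMEMLIMIT at 90% of RAM, --cache at 40%); the unit of the --cache flag is not shown in the thread, so GB is used purely for illustration:

```shell
# Tuning values for a 128 GB machine, using the percentages quoted in
# this thread (docs guidance: GOMEMLIMIT ~90% of RAM, --cache ~40%).
# The --cache unit is assumed here; check the client docs before using.
RAM_GB=128
GOMEMLIMIT_GB=$(( RAM_GB * 90 / 100 ))   # 90% of RAM
CACHE_GB=$(( RAM_GB * 40 / 100 ))        # 40% of RAM
echo "GOMEMLIMIT=${GOMEMLIMIT_GB}GiB"
echo "cache=${CACHE_GB}GiB"
```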

janzhanal commented 3 months ago

Sorry - it's running in a container and the logs got flushed. But that means SIGTERM, and if it didn't stop within 10 seconds, then SIGKILL.

Is there any way to fix the corruption? Unclean shutdowns happen even in production environments, and having to wait multiple days for the archive genesis to be processed is really painful...

tibineacsu95 commented 2 months ago

Just for reference: updated to 1.2.1-d, tried lowering GOMEMLIMIT to 70% of the machine's total RAM (docs say it should be fine with 90%) and --cache to 25% (docs say it should be fine with 40%), but the crashes still occur with the same outcome: an OOM kill.

Link to docs - https://docs.fantom.foundation/node/tutorials/sonic-client/run-an-api-node

This was on a freshly installed machine; the systemd service is configured not to restart automatically after a crash and has a timeout of 600 seconds (which should be more than enough for the service to stop gracefully) - the DB still gets corrupted.

If I stop the service manually, using systemctl stop sonic.service, it shuts down correctly (takes about 5 minutes) and I am able to just restart it normally afterwards.

We brought up 4 nodes; they all crashed for the same reason, but at different points in time.

janzhanal commented 2 months ago

I did some testing and can confirm that both SIGTERM and SIGKILL leave the database corrupted. So the main question stands:
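
The stop sequence systemd applies (SIGTERM first, then SIGKILL once the timeout expires) can be imitated with a toy script; this is illustrative only and does not touch sonicd itself, but it shows the exact situation that leaves the DB dirty: a process still flushing when the hard kill arrives.

```shell
# Toy stand-in for a node that takes a long time to flush on shutdown:
# it traps SIGTERM and keeps working for another 30 seconds.
sh -c 'trap "sleep 30" TERM; sleep 60 & wait' &
pid=$!
sleep 1

kill -TERM "$pid"            # systemd's polite stop request
sleep 2

if kill -0 "$pid" 2>/dev/null; then
  echo "still alive after SIGTERM, escalating"
  kill -KILL "$pid"          # what systemd does after TimeoutStopSec
fi
```

With a long enough TimeoutStopSec the escalation never happens and the process can finish flushing; with a short one, SIGKILL lands mid-flush.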

insider89 commented 2 months ago

@janzhanal Is there a Docker image to run, or did you build your own? (I didn't find a Docker image for the Sonic chain, only for Opera.)

janzhanal commented 2 months ago

Building my own.

janzhanal commented 2 months ago

Hello all, today I got a corruption even though the app reported a proper closure:

```
INFO [07-20|08:02:47.143] New block                                index=86203336 id=294680:3321:af86b2  gas_used=1,832,237  txs=5/0    age=1.545s          t=7.585ms
INFO [07-20|08:02:47.448] Got interrupt, shutting down...
INFO [07-20|08:02:47.449] IPC endpoint closed                      url=/data/opera.ipc
INFO [07-20|08:02:47.449] Stopping Fantom protocol
INFO [07-20|08:02:49.040] Fantom protocol stopped
INFO [07-20|08:02:49.133] Fantom service stopped
INFO [07-20|08:02:52.898] Closing State DB...                      module=evm-store
```

```
root@ovh-us-hi-10:~# docker logs fantom-mainnet-archive-sonicd -f --tail 10
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:24.236] Maximum peer count                       total=50
INFO [07-20|08:04:24.236] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: lachesis-294680: DE
INFO [07-20|08:04:25.170] Maximum peer count                       total=50
INFO [07-20|08:04:25.170] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
failed to initialize the node: failed to make consensus engine: failed to open existing databases: dirty state: gossip: DE
```

flolege commented 2 months ago

Would also really appreciate a way to recover a dirty state db.

thaarok commented 2 months ago

@janzhanal The `Closing State DB...` log message needs to be followed by a `State DB closed` message, otherwise the app was not terminated properly. Are you sure the process was not killed, for example by the OOM killer?
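
That check can be automated. A sketch, assuming the node's output was captured to a file (the sample here is reconstructed from the log excerpt earlier in this thread, which ends at "Closing State DB..."):

```shell
# Heuristic: a "Closing State DB..." line without a matching
# "State DB closed" line means the last shutdown did not complete.
# sonic.log is a sample built from the excerpt above; point the greps
# at the real log file or `docker logs` output instead.
cat > sonic.log <<'EOF'
INFO [07-20|08:02:49.133] Fantom service stopped
INFO [07-20|08:02:52.898] Closing State DB...                      module=evm-store
EOF

if grep -q 'Closing State DB' sonic.log && ! grep -q 'State DB closed' sonic.log; then
  echo "unclean shutdown: State DB close did not complete"
fi
```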