IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL
Apache License 2.0

Right way to shut down db-sync #1010

Open dorijan opened 2 years ago

dorijan commented 2 years ago

Hi to all, what is the right way to shut down db-sync? The last few times I had to shut down my servers, there were always blocks that needed to be deleted, and this takes time: roughly 1 minute per 100 blocks, so at least 30-40 minutes for the deletion, and after that about 1 hour to sync back up to date. So what is the right way to shut down db-sync? I tried docker stop and docker-compose down, and in both cases I got this block deletion.

and from log:

Deleting 3157 blocks up to slot 50126122
Deleting 3628 blocks up to slot 50136038

Thank you

kderme commented 2 years ago

There's unfortunately not much you can do currently.

We should install signal handlers for this, which create a disk snapshot before exiting.
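
For illustration only, a minimal sketch of what such a handler could look like, using System.Posix.Signals from the unix package; runDbSync and takeStateSnapshot are hypothetical placeholders, not real db-sync functions:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

main :: IO ()
main = do
  shutdown <- newEmptyMVar
  -- On SIGTERM/SIGINT, ask the main thread to snapshot and exit cleanly
  -- instead of being killed with un-flushed volatile data still in memory.
  let onSignal = Catch (putMVar shutdown ())
  _ <- installHandler sigTERM onSignal Nothing
  _ <- installHandler sigINT  onSignal Nothing
  _ <- forkIO runDbSync   -- hypothetical: the normal sync loop
  takeMVar shutdown       -- block until a signal arrives
  takeStateSnapshot       -- write a ledger snapshot to --state-dir before exit

-- Placeholders, not real db-sync functions:
runDbSync :: IO ()
runDbSync = putStrLn "syncing ..."

takeStateSnapshot :: IO ()
takeStateSnapshot = putStrLn "writing ledger state snapshot before exit ..."
```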

dorijan commented 2 years ago

Ok, I understand! Is there a way to minimize the number of blocks that will be deleted? I mean, at what point in the db-sync process is it best to terminate? I would like to create disk images for later use when I need them, so I would like to minimize deletion time. Thank you.

kderme commented 2 years ago

Snapshot files are periodically added to the folder you specify with --state-dir. When there is a fresh snapshot, that is the best time to stop db-sync to minimize rollback time after restarting.
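
As a rough aid (not part of db-sync), here is a small Haskell snippet that finds the newest .lstate snapshot file in the state directory, so you can check how fresh the latest snapshot is before stopping; the /var/lib/cdbsync path is just an example of a --state-dir location (it matches the one that appears in the logs later in this thread):

```haskell
import Data.List (isSuffixOf, sortOn)
import Data.Ord (Down (Down))
import System.Directory (getModificationTime, listDirectory)
import System.FilePath ((</>))

-- Print the most recently written ledger state snapshot in the state dir.
-- A fresh snapshot means a restart should only need a small rollback.
main :: IO ()
main = do
  let stateDir = "/var/lib/cdbsync"   -- assumed --state-dir, adjust as needed
  files <- filter (".lstate" `isSuffixOf`) <$> listDirectory stateDir
  stamped <- mapM (\f -> (,) f <$> getModificationTime (stateDir </> f)) files
  case sortOn (Down . snd) stamped of
    []            -> putStrLn "no snapshot files found"
    (name, t) : _ -> putStrLn ("newest snapshot: " ++ name ++ " written at " ++ show t)
```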

For release 12.0.0, new snapshots are taken:

erikd commented 2 years ago

Directly after a message like Deleting 3157 blocks up to slot 50126122 you should see a message Blocks deleted.

If you subtract the earlier time stamp from the later one, you get the time required for the rollback. Assuming you have the database set up correctly, this should not take as much as 30 minutes.

dorijan commented 2 years ago

Yes, that is about right (30 minutes), but I am not sure why it was 3157 blocks and then 3628, if you said that snapshots are taken every 500 blocks, since my node and everything was fully synced and following the node tip. Shouldn't it then be less than 500? This happened 2 times in a row when I did docker stop db-sync, so I would say it is not a coincidence?

I am not sure why it needs to delete these blocks, but maybe it would be OK to delete just the last 10 blocks or so to be sure, and only check whether the rest of the data is correct?

I just had an issue where I needed new servers ASAP that were on standby due to an increase in traffic, but could not get them up in time, so I am looking for a way to mitigate this problem. I have running servers, and I can stop them, clone them and have them working quickly, but this takes at least 2 hours to delete blocks and then sync to the actual tip.

Anyway, thank you for your help. I understand now that I am not doing anything wrong; the system is just designed that way.

erikd commented 2 years ago

if you said that snapshots are taken every 500 blocks,

Snapshots are only taken every 500 blocks if there is no volatile data in memory that has not yet been inserted into the database.

I am not sure why it needs to delete these blocks, but maybe it would be OK to delete just the last 10 blocks or so to be sure, and only check whether the rest of the data is correct?

Unfortunately this is all incredibly complex and someone without a full knowledge of it speculating on how it might be improved is not helpful.

I understand now that I am not doing anything wrong; the system is just designed that way.

The design was forced by a number of rather onerous constraints that are too low level and complex to detail here.

erikd commented 2 years ago

From logs on my instance (running in a 16G VM, so very memory constrained):

[2022-01-10 05:48:18.85 UTC] Deleting 483 blocks up to slot 50227169
[2022-01-10 05:52:18.04 UTC] Blocks deleted 

which is about 4 minutes to delete 483 blocks.
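
As a sanity check, a tiny Haskell snippet (illustrative only, not part of db-sync) that parses the two timestamps above and prints the difference:

```haskell
import Data.Time (UTCTime, defaultTimeLocale, diffUTCTime, parseTimeM)

-- Subtract the two timestamps from the log above to get the rollback duration.
-- The trailing " UTC" is dropped so the plain date/time format parses cleanly.
main :: IO ()
main = do
  let fmt = "%Y-%m-%d %H:%M:%S%Q"
      parse s = parseTimeM True defaultTimeLocale fmt s :: Maybe UTCTime
  case (parse "2022-01-10 05:48:18.85", parse "2022-01-10 05:52:18.04") of
    (Just start, Just end) -> print (diffUTCTime end start)  -- prints 239.19s, i.e. about 4 minutes
    _                      -> putStrLn "could not parse timestamps"
```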

erikd commented 2 years ago

@kderme The rollbacks (of more than a couple of blocks) are only needed because of the volatile data. That means there are two ways of fixing this issue: add the signal handler (which is more complicated and may have its own issues) or remove the need for the volatile data (which would actually simplify db-sync). I would very much prefer to investigate the latter option.

kderme commented 2 years ago

We can remove the volatile data if:

erikd commented 2 years ago

I am hoping for the second option which from the POV of db-sync is the simplest solution.

BisonDee commented 2 years ago

We have also had issues with non-graceful termination of db-sync, sometimes rolling back 100s or 1000s of blocks. This takes hours to get synced back to the head for mainnet. I have heard from folks that getting back to the head is much faster on their deployments, but this is something we have had no luck with. We have experimented with Postgres configuration and huge RAM and CPU allotments, and these are on gp2 1 TB drives.

Handling signals (SIGTERM or SIGINT or both) with a graceful termination to avoid these deletions would be great.

erikd commented 2 years ago

@BisonDee Even with a graceful shutdown, db-sync will still need to roll back anything up to an epoch's worth of blocks due to the volatile data, as mentioned in one of the comments above. We are working to remove the use of volatile data, and when that is done, stopping/restarting db-sync should require drastically fewer blocks to be rolled back. Most restarts should only roll back at most one or two blocks. Anything more than a 3 block rollback would be extremely rare.

dorijan commented 2 years ago

@erikd that would be great!

BisonDee commented 2 years ago

@BisonDee Even with a graceful shutdown, db-sync will still need to roll back anything up to an epoch's worth of blocks due to the volatile data, as mentioned in one of the comments above. We are working to remove the use of volatile data, and when that is done, stopping/restarting db-sync should require drastically fewer blocks to be rolled back. Most restarts should only roll back at most one or two blocks. Anything more than a 3 block rollback would be extremely rare.

Thanks for the response. Is there a roadmap for this development that we might be able to track or a timeline for when we might see this roll out?

erikd commented 2 years ago

Unfortunately, it is incredibly difficult to predict how long this will take, mainly because this requires changes to lower level libraries.

air3ijai commented 2 years ago

Can someone please explain what rollbacks are and why we need them?

My question is about how we can set up GraphQL using DB-Sync with a restore from snapshot.

If we look at this comment - https://github.com/input-output-hk/cardano-graphql/issues/642#issuecomment-998604144, it is about stop-start of the DB-Sync after restoring from the snapshot.

Do we need to do it, and if yes, why? Will we lose some data, and how much?

After initial restore from the snapshot we see the following in db-sync logs:

```
ALTER TABLE
ALTER TABLE
All good!
[db-sync-node:Info:6] [2022-06-08 13:22:01.47 UTC] Schema migration files validated
[db-sync-node:Info:6] [2022-06-08 13:22:01.47 UTC] Running database migrations
[db-sync-node:Info:6] [2022-06-08 13:22:01.47 UTC] Enviroment variable DbSyncAbortOnPanic: False
[db-sync-node:Info:6] [2022-06-08 13:22:01.75 UTC] Using byron genesis file from: "/config/cardano-db-sync/../cardano-node/../genesis/byron.json"
[db-sync-node:Info:6] [2022-06-08 13:22:01.75 UTC] Using shelley genesis file from: "/config/cardano-db-sync/../cardano-node/../genesis/shelley.json"
[db-sync-node:Info:6] [2022-06-08 13:22:01.75 UTC] Using alonzo genesis file from: "/config/cardano-db-sync/../cardano-node/../genesis/alonzo.json"
[db-sync-node:Info:6] [2022-06-08 13:22:02.45 UTC] NetworkMagic: 764824073
[db-sync-node:Info:6] [2022-06-08 13:22:03.00 UTC] Initial genesis distribution present and correct
[db-sync-node:Info:6] [2022-06-08 13:22:03.00 UTC] Total genesis supply of Ada: 31112484745.000000
[db-sync-node:Info:6] [2022-06-08 13:22:03.00 UTC] Inserting Shelley Genesis distribution
[db-sync-node:Info:6] [2022-06-08 13:22:03.00 UTC] epochPluginOnStartup: Checking
[db-sync-node:Info:6] [2022-06-08 13:22:03.00 UTC] localInitiatorNetworkApplication: connecting to node via "/node-ipc/node.socket"
[db-sync-node.Subscription:Notice:48] [2022-06-08 13:22:03.00 UTC] Identity Starting Subscription Worker, valency 1
[db-sync-node.Subscription:Notice:49] [2022-06-08 13:22:03.00 UTC] Identity Connection Attempt Start, destination LocalAddress "/node-ipc/node.socket"
[db-sync-node.Subscription:Notice:49] [2022-06-08 13:22:03.00 UTC] Identity Connection Attempt End, destination LocalAddress "/node-ipc/node.socket" outcome: ConnectSuccessLast
[db-sync-node.Handshake:Info:49] [2022-06-08 13:22:03.00 UTC] WithMuxBearer (ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/ipc/node.socket"}) Send (ClientAgency TokPropose,MsgProposeVersions (fromList [(NodeToClientV_1,TInt 764824073),(NodeToClientV_2,TInt 764824073),(NodeToClientV_3,TInt 764824073),(NodeToClientV_4,TInt 764824073),(NodeToClientV_5,TInt 764824073),(NodeToClientV_6,TInt 764824073),(NodeToClientV_7,TInt 764824073),(NodeToClientV_8,TInt 764824073),(NodeToClientV_9,TInt 764824073),(NodeToClientV_10,TInt 764824073)]))
[db-sync-node.Subscription:Notice:48] [2022-06-08 13:22:03.02 UTC] Identity Required subscriptions started
[db-sync-node.Handshake:Info:49] [2022-06-08 13:22:03.05 UTC] WithMuxBearer (ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/ipc/node.socket"}) Recv (ServerAgency TokConfirm,MsgAcceptVersion NodeToClientV_10 (TInt 764824073))
[db-sync-node.Mux:Info:49] [2022-06-08 13:22:03.05 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/ipc/node.socket"} event: Handshake Client end, duration 0.047911747s
[db-sync-node.Mux:Info:50] [2022-06-08 13:22:03.05 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "/ipc/node.socket"} event: State: Mature
[db-sync-node:Info:53] [2022-06-08 13:22:03.05 UTC] Starting chainSyncClient
[db-sync-node:Info:53] [2022-06-08 13:22:03.06 UTC] Cardano.Db tip is at slot 62380796, block 7313329
[db-sync-node:Info:58] [2022-06-08 13:22:03.06 UTC] Running DB thread
[db-sync-node:Info:58] [2022-06-08 13:22:03.34 UTC] Rolling back to genesis
[db-sync-node:Info:58] [2022-06-08 13:22:41.25 UTC] Deleting 7313329 blocks up to slot 62380796
```
  1. What is "Rolling back to genesis" - will it start to sync from genesis?
  2. Why does it delete something after restoring from the snapshot?

Query to the DB

   sync_percent
------------------
 99.4950536444525
       behind_by
------------------------
 8 days 16:19:56.131231

And GraphQL is not responding:

curl -XPOST http://localhost:3100/graphql -H "Content-Type: application/json" -d  '{"query": "{ cardanoDbMeta { initialized syncPercentage }}"}'

curl: (52) Empty reply from server

air3ijai commented 2 years ago

After db-sync was stuck for ~2 days at the same message in the logs, we decided to restart it:

[db-sync-node:Info:523] [2022-06-10 11:10:56.72 UTC] Starting chainSyncClient
[db-sync-node:Info:523] [2022-06-10 11:10:56.73 UTC] Cardano.Db tip is at slot 62380796, block 7313329
[db-sync-node:Info:528] [2022-06-10 11:10:56.74 UTC] Running DB thread
[db-sync-node:Info:528] [2022-06-10 11:10:56.74 UTC] Rolling back to slot 62380796, hash 5e262cfeaa8f380a7ea1e1affa2a02c81e073cda5217587d7a6eb112cbe432fd
[db-sync-node:Info:528] [2022-06-10 11:10:56.75 UTC] No blocks need to be deleted
[db-sync-node.Subscription:Notice:47] [2022-06-10 11:10:58.61 UTC] Identity Required subscriptions started
[db-sync-node:Info:528] [2022-06-10 11:11:56.50 UTC] Found snapshot file for slot 62380796, hash 5e262cfeaa8f380a7ea1e1affa2a02c81e073cda5217587d7a6eb112cbe432fd
[db-sync-node:Info:531] [2022-06-10 11:12:07.16 UTC] getHistoryInterpreter: acquired
[db-sync-node:Info:528] [2022-06-10 11:12:07.16 UTC] thisIsAnUglyHack: Main thead
[db-sync-node:Info:542] [2022-06-10 11:12:07.16 UTC] Running Offline fetch thread
[db-sync-node:Info:528] [2022-06-10 11:12:50.17 UTC] File /var/lib/cdbsync/62380796-5e262cfeaa-341.lstate exists
[db-sync-node:Info:528] [2022-06-10 11:12:50.17 UTC] Starting at epoch 342
[db-sync-node:Info:528] [2022-06-10 11:12:52.14 UTC] Handling 1195415 stakes for epoch 342 slot 62380881, hash 00ab671aac6728ddddf8d9ba6938a523a528dc04baf609e0d6c3d2819b26f509

DB-Sync is at epoch 342 - that looks close to the current epoch 344, which means it will not start from genesis.

Postgres

sync_percent: 99.3859428294444
behind_by: 10 days 13:32:50.842566

GraphQL still doesn't work - curl: (56) Recv failure: Connection reset by peer

erikd commented 2 years ago

@air3ijai

Firstly, your questions have nothing to do with the topic being discussed in this issue.

Secondly, I know close to nothing about cardano-graphql . If you have questions about that, ask them in that repo.

Can someone please explain what rollbacks are and why we need them?

Rollbacks are part of the normal operation of a distributed ledger like Cardano or Bitcoin.

We need extra rollbacks in db-sync because db-sync keeps some volatile data in memory while it is queued to be inserted into the database. If there is any such data in memory when db-sync is stopped and restarted, it is lost, so we need to roll back to before the volatile data was pulled from the ledger and stored in the queue.
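
To illustrate the idea with invented names (these are not db-sync's actual types): the database tip can be ahead of the last point that is actually durable on disk, so on restart everything newer than the durable point has to be deleted and re-applied.

```haskell
-- Purely illustrative types and names; not db-sync's actual code.
newtype SlotNo = SlotNo Word
  deriving (Eq, Ord, Show)

data SyncState = SyncState
  { dbTip        :: SlotNo        -- newest block already written to PostgreSQL
  , volatileFrom :: Maybe SlotNo  -- oldest block whose data is still only in memory
  }

-- On restart the in-memory queue is gone, so the safe point to resume from is
-- just before the first block with un-flushed volatile data. Every block after
-- that point has to be deleted ("Deleting N blocks up to slot ...") and re-synced.
resumePoint :: SyncState -> SlotNo
resumePoint st = maybe (dbTip st) id (volatileFrom st)
```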

After db-sync was stuck for ~2 days at the same message in the logs, we decided to restart it

What do the node logs say? What hardware are you running it on? Are you meeting the minimum system requirements as per the README docs?

air3ijai commented 2 years ago

@erikd, thank you for the clarifications. Probably part of my question is not directly related to db-sync, but I found some information that graphql can't open its port until db-sync has finished syncing.

And it looks slightly unusual that after syncing from the snapshot it got stuck, so that we need to restart the db-sync container, after which it continues to follow the node.

Our first two experiments were very strange, because after restoring from a recent snapshot and restarting, db-sync continued from a very old epoch that was not related to the latest epoch in the snapshot.

This is the source of my question - why does db-sync, after restoring from the snapshot, not continue syncing from the latest snapshot block, and why do we see any kind of rollback?

After your explanation, we can state that the rollback we see in the logs should not be the cause of such behaviour.

At the moment db-sync got stuck, the node was following the chain and I didn't observe anything strange. Anyway, I can't say more right now, as the whole stack was restarted.

For now, we are running on AWS - i4i.xlarge / SSD I/O performance

4 vCPU / 32 GiB

1 x 937 GB AWS Nitro SSD
100% Random Read IOPS: 100,000
Write IOPS: 55,000

As of now, both db-sync and the node are on the same latest epoch - 344:

cardano-graphql-cardano-db-sync-extended-1  | [db-sync-node:Info:528] [2022-06-11 14:18:41.58 UTC] insertShelleyBlock(Alonzo): epoch 344, slot 63390829, block 7362365, hash 0548eca453174eafef86ceb1edeaba3b0675ba430aa82b065bd3a67e1659f915

cardano-graphql-cardano-node-ogmios-1       | {"severity":"Info","timestamp":"2022-06-11T14:18:40.648941607Z","thread":"944","message":{"Health":{"tag":"HealthTick","status":{"startTime":"2022-06-10T10:31:42.375398575Z","lastKnownTip":{"slot":63390829,"hash":"0548eca453174eafef86ceb1edeaba3b0675ba430aa82b065bd3a67e1659f915","blockNo":7362365},"lastTipUpdate":"2022-06-11T14:18:40.648781385Z","networkSynchronization":0.99999,"currentEra":"Alonzo","metrics":{"sessionDurations":{"max":0,"min":0,"mean":0},"totalMessages":0,"runtimeStats":{"currentHeapSize":723,"maxHeapSize":726,"gcCpuTime":155751766544,"cpuTime":207803941036},"activeConnections":0,"totalUnrouted":0,"totalConnections":0},"connectionStatus":"connected","currentEpoch":344,"slotInEpoch":146029}}},"version":"v5.1.0"}