Open dorijan opened 2 years ago
There's unfortunately not much you can do currently.
We should install signal handlers for this, which create a disk snapshot before exiting.
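The proposed mechanism can be sketched roughly as follows. This is an illustrative Python sketch only (db-sync itself is written in Haskell), and write_snapshot is a hypothetical stand-in for flushing the in-memory volatile state to the --state-dir.

```python
# Illustrative sketch, not db-sync code: shows the shape of the proposed fix.
# write_snapshot() is a hypothetical placeholder for writing the in-memory
# volatile data to disk before the process exits.
import signal
import sys

def write_snapshot():
    print("snapshot written")  # placeholder for the real disk snapshot

def handle_term(signum, frame):
    write_snapshot()
    sys.exit(0)  # exit cleanly once the snapshot is on disk

# Handle both SIGTERM (what `docker stop` sends) and SIGINT (Ctrl-C).
signal.signal(signal.SIGTERM, handle_term)
signal.signal(signal.SIGINT, handle_term)
```

With handlers like these installed, a normal stop would leave a fresh snapshot behind, so the restart would have little or nothing to roll back.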
Ok, I understand! Is there a way to minimize the number of blocks that will be deleted? I mean, at what point in the db-sync process is it best to terminate? I would like to create disk images for later use when I need them, so I would like to minimize deletion time. Thank you.
In the folder that you specify with --state-dir, snapshot files are added periodically. When there is a fresh snapshot, that is the best time to stop db-sync to minimize the rollback time after restarting.
For release 12.0.0, new snapshots are taken:
Directly after a message like Deleting 3157 blocks up to slot 50126122 you should see a message Blocks deleted.
If you subtract the earlier time stamp from the later one, you get the time required for the rollback. Assuming you have the database set up correctly, this should not take as much as 30 minutes.
Yes, that is about right (30 minutes), but I am not sure why it was 3157 blocks and then 3628, if snapshots are taken every 500 blocks, since my node and everything was fully synced and following the node tip. Shouldn't it then be less than 500? This happened two times in a row when I did docker stop db-sync, so I would say it is not a coincidence.
I am not sure why it needs to delete these blocks, but maybe it would be OK to delete just the last 10 blocks to be safe, and only check whether the data is correct for the rest?
I just had an issue where I needed new servers ASAP that were on standby due to an increase in traffic, but could not get them up in time, so I am looking for a way to mitigate this problem. I have running servers, and I can stop them, clone them and have them working quickly, but it takes at least 2 hours to delete the blocks and then sync to the actual tip.
Anyway, thank you for your help. I understand now that I am not doing anything wrong; the system is just designed that way.
if snapshots are taken every 500 blocks,
Snapshots are only taken every 500 blocks if there is no volatile data in memory that has not yet been inserted into the database.
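In other words, the 500-block interval is necessary but not sufficient. A sketch of that condition (the interval and names here are assumptions drawn from this thread, not from the db-sync source):

```python
# Assumed policy from this thread: snapshot on a 500-block boundary, but only
# when the in-memory queue of not-yet-inserted blocks is empty.
SNAPSHOT_INTERVAL = 500

def should_take_snapshot(block_no: int, volatile_queue_len: int) -> bool:
    return block_no % SNAPSHOT_INTERVAL == 0 and volatile_queue_len == 0

print(should_take_snapshot(50_000, 0))   # True: on boundary, queue empty
print(should_take_snapshot(50_000, 37))  # False: volatile data still queued
```

This would explain rollbacks of well over 500 blocks at the tip: if volatile data is always pending, boundaries can pass without a snapshot being written.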
I am not sure why it needs to delete these blocks, but maybe it would be OK to delete just the last 10 blocks to be safe, and only check whether the data is correct for the rest?
Unfortunately this is all incredibly complex and someone without a full knowledge of it speculating on how it might be improved is not helpful.
I understand now that I am not doing anything wrong, just the system is designed that way.
The design was forced by a number of rather onerous constraints that are too low level and complex to detail here.
From logs on my instance (running in a 16G VM, so very memory constrained):
[2022-01-10 05:48:18.85 UTC] Deleting 483 blocks up to slot 50227169
[2022-01-10 05:52:18.04 UTC] Blocks deleted
which is just under 4 minutes to delete 483 blocks.
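The subtraction suggested above, applied to those two log lines (plain Python, just to show the arithmetic):

```python
# Rollback duration = "Blocks deleted" timestamp minus "Deleting ..." timestamp.
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.strptime("2022-01-10 05:48:18.85", fmt)
end = datetime.strptime("2022-01-10 05:52:18.04", fmt)
delta = end - start
print(delta)  # 0:03:59.190000 -> roughly half a second per block for 483 blocks
```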
@kderme The rollbacks (of more than a couple of blocks) are only needed because of the volatile data. That means there are two ways of fixing this issue: add the signal handler (which is more complicated and may have its own issues) or remove the need for the volatile data (which would actually simplify db-sync). I would very much prefer to investigate the latter option.
We can remove the volatile data if:
I am hoping for the second option, which from the POV of db-sync is the simplest solution.
We have also had issues with non-graceful termination of db-sync, sometimes rolling back 100s or 1000s of blocks. This takes hours to get synced back to head for mainnet. I have heard from folks that on their deployments getting back to head is much faster, but this is something we have had no luck with. We have experimented with Postgres configuration and huge RAM and CPU allotments, and these are on gp2 1T drives.
Handling signals (SIGTERM or SIGINT or both) with a graceful termination to avoid these deletions would be great.
@BisonDee Even with a graceful shutdown, db-sync will still need to roll back anything up to an epoch's worth of blocks due to the volatile data, as mentioned in one of the comments above. We are working to remove the use of volatile data, and when that is done, stopping/restarting db-sync should require drastically fewer blocks to be rolled back. Most restarts should only roll back at most one or two blocks; anything more than a 3-block rollback would be extremely rare.
@erikd that would be great!
@BisonDee Even with a graceful shutdown, db-sync will still need to roll back anything up to an epoch's worth of blocks due to the volatile data, as mentioned in one of the comments above. We are working to remove the use of volatile data, and when that is done, stopping/restarting db-sync should require drastically fewer blocks to be rolled back. Most restarts should only roll back at most one or two blocks; anything more than a 3-block rollback would be extremely rare.
Thanks for the response. Is there a roadmap for this development that we might be able to track or a timeline for when we might see this roll out?
Unfortunately, it is incredibly difficult to predict how long this will take, mainly because this requires changes to lower level libraries.
Can someone please explain what rollbacks are and why we need them?
My question is about how to set up GraphQL using DB-Sync with a restore from a snapshot.
If we look at this comment - https://github.com/input-output-hk/cardano-graphql/issues/642#issuecomment-998604144 - it is about stopping and starting DB-Sync after restoring from the snapshot.
Do we need to do that and, if yes, why? Will we lose some data, and how much?
Query to the DB:
sync_percent: 99.4950536444525
behind_by: 8 days 16:19:56.131231
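For reference, the two numbers are related: sync_percent compares the span of chain history already in the DB with the span up to now, while behind_by is simply now minus the newest synced block. A sketch with made-up example timestamps (the genesis date is approximate, and this is not the actual db-sync query):

```python
# Illustrative only: hypothetical timestamps chosen to roughly reproduce the
# numbers quoted above, not values read from any real database.
from datetime import datetime

chain_start = datetime(2017, 9, 23)      # mainnet start, approximate
db_tip = datetime(2022, 6, 2, 7, 40)     # hypothetical newest block in the DB
now = datetime(2022, 6, 10, 23, 59)      # hypothetical current time

behind_by = now - db_tip
sync_percent = 100 * (db_tip - chain_start) / (now - chain_start)
print(behind_by)               # 8 days, 16:19:00
print(round(sync_percent, 2))  # ~99.5, matching the shape of the output above
```

This is why a tip that is "8 days behind" still shows ~99.5% synced: a week is tiny compared to nearly five years of chain history, yet it can still mean hours of catch-up.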
And GraphQL is not responding
curl -XPOST http://localhost:3100/graphql -H "Content-Type: application/json" -d '{"query": "{ cardanoDbMeta { initialized syncPercentage }}"}'
curl: (52) Empty reply from server
After db-sync had been stuck for ~2 days at the same message in the logs, we decided to restart it
[db-sync-node:Info:523] [2022-06-10 11:10:56.72 UTC] Starting chainSyncClient
[db-sync-node:Info:523] [2022-06-10 11:10:56.73 UTC] Cardano.Db tip is at slot 62380796, block 7313329
[db-sync-node:Info:528] [2022-06-10 11:10:56.74 UTC] Running DB thread
[db-sync-node:Info:528] [2022-06-10 11:10:56.74 UTC] Rolling back to slot 62380796, hash 5e262cfeaa8f380a7ea1e1affa2a02c81e073cda5217587d7a6eb112cbe432fd
[db-sync-node:Info:528] [2022-06-10 11:10:56.75 UTC] No blocks need to be deleted
[db-sync-node.Subscription:Notice:47] [2022-06-10 11:10:58.61 UTC] Identity Required subscriptions started
[db-sync-node:Info:528] [2022-06-10 11:11:56.50 UTC] Found snapshot file for slot 62380796, hash 5e262cfeaa8f380a7ea1e1affa2a02c81e073cda5217587d7a6eb112cbe432fd
[db-sync-node:Info:531] [2022-06-10 11:12:07.16 UTC] getHistoryInterpreter: acquired
[db-sync-node:Info:528] [2022-06-10 11:12:07.16 UTC] thisIsAnUglyHack: Main thead
[db-sync-node:Info:542] [2022-06-10 11:12:07.16 UTC] Running Offline fetch thread
[db-sync-node:Info:528] [2022-06-10 11:12:50.17 UTC] File /var/lib/cdbsync/62380796-5e262cfeaa-341.lstate exists
[db-sync-node:Info:528] [2022-06-10 11:12:50.17 UTC] Starting at epoch 342
[db-sync-node:Info:528] [2022-06-10 11:12:52.14 UTC] Handling 1195415 stakes for epoch 342 slot 62380881, hash 00ab671aac6728ddddf8d9ba6938a523a528dc04baf609e0d6c3d2819b26f509
DB-Sync: epoch 342 - it looks like it is near the current epoch 344, which means it will not start from genesis.
Postgres
sync_percent: 99.3859428294444
behind_by: 10 days 13:32:50.842566
GraphQL: curl: (56) Recv failure: Connection reset by peer - still doesn't work
@air3ijai
Firstly, your questions have nothing to do with the topic being discussed in this issue.
Secondly, I know close to nothing about cardano-graphql. If you have questions about that, ask them in that repo.
Can someone please explain what rollbacks are and why we need them?
Rollbacks are part of the normal operation of a distributed ledger like Cardano or Bitcoin.
We need extra rollbacks in db-sync because db-sync keeps some volatile data in memory as it is queued to be put into the database. If there is any data in memory when db-sync is stopped and restarted, that data will be lost, so db-sync needs to roll back to a point before the volatile data was pulled from the ledger and stored in the queue.
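A toy model of what is described above, with invented numbers (db-sync itself is Haskell; this only illustrates why a restart forces a rollback to the last persisted block):

```python
# Toy model: blocks pulled from the ledger wait in an in-memory queue before
# being written to the database. The numbers here are invented.
from collections import deque

database_tip = 100                  # last block actually written to Postgres
volatile = deque(range(101, 601))   # queued blocks held only in memory

# Process exits here: everything in `volatile` is gone with the process.
volatile.clear()

# On restart, the only safe resume point is the persisted tip, so db-sync
# rolls back to it and re-fetches everything after it from the node.
restart_from = database_tip
print(restart_from)  # 100
```

This is also why a graceful shutdown alone does not help today: the queue is lost either way until the snapshot-on-exit or volatile-data-removal work lands.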
After db-sync had been stuck for ~2 days at the same message in the logs, we decided to restart it
What do the node logs say? What hardware are you running it on? Are you meeting the minimum system requirements as per the README docs?
@erikd, thank you for the clarifications. Probably part of my question is not directly related to db-sync, but I found some information that graphql can't open the port until db-sync has finished syncing.
And it looks slightly unusual that after syncing from a snapshot it got stuck, so we needed to restart the db-sync container, after which it continued to follow the node.
Our first two experiments were very strange: after restoring from a recent snapshot and restarting, db-sync continued from a very old epoch unrelated to the latest epoch in the snapshot.
This is the source of my question: why did db-sync, after restoring from a snapshot, continue to sync not from the latest snapshot block, and why do we see any kind of rollback?
After your explanation, we can state that the rollback we see in the logs should not be the cause of such behaviour.
At the moment when db-sync got stuck, the node was following the chain and I didn't observe anything strange. Anyway, I can't say more right now, as the whole stack was restarted.
For now, we are running on AWS - i4i.xlarge / SSD I/O performance
4 vCPU / 32 GiB
1 x 937 AWS Nitro SSD
100% Random Read IOPS: 100,000
Write IOPS: 55,000
As of now, both db-sync and the node are on the same latest epoch - 344
cardano-graphql-cardano-db-sync-extended-1 | [db-sync-node:Info:528] [2022-06-11 14:18:41.58 UTC] insertShelleyBlock(Alonzo): epoch 344, slot 63390829, block 7362365, hash 0548eca453174eafef86ceb1edeaba3b0675ba430aa82b065bd3a67e1659f915
cardano-graphql-cardano-node-ogmios-1 | {"severity":"Info","timestamp":"2022-06-11T14:18:40.648941607Z","thread":"944","message":{"Health":{"tag":"HealthTick","status":{"startTime":"2022-06-10T10:31:42.375398575Z","lastKnownTip":{"slot":63390829,"hash":"0548eca453174eafef86ceb1edeaba3b0675ba430aa82b065bd3a67e1659f915","blockNo":7362365},"lastTipUpdate":"2022-06-11T14:18:40.648781385Z","networkSynchronization":0.99999,"currentEra":"Alonzo","metrics":{"sessionDurations":{"max":0,"min":0,"mean":0},"totalMessages":0,"runtimeStats":{"currentHeapSize":723,"maxHeapSize":726,"gcCpuTime":155751766544,"cpuTime":207803941036},"activeConnections":0,"totalUnrouted":0,"totalConnections":0},"connectionStatus":"connected","currentEpoch":344,"slotInEpoch":146029}}},"version":"v5.1.0"}
Hi all, what is the right way to shut down db-sync? The last few times I had to shut down my servers, every time I had blocks that needed to be deleted, and this takes time: roughly 1 minute for every 100 blocks, so at least 30-40 minutes for deletion, and after that about 1 hour to sync back up to date. So what is the right way to shut down db-sync? I tried
docker stop
and docker-compose down, and both ways I got this block deletion. From the log:
Deleting 3157 blocks up to slot 50126122
Deleting 3628 blocks up to slot 50136038
Thank you