IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0
3.05k stars 721 forks source link

[BUG] - Restarting node just got a LOT slower #5884

Open TerminadaPool opened 2 months ago

TerminadaPool commented 2 months ago

External

Area

Summary Nodes very slow to restart with significant delay during "Pushing ledger state" from 22% to 44%.

Steps to reproduce systemctl restart cardano-node

Expected behavior Restart on a low power ARM machine used to take 8.5 minutes, now takes 36 minutes.

Restart on a more powerful AMD machine used to take 2 minutes, now takes 6 minutes.

All the extra time occurs when "Pushing ledger state" from 22% to 45%.

System info (please complete the following information):

Logs

Logs from more powerful AMD machine running cardano-node version 8.9.3:

Jun 25 08:27:21 relay1 systemd[1]: Started cardano-node.service - Cardano Node.
... snip ...
Jun 25 08:28:39 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:28:39.58 UTC] Opened lgr db
Jun 25 08:28:39 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:28:39.58 UTC] Started initial chain selection
Jun 25 08:28:39 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:28:39.92 UTC] Pushing ledger state for block bc48fc99af50edfd1f0443d1c735743797d11c511d2ca36d1fb375c0b164b33e at slot 127657338. Progress: 0.00%
Jun 25 08:28:54 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:28:54.80 UTC] before next, messages elided = 127657363
Jun 25 08:28:54 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:28:54.80 UTC] Pushing ledger state for block be03bf8d2c52b0f81ea0f3496af44c9462262d0b003d632f742c69cb6b7ee2db at slot 127667331. Progress: 22.52%
... snip ...
Jun 25 08:32:58 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:32:58.36 UTC] before next, messages elided = 127667424
Jun 25 08:32:58 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:32:58.36 UTC] Pushing ledger state for block 489b5d6a0914bed294f47750e38f5b7ce6d118e804bd505ffcb8be3136a010ff at slot 127677401. Progress: 45.22%
... snip ...
Jun 25 08:33:16 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:16.05 UTC] before next, messages elided = 127677490
Jun 25 08:33:16 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:16.06 UTC] Pushing ledger state for block 2852062d7f2186f1dd8556b78639993b2a46868ca2ea1067f9d8751e06fd9d25 at slot 127687464. Progress: 67.90%
... snip ...
Jun 25 08:33:33 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:33.02 UTC] before next, messages elided = 127687563
Jun 25 08:33:33 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:33.02 UTC] Pushing ledger state for block 0b7b08976bcb3a8ece7da3bb95a3816d7394d456917b5983cf2077d54e8409ae at slot 127697558. Progress: 90.65%
... snip ...
Jun 25 08:33:40 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:40.25 UTC] before next, messages elided = 127697621
Jun 25 08:33:40 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:40.25 UTC] Pushing ledger state for block 9d2e8da71a8732120e7aa2ec1b4facf731789643484c23e7ac9cda0864598aad at slot 127701707. Progress: 100.00%
Jun 25 08:33:40 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:40.25 UTC] Valid candidate at tip bc48fc99af50edfd1f0443d1c735743797d11c511d2ca36d1fb375c0b164b33e at slot 127657338
Jun 25 08:33:40 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:40.25 UTC] Initial chain selected
Jun 25 08:33:40 relay1 cardano-node[183103]: [relay1:cardano.node.ChainDB:Info:9] [2024-06-24 22:33:40.25 UTC] Opened db with immutable tip at d25ceb3b1a757d5f874c62d0f93526722c5609ede55b31c9c5ad82e50f089bf7 at slot 127657337 and tip 9d2e8da71

The AMD machine is more powerful but it has similarly seen a significant increase in restart time with all the time increase occurring when pushing the ledger state from 22% to 45%.

From restart this machine took 1.5 minutes to get to "Pushing ledger state" progress 22%. But then to progress from 22% to 45% took 4 minutes. While the progress from 45% to 67% and 67% to 90% took only 18 seconds. In the past it only took around 15-20 seconds to progress the ledger state from 22% to 45% on this machine.

Something in the ledger is now making the nodes take a lot longer to push their ledger state. I also checked a relay running version 8.9.2 (similarly compiled with ghc 9.8.2) and it also now takes much longer to restart with all the extra time occurring when pushing the ledger state from 22% to 45%.

Additional context Something has changed in the ledger which is causing the node software to do a lot more processing. Maybe there is some particular transaction causing a problem??? Maybe there is a bug???

btbf commented 2 months ago

A similar issue is occurring on my server. It used to sync in around 4 minutes, but I just restarted it and it's stuck at 60% even after 15 minutes.

OS: Ubuntu cardano-node: 8.9.2 CPU: 14core

TerminadaPool commented 2 months ago

Things are getting much worse.

Restarting the node used to take 8.5 minutes on my ARM machine but it is now taking over 2hrs!!! (It takes over 1hr "Pushing ledger state" from 68% to 90% and then nearly another hour "Pushing ledger state" from 90% to 100%.)

One of my AMD machines that used to take 2.5 minutes to restart just took 35 minutes!!!

I am now quite fearful to restart my node running on a Contabo vm because this might take a couple of hours.

Any ideas what is going on???

HeptaSean commented 2 months ago

https://cardanoscan.io/address/addr1q9lywtk836axkaf985822lh033aj5l6nauau8kg7fhdkze9hmc43c96exwsre5xktq4td5h2mzfjmayhtuk44ryy4uas72wh4t and two other addresses push senseless transactions with 194 withdrawal validators each every minute or so causing all blocks to be almost full and all nodes to constantly have to evaluate these validators (that withdraw 0 ADA from unused stake addresses).

EDIT: And it started almost exactly 24 hours ago with https://cardanoscan.io/transaction/ab02ef16b0d863bfe4bf9f488873c967adfd461eb774c27689ac011227ea4f9f, so that fits your observation 13 hours ago.

rayanebel commented 2 months ago

Hello here,

Same thing on my side and I was not able to understand the reason but, event starting from a fresh snapshot, it took ~40 min to my node (producer & relayer) to process ledger state

Replaying ledger from snapshot at 345c904e5201ec9ef9d21dfdb060144b8055201bedc820e84351dfe76c6e68f8 at slot 127704286
Replayed block: slot 127704296 out of 127722624. Progress: 0.05%
Replayed block: slot 127720769 out of 127722624. Progress: 89.88%
Opened lgr db
Started initial chain selection
Pushing ledger state for block c326b4051567c51c7b16415e0e87a6ca622225eec3ed7597b4d0dc4402037d79 at slot 127722628. Progress: 0.00%
Pushing ledger state for block 5efd56b57a8e97c334fa0b72bc5a657b50c447a6e54bbc1a0f50ed7ce3774324 at slot 127732616. Progress: 19.67%
Pushing ledger state for block 665d46763ccff34f3e395d4a0ff46b63a5d61417a65260eb1eae819dad710790 at slot 127732648. Progress: 19.74%
Pushing ledger state for block b4bede6f51369880235eee856e0302bfd6696a16a6a441629d694f2a6d533675 at slot 127732667. Progress: 19.77%

I've also discovered that our node is loading a very old ledger state on a old slot (~19h ago). I thought the node took a new snapshot every 72 minutes. am i wrong ?

We are running our cardano instance with 4CPU and 32GB of RAM with an NVME disk of 250GB.

version: 8.9.3

rayanebel commented 2 months ago

Hi,

How can we easily check if we can safely restart our nodes ?

@HeptaSean How did you find this transaction quickly ? did you check each transactions for the last 24h ?

TerminadaPool commented 2 months ago

I just restarted my slow ARM node and it started in 11 minutes which is not unusual and certainly a lot better than over 2hrs like it was taking yesterday. The best I normally see when restarting this ARM node is around 8.5 minutes.

HeptaSean commented 2 months ago

@HeptaSean How did you find this transaction quickly ? did you check each transactions for the last 24h ?

Someone in the IOG Discord saw the huge transactions on eutxo.org and traced them to that address, those addresses. Then started looking at them in more detail.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

TerminadaPool commented 1 month ago

The issue was fixed by "hotfix" releases 8.9.4 and 8.12.2.

github-actions[bot] commented 2 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.