ava-labs / avalanchego

Go implementation of an Avalanche node.
https://avax.network
BSD 3-Clause "New" or "Revised" License

Lag/stop in C-Chain atomic trie commits causes long downtime on restart #1595

Open leopaul36 opened 1 year ago

leopaul36 commented 1 year ago

Describe the bug: I am running multiple mainnet archival nodes ("state-sync-enabled": false for the C-Chain) and I noticed that the committed atomic trie process sometimes lags.

For example, I restarted the node at 06-06|12:47:39.910, and the logs show that the shutdown was clean:

INFO [06-06|12:47:39.910] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:949: Blockchain stopped

But then on restart, it had a lot of blocks to fetch and execute, which caused several hours of downtime:

[06-06|12:51:01.328] INFO <C Chain> proposervm/vm.go:356 block height index was successfully verified
[06-06|12:51:01.332] INFO <C Chain> snowman/transitive.go:90 initializing consensus engine
INFO [06-06|12:51:01.337] <C Chain> github.com/ava-labs/coreth/plugin/evm/vm.go:1171: Enabled APIs: eth, eth-filter, net, web3, internal-eth, internal-blockchain, internal-transaction, avax
[06-06|12:51:01.344] INFO <C Chain> bootstrap/bootstrapper.go:115 starting bootstrapper
[06-06|12:51:14.679] INFO <C Chain> common/bootstrapper.go:244 bootstrapping started syncing {"numVerticesInFrontier": 1}
[06-06|12:52:04.914] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 5000, "numTotalBlocks": 45492, "eta": "6m47s"}
[06-06|12:52:54.860] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 10000, "numTotalBlocks": 45492, "eta": "5m56s"}
[06-06|12:53:38.089] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 15000, "numTotalBlocks": 45492, "eta": "4m52s"}
[06-06|12:54:16.050] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 20000, "numTotalBlocks": 45492, "eta": "3m51s"}
[06-06|12:54:51.283] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 25000, "numTotalBlocks": 45492, "eta": "2m58s"}
[06-06|12:55:34.621] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 30000, "numTotalBlocks": 45492, "eta": "2m14s"}
[06-06|12:56:24.199] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 35000, "numTotalBlocks": 45492, "eta": "1m33s"}
[06-06|12:57:07.685] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 40000, "numTotalBlocks": 45492, "eta": "48s"}
[06-06|12:57:59.160] INFO <C Chain> bootstrap/bootstrapper.go:495 fetching blocks {"numFetchedBlocks": 45000, "numTotalBlocks": 45492, "eta": "4s"}
[06-06|12:58:06.461] INFO <C Chain> bootstrap/bootstrapper.go:554 executing blocks {"numPendingJobs": 45492}
[06-06|12:58:36.478] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 41, "numToExecute": 45492, "eta": "9h14m35s"}
[06-06|12:59:06.841] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 87, "numToExecute": 45492, "eta": "8h45m12s"}
[06-06|12:59:41.725] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 104, "numToExecute": 45492, "eta": "11h32m55s"}
[06-06|13:00:12.048] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 121, "numToExecute": 45492, "eta": "13h4m51s"}
[06-06|13:00:42.085] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 175, "numToExecute": 45492, "eta": "11h11m39s"}
[06-06|13:01:12.158] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 231, "numToExecute": 45492, "eta": "10h6m25s"}
[06-06|13:01:42.198] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 301, "numToExecute": 45492, "eta": "8h59m50s"}
[06-06|13:02:12.260] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 400, "numToExecute": 45492, "eta": "7h41m49s"}
...
[06-06|20:47:44.093] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 45457, "numToExecute": 45492, "eta": "22s"}
[06-06|20:48:19.094] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 45468, "numToExecute": 45492, "eta": "15s"}
[06-06|20:48:41.459] INFO <C Chain> queue/jobs.go:224 executed operations {"numExecuted": 45492}

What bugs me is that the last committed atomic trie log entry prior to the restart was this one:

INFO [06-06|12:24:59.565] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x82d6742989f07672825adeecdd7104b8e04e47760a4ffe297cdec70541767fc9 height=30,937,088

But at the actual time of the restart, the block height was ~30,982,738. It seems logical that the node had to fetch 45,492 blocks, since 30,982,738 - 30,937,088 = 45,650, roughly the difference between the latest committed height and the current height.

I restarted a bunch of times but could not reproduce this on this node: the committed atomic trie operations keep up with the current height every 4,096 blocks and the restarts are fast.

I have a bigger problem on another node (currently up and synced): its latest committed atomic trie log entry dates back 48 hours:

INFO [06-05|09:05:40.399] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0xec30bf9fc37404f621fa594d3d9375906d7eea44c79e724bc3f6d8ff1277b4b1 height=30,801,920

So I guess that whenever I restart this node, the downtime for resyncing will be very long.
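For reference, a rough way to estimate the re-execution backlog before restarting could look like the sketch below. This is not an official tool; it assumes C.log lives under /var/log/avalanche/avalanchego and that the node's HTTP API listens on the default 127.0.0.1:9650, so adjust both to your setup.

#!/usr/bin/env bash
# Rough estimate of the re-execution backlog on restart: compare the last
# "committed atomic trie" height in C.log with the current C-Chain height.
# The log path and RPC endpoint below are assumptions; adjust to your setup.

LOG=/var/log/avalanche/avalanchego/C.log
RPC=http://127.0.0.1:9650/ext/bc/C/rpc

# Last committed atomic trie height, e.g. "height=30,937,088" -> 30937088
last_commit=$(grep "committed atomic trie" "$LOG" | tail -1 \
  | sed -n 's/.*height=\([0-9,]*\).*/\1/p' | tr -d ',')

# Current C-Chain height via eth_blockNumber (result is hex)
current_hex=$(curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","id":1,"method":"eth_blockNumber","params":[]}' \
  "$RPC" | sed -n 's/.*"result":"0x\([0-9a-fA-F]*\)".*/\1/p')
current=$((16#$current_hex))

echo "last committed atomic trie height: $last_commit"
echo "current C-Chain height:            $current"
echo "blocks to re-execute on restart:   ~$((current - last_commit))"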

Expected behavior: The C-Chain bootstraps quickly on node restart.

Logs: Attaching logs from the node: avalanchego_logs_2306080955.tar.gz

Operating System

Additional context: I noticed some "handling sync message took longer than expected" and "handling chan message took longer than expected" messages in the logs and I'm wondering if they might be related to my problem.

This might be a https://github.com/ava-labs/coreth/ issue; please move it if needed.

meijiesky commented 1 year ago

I am hitting almost the same issue. I restarted the node and it seems to be resyncing all over again.

leopaul36 commented 1 year ago

I am still encountering this issue after resetting the node with a fresh database.

I can see the committed atomic trie messages were happening on schedule every 4,096 blocks, but then they stopped:

root@archival-mainnet-01:/var/log/avalanche/avalanchego# grep -r "committed atomic trie" C.log
INFO [07-01|21:48:29.053] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0xcee158b7d5edd08f2b3c12d54b1eb87d4c6b89eee56f3efea08ab20f00b0b963 height=32,055,296
INFO [07-02|00:03:19.439] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x1560321c1d24cf1b84260320408f2675de728c0123eba8d91d01ba2c1303ad00 height=32,059,392
INFO [07-02|02:37:17.515] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x09e9018b02ca82397ef7af606e3eab682d671b23876bab5fa39ec92e706fc8e0 height=32,063,488
INFO [07-02|04:43:31.913] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x9d88e1be540c2ab41e36c42f460fb16c2998cf472bc628ecf42d8a7cb4f55cdb height=32,067,584
INFO [07-02|10:42:03.716] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x6ea265842e7c248a0c014abe3a3898cda051339ab79e92929b364e8a855676a3 height=32,071,680
INFO [07-02|14:58:27.647] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x97c0c8220d61444b2d3129351b23dc7894526d452ab8a2e4a5da1e49f1520e46 height=32,075,776
INFO [07-02|19:45:05.071] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x45b9208a1c783743be7b032a0837bc005cb25a7f0cdf926092c9bc84f8bc8adf height=32,079,872
INFO [07-03|03:09:45.085] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x9d09050420d40475bb74192acc5ce2c770d3449d5d5199b002a3695195e3f0cc height=32,083,968
INFO [07-04|03:32:45.407] <C Chain> github.com/ava-labs/coreth/plugin/evm/atomic_trie.go:209: committed atomic trie root=0x3cf094fffc0527d412cbb1d6f0095391076399a202412a58ca840f446feef715 height=32,088,064

The latest commit is now 75k+ blocks behind, so the downtime after a restart will be very long.
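For what it's worth, a quick (unofficial) check on the same C.log is to print the height delta between successive "committed atomic trie" lines; in the output above every delta comes out as 4,096, so the slowdown shows up in the wall-clock time between commits, not in the commit interval itself:

# Unofficial sanity check: height delta between successive commits.
# Each line should print 4096; the lag is visible only in the log timestamps.
grep "committed atomic trie" C.log \
  | sed -n 's/.*height=\([0-9,]*\).*/\1/p' | tr -d ',' \
  | awk 'NR > 1 { print $1 - prev } { prev = $1 }'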

I am still seeing a lot of "handling sync message took longer than expected" messages, like:

[07-04|09:00:11.862] WARN <C Chain> handler/handler.go:487 handling sync message took longer than expected {"processingTime": "8m7.572991289s",

Could this be related?

This node has 32GB RAM / 8 vCPUs and is doing nothing else other than running AvalancheGo:

[screenshot: resource usage]

Also attaching the latest logs of the node:

avalanchego_logs_230704.tar.gz

github-actions[bot] commented 1 year ago

This issue has become stale because it has been open 60 days with no activity. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.