[Bug] Client Node Block Syncing Fails Due to Current/Next Round Comparison

fulltimemike commented 6 months ago

🐛 Bug Report

Sometimes, after a client node is restarted, the following error appears: The next block (X) is invalid - Failed to speculate on transactions - Failed to post-ratify - Next round Y must be greater than current round Y. This error stops the client from syncing, and further restarts do not fix it. To resume syncing, the client ledger must be modified: either reset the ledger so the client resyncs from genesis, or load a snapshot into the client.
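For readers unfamiliar with the message, here is a minimal sketch of the kind of strict round-ordering check that would produce it. This is an illustration only, not the snarkVM code: the function name `check_next_round` and its signature are hypothetical, and only the error text is taken from the report. Since the quoted error shows the same value Y on both sides, the sketch treats an equal round as a failure.

```rust
/// Hypothetical helper (not the actual snarkVM implementation):
/// accept a block only if its round is strictly greater than the
/// round the ledger currently records.
fn check_next_round(current_round: u64, next_round: u64) -> Result<(), String> {
    // An equal round is rejected as well, which matches the
    // "Next round Y must be greater than current round Y" case in the report.
    if next_round <= current_round {
        return Err(format!(
            "Next round {next_round} must be greater than current round {current_round}"
        ));
    }
    Ok(())
}
```

Under these assumptions, `check_next_round(252154, 252154)` returns an error with the same wording as the message above, while `check_next_round(252153, 252154)` succeeds. A check of this shape fails whenever the ledger's recorded round has already reached the round of the block being added.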

I'm uncertain whether this bug is directly in snarkOS, or if it is a problem with snarkVM. The specific error is thrown here.

Logs directly before the bug is thrown.

Interestingly, in this example the logs show blocks and rounds being added to the ledger that are much further ahead (block: 185032, round: 412383) than the block and round named in the error (block: 111196, round: 252154). I'm not sure why the store is apparently adding earlier rounds and blocks when it has already passed this point.

Steps to Reproduce

Across multiple canary net client nodes, we have observed behavior where restarting the node causes syncing to fail. This bug is nondeterministic, but we have seen that restarting a client node enough times will cause the error to pop up. It may be necessary for the client to be actively syncing during restarts to cause this bug, but I can't be certain.

Expected Behavior

Restarting a client node should not cause the client to get stuck permanently when syncing.

Your Environment

An EC2 Linux machine running a fork of snarkOS with commits up to https://github.com/AleoNet/snarkOS/commit/6aba25d9193c30c82c9762130499554f5c9fea1a.

Meshiest commented 6 months ago

Flat lines in this chart indicate the issue occurring

network topology:

10 validator devnet on AWS c6a.8xlarges
0 clients
no dedicated tx cannon

To reproduce, use automation that resets the ledger of the same 2 validators every 30 minutes.

We frequently run into this issue as early as the first 500 blocks, on either or both of the 2 reset validators after they reach tip.

logs in gdrive

notes:

we are running a wrapper around snarkos to make checkpoints but the core snarkos code is only modified with the canary patch
rebooting from this state usually results in a "missing block hash" corrupted ledger error
we were able to reproduce this by locally running 10 validators on the same machine

raychu86 commented 6 months ago

We have a tentative fix for this issue for validators - https://github.com/AleoHQ/snarkOS/pull/3232. The fix is currently undergoing burn-in testing and internal verification.
