AntelopeIO / leap

C++ implementation of the Antelope protocol

IF: Command in leap-util to reset persisted safety data for finalizer to LIB #1576

Closed arhag closed 8 months ago

arhag commented 1 year ago

Ultimately, we want to persist the reversible blocks database in a smarter way (along with the QC chains) so that enough data is available post-crash to safely recover liveness without losing any blocks. That is the plan as a further enhancement to Leap post 5.0.

For Leap 5.0, we only durably store the minimal information a finalizer needs to remain safe after a nodeos process crash (see https://github.com/AntelopeIO/leap/issues/1521). The finalizer then relies on the other nodes in the network having enough information to let it safely recover to the point where it can start voting as part of the HotStuff algorithm and contribute to liveness.

But if enough finalizers in the network suddenly crash around the same time, they may lose important liveness data (highest QC, reversible blocks up to _b_lock), and that loss can prevent them from even collectively working together to safely recover liveness for the network. In this case, the blockchain can keep producing reversible blocks but LIB would not advance.

To recover from such an extreme situation before the post-5.0 enhancements are available, we need a backup mechanism that allows a finalizer to compromise their safety protections for the greater goal of allowing the network to recover liveness. This mechanism is also useful when the finalizer loses or accidentally deletes the file in the blocks/reversible directory that persists the information needed to protect their safety. Note that nodeos will attempt to "fail safe" if it starts up with that file missing, which comes at the cost of liveness.

This backup mechanism should be provided as sub-commands within a finality command in the leap-util program. First, there should be a sub-command to simply explore the entries in the persisted file. Perhaps there should be another sub-command to delete the entry referenced by a BLS finalizer public key from the persisted file. And, more pertinent to this issue, there should be a sub-command that (re)sets the entry associated with the specified BLS finalizer public key so that _b_lock is the last irreversible block, _vheight is the block height of the last irreversible block, and the phase counter is 2. With that entry in place, the node can immediately participate in the finality consensus process with block proposals built directly off the last irreversible block (which enough nodes must have durably persisted to disk), thus enabling liveness but at the risk of safety.
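The reset and delete operations described above could look roughly like the following. This is only a minimal sketch: `block_ref`, `safety_entry`, `safety_file`, `reset_to_lib`, and `erase_entry` are hypothetical names invented for illustration, the persisted file is modeled as an in-memory map, and none of this reflects Leap's actual on-disk format or the leap-util interface.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical types for illustration only; not Leap's real data structures.
struct block_ref {
    std::string   id;      // block ID (a plain string here for simplicity)
    std::uint32_t height;  // block height
};

struct safety_entry {
    block_ref     b_lock;         // locked block (_b_lock in the text above)
    std::uint32_t v_height;       // height of the last vote (_vheight)
    std::uint8_t  phase_counter;  // phase counter
};

// Assumption: the persisted file is modeled as a map keyed by the
// BLS finalizer public key (kept as an opaque string).
using safety_file = std::map<std::string, safety_entry>;

// (Re)set the entry for `finalizer_key` as if the last irreversible block
// (LIB) were the locked block: _b_lock = LIB, _vheight = LIB height,
// phase counter = 2. Creates the entry if it does not exist yet.
inline const safety_entry& reset_to_lib(safety_file& file,
                                        const std::string& finalizer_key,
                                        const block_ref& lib) {
    assert(!finalizer_key.empty());
    safety_entry& entry = file[finalizer_key];
    entry.b_lock        = lib;
    entry.v_height      = lib.height;
    entry.phase_counter = 2;
    return entry;
}

// Delete the entry for `finalizer_key`; returns true if one was removed.
inline bool erase_entry(safety_file& file, const std::string& finalizer_key) {
    return file.erase(finalizer_key) > 0;
}
```

Keying the map by public key mirrors the idea that one persisted file may hold entries for several finalizers, so resetting one entry leaves the safety information of the others untouched.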

arhag commented 8 months ago

Overcome by events.

Now the finalizer can just delete the finalizer safety information file.