Ultimately, we want to persist the reversible blocks database in a smarter way (along with the QC chains) so that enough data is available after a crash to safely recover liveness without losing any blocks. That is planned as a further enhancement to Leap after 5.0.
For Leap 5.0, we durably store only the minimal information a finalizer machine needs to remain safe after a nodeos process crash (see https://github.com/AntelopeIO/leap/issues/1521). The node then relies on the other nodes in the network having enough information to let it safely recover to the point where it can start voting as part of the HotStuff algorithm and contribute to liveness.
But if enough finalizers in the network crash at around the same time, they may lose important liveness data (highest QC, reversible blocks up to _b_lock), and that loss prevents them from even collectively working together to safely recover liveness for the network. In this case, the blockchain can keep producing reversible blocks, but LIB would not advance.
To recover from such an extreme situation before the post-5.0 enhancements land, we need a backup mechanism that allows a finalizer to compromise their safety protections for the greater goal of letting the network recover liveness. This mechanism is also useful when the finalizer loses or accidentally deletes the file in the blocks/reversible directory that persists the information needed to protect their safety; note that nodeos will attempt to "fail safe" if it starts up with that file missing, which comes at the cost of liveness.
This backup mechanism should be provided as sub-commands within a finality command in the leap-util program. First, there should be a sub-command to simply explore the entries in the persisted file. Perhaps there should be another sub-command to delete the entry referenced by a BLS finalizer public key from the persisted file. And, more pertinent to this issue, there should be a sub-command that (re)sets the entry associated with the specified BLS finalizer public key so that its _b_lock is the last irreversible block and its _vheight is the block height of the last irreversible block with a phase counter of 2. The node can then immediately participate in the finality consensus process with block proposals built directly off the last irreversible block (which enough nodes must have durably persisted to disk), thus enabling liveness but at the risk of safety.
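To make the intended semantics of the delete and (re)set sub-commands concrete, here is a minimal sketch in Python. It models the persisted file as an in-memory mapping from BLS finalizer public key to an entry. Everything in it is an assumption made for illustration: the names `ViewNumber`, `SafetyEntry`, `reset_to_lib`, and `delete_entry`, the field layout, and the placeholder key and block ID strings are hypothetical and do not reflect the actual on-disk format or the eventual leap-util interface.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ViewNumber:
    """Hypothetical stand-in for _vheight: a block height plus a phase counter."""
    block_height: int
    phase_counter: int


@dataclass(frozen=True)
class SafetyEntry:
    """Hypothetical per-finalizer safety entry (layout assumed, not the real format)."""
    v_height: ViewNumber
    b_lock: str  # ID of the locked block


def reset_to_lib(entries: Dict[str, SafetyEntry], finalizer_key: str,
                 lib_id: str, lib_height: int) -> Dict[str, SafetyEntry]:
    """Return a copy of entries where finalizer_key is (re)set as if the last
    irreversible block were the locked block, with a phase counter of 2."""
    updated = dict(entries)
    updated[finalizer_key] = SafetyEntry(
        v_height=ViewNumber(block_height=lib_height, phase_counter=2),
        b_lock=lib_id,
    )
    return updated


def delete_entry(entries: Dict[str, SafetyEntry],
                 finalizer_key: str) -> Dict[str, SafetyEntry]:
    """Return a copy of entries without the entry for finalizer_key."""
    return {key: entry for key, entry in entries.items() if key != finalizer_key}


if __name__ == "__main__":
    # Placeholder key and block IDs, purely illustrative.
    entries = {"PUB_BLS_example": SafetyEntry(ViewNumber(120, 1), "block-120")}
    for key, entry in reset_to_lib(entries, "PUB_BLS_example", "block-100", 100).items():
        print(key, entry)
```

Both functions return a new mapping instead of mutating the input, which mirrors the idea that a tool would read the file, compute the new contents, and only then write them back. Because the reset discards the previous lock and vote height, it is exactly the step that trades safety for liveness and should only be run deliberately.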