AntelopeIO / spring

C++ implementation of the Antelope protocol with Savanna consensus
Other
9 stars 5 forks source link

Bugs in how fsi is handled #534

Closed arhag closed 2 months ago

arhag commented 2 months ago

There are a few problems with the way finalizer safety information (fsi) records are handled in the code.

First, there is an implementation issue in which everywhere in the code that sets a block_ref in the fsi is not correctly setting the finality digest (leaving it the default construct empty hash) despite it existing as a record in block_ref. This does not break anything since block_id is currently being used to check to see if a locked block or last voted block is an ancestor of a block being considered for voting. But the data stored to the fsi file would not be correct since block_ref is not constructed properly. To allow the compiler to assist in finding all locations where this happens, a constructor should be added to block_ref that takes all three fields so that the compiler guarantees the appropriate data is provided everywhere a block_ref is set in the fsi.

It may be necessary to keep the concept of an empty block_ref to capture the concept of no last_vote in the fsi. One way of achieving this is to add a default constructor for block_ref that allows creating this appropriate empty representation. A cleaner way (though one that adds an extra byte in the serialization of fsi) is to remove the concept of an empty block_ref and instead change the type of last_vote in the fsi to std::optional<block_ref>.

Second, there was a design flaw related to rule 2 (the rule that determines whether to vote strong vs weak) that needs to be corrected. It is possible for the latest (weak) vote to mask that the vote prior (weak) vote for a block not on the same branch. Then if later an attempt is made at a strong vote on a block with a latest QC claim block timestamp less than that of block the prior weak vote was on, decide_vote may (incorrectly) conclude it can vote strong because the latest vote was on a block that is ancestor of the block it is considering voting on. In reality, a strong vote would violate rule 2 (see https://github.com/AntelopeIO/spring/issues/92#issuecomment-2287315555) since it would commit to an implied time interval that contains the timestamp of the block the prior weak vote was on despite that block not being an ancestor of the block it is considering voting on. Therefore, decide_vote would have to vote weak in that situation.

To correct this masking problem, a slightly stricter check than what is required for rule 2 can be implemented. A new boolean called votes_forked_since_latest_strong_vote should be tracked in the fsi. A default fsi can start with votes_forked_since_latest_strong_vote set to false. And a strong vote should always set votes_forked_since_latest_strong_vote to false. On a weak vote, if votes_forked_since_latest_strong_vote was already true, then it should remain true. The only way votes_forked_since_latest_strong_vote can change from false to true is on a weak vote for a block b where the last_vote references a block that is not an ancestor of b.

In the if (can_vote) block https://github.com/AntelopeIO/spring/blob/7bd7153e4897e333ec1c31b5f62ec3b63c5b9d39/libraries/chain/finality/finalizer.cpp#L67-L70 we still need to decide if the vote will be strong or weak. However, it no longer needs last_vote_range_start to determine that; last_vote_range_start should be removed from the fsi. Instead, it can first compare fsi.last_vote.timestamp <= bsp->core.latest_qc_block_timestamp(). If that condition is true (or if fsi.last_vote is empty), then we immediately know that we can vote strong. If it is not true, then we can only vote strong if votes_forked_since_latest_strong_vote == false and bsp->core.extends(fsi.last_vote.block_id).

These changes to fsi do not require versioning. We still have the opportunity to retroactively change the fsi definition and we should take advantage of that rather than complicating the logic further by adding a new version. We should not bother supporting the old (incorrect) version that still tracked last_vote_range_start.

Finally, there were other small things I noticed that do not necessarily need to be changed as part of this issue, but is worth noting:

  1. We only consider strong QCs and thus strong votes as evidence from the blockchain when updating the fsi during syncing (prior to first vote). I believe we decided to do this because we were only concerned about the monotonicity condition and the safety and liveness conditions related to the lock in the fsi; we had a stricter protection for monotonicity based on using the nodeos startup time to protect against votes on earlier timestamp, and the lock is only updated on strong votes. However, there is useful information from weak votes as well. Weak votes may not update the lock in the fsi, but they do update the last_vote and now (with the changes in this issue) the votes_forked_since_latest_strong_vote boolean as well. Even ignoring the monotonicity concern and assuming we have the most up-to-date lock, a stale value for last_vote or votes_forked_since_latest_strong_vote may incorrectly force decide_vote to allow a strong vote when it should have only allowed a weak vote (a rule 2 violation). We could instead consider all votes found as evidence for updating the fsi (prior to first vote) while syncing. While recognizing that the exact order in which the votes were originally made may not be respected and some subset of votes may be skipped (either because not they were not found or because they were ignored from the monotonicity check due to processing out of order), it is still desirable to update the fsi according to the same logic that is normally followed. That means that if the monotonicity condition is satisfied, the last_vote (and possibly votes_forked_since_latest_strong_vote) should be updated even if the lock is not because the new lock from the vote does not have a better timestamp then the one already in the fsi.
  2. The condition https://github.com/AntelopeIO/spring/blob/7bd7153e4897e333ec1c31b5f62ec3b63c5b9d39/libraries/chain/finality/finalizer.cpp#L16-L18 in decide_vote should never be satisfied (and so the log should never be printed) because if the last_vote is empty, then the montonicity condition should necessarily be satisfied.
  3. The code https://github.com/AntelopeIO/spring/blob/7bd7153e4897e333ec1c31b5f62ec3b63c5b9d39/libraries/chain/finality/finalizer.cpp#L39-L42 is not necessary because if bsp->core.last_final_block_timestamp() >= fsi.lock.timestamp then it is necessarily true that bsp->core.latest_qc_block_timestamp() > fsi.lock.timestamp.
greg7mdp commented 2 months ago

regarging point 3 in "small things I noticed that do not necessarily need to be changed as part of this issue, but is worth noting":

Say the finalizer was offline for a while, and the lock is on an old block. Couldn't we have advanced bsp->core.last_final_block_timestamp() even though fsi.lock.timestamp is still referring to an older timestamp?

arhag commented 2 months ago

Say the finalizer was offline for a while, and the lock is on an old block. Couldn't we have advanced bsp->core.last_final_block_timestamp() even though fsi.lock.timestamp is still referring to an older timestamp?

Yes. But if that happens then for the best head block you necessarily have bsp->core.last_final_block_timestamp() < bsp->core.latest_qc_block_timestamp(). So if fsi.lock.timestamp <= bsp->core.last_final_block_timestamp(), then fsi.lock.timestamp < bsp->core.latest_qc_block_timestamp(), which means the liveness condition is satisfied and that code in the if statement would never run anyway.

greg7mdp commented 2 months ago

Yes. But if that happens then for the best head block you necessarily have bsp->core.last_final_block_timestamp() < bsp->core.latest_qc_block_timestamp(). So if fsi.lock.timestamp <= bsp->core.last_final_block_timestamp(), then fsi.lock.timestamp < bsp->core.latest_qc_block_timestamp(), which means the liveness condition is satisfied and that code in the if statement would never run anyway.

Makes sense, I'll remove these useless lines when I get a chance!