**Open** · Stebalien opened this issue 1 month ago
The actual solution may have multiple phases:
But it looks like any solution will have to involve feedback between GPBFT and EC:

- EC should maybe consider switching chains based on GPBFT.
If I understand correctly, the issue is that F3 participants need to be notified if the longest EC chain differs from what they are finalizing over today. So shouldn't this be the other way around: F3 should maybe consider switching chains based on EC?
What happens today if a node receives a finalized set of blocks that doesn't match the longest EC chain?
> If I understand correctly, the issue is that F3 participants need to be notified if the longest EC chain differs from what they are finalizing over today. So shouldn't this be the other way around: F3 should maybe consider switching chains based on EC?
Both.
> What happens today if a node receives a finalized set of blocks that doesn't match the longest EC chain?
We switch to the F3 finalized chain no matter what.
It was by design that in this case EC should win and GPBFT should simply halt. There was a long discussion, and we decided to prefer EC availability over GPBFT consistency in this case. So if GPBFT does not finalize within 900 epochs, EC takes over again.
That's the current plan but...
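The availability-over-consistency rule discussed above (EC takes over if GPBFT stalls for 900 epochs) could be sketched roughly as follows. This is only an illustration; `useF3Head`, `maxF3Lag`, and the epoch arithmetic are assumptions, not the actual go-f3 API:

```go
package main

import "fmt"

// maxF3Lag mirrors the 900-epoch bound discussed above: if GPBFT has not
// finalized anything for this many epochs, EC availability wins again.
const maxF3Lag = 900

// useF3Head reports whether a node should follow the F3-finalized chain
// (true) or fall back to the heaviest EC chain (false), given the node's
// current EC epoch and the epoch of the latest F3 finality certificate.
func useF3Head(currentEpoch, lastF3Epoch int64) bool {
	return currentEpoch-lastF3Epoch <= maxF3Lag
}

func main() {
	fmt.Println(useF3Head(1000, 500)) // true: F3 finalized recently enough
	fmt.Println(useF3Head(2000, 500)) // false: GPBFT stalled, EC takes over
}
```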
> Ideas from a discussion with @anorth:
>
> Note: I'm mostly concerned about the initial bootstrapping of the network. Once we're under way and have healthy participation, I think we'll be fine. But until then, we could get into a situation where we hover around the power cutoff to reach consensus, which could cause us to get stuck in various phases, leading to these long-range forks.
This is a good point. I suggest the following: if there is a fork of EC while F3 finalizes, reject the F3 finalization (do not apply it to EC) and restart a new F3 instance finalizing the tail of the new chain since the last applicable F3 finalization.

Let me know if you get what I am suggesting; if not, I can describe it in more detail.
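The reject-and-restart suggestion above could look something like this minimal Go sketch. The types and function names (`TipSetKey`, `applyFinalization`) are stand-ins for illustration, not the go-f3 API:

```go
package main

import "fmt"

// TipSetKey stands in for a real tipset identifier; strings keep the sketch
// self-contained.
type TipSetKey string

// onChain reports whether key appears in chain (the node's current EC chain,
// oldest to newest).
func onChain(chain []TipSetKey, key TipSetKey) bool {
	for _, k := range chain {
		if k == key {
			return true
		}
	}
	return false
}

// applyFinalization applies the F3 finalization only if it lands on the
// current EC chain; otherwise it rejects the finalization and reports that a
// new F3 instance should restart from lastApplied, the last F3 finalization
// that was actually applied to EC.
func applyFinalization(chain []TipSetKey, finalized, lastApplied TipSetKey) (bool, TipSetKey) {
	if onChain(chain, finalized) {
		return true, finalized
	}
	return false, lastApplied
}

func main() {
	chain := []TipSetKey{"t0", "t1", "t2b"} // EC has forked away from t2a
	applied, restartFrom := applyFinalization(chain, "t2a", "t1")
	fmt.Println(applied, restartFrom) // rejected; restart from t1
}
```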
Ideas from a discussion with @anorth:

1. We can finalize F3 when we switch to the decide phase instead of waiting for the certificate. Implementing this is a bit complex because it relies on the GPBFT state and not just the certificate store, but it shouldn't be too difficult and doesn't affect the protocol.
2. We can bias base in converge if the fork is too long. This _is_ a protocol change but it _should_ be fine (we'll need to discuss it more). It won't save us if we get stuck in prepare/converge, but it'll reduce the chances of us forking.
3. As discussed in [Set the default EC lookback to 1 #716](https://github.com/filecoin-project/go-f3/issues/716), we can increase the EC lookback, but probably by more than 1.

Note: I'm mostly concerned about the initial bootstrapping of the network. Once we're under way and have healthy participation, I think we'll be fine. But until then, we could get into a situation where we hover around the power cutoff to reach consensus, which could cause us to get stuck in various phases, leading to these long-range forks.
Please avoid protocol changes. This issue has nothing to do with GPBFT as a protocol and would appear in any finalization protocol; hence, the solution is not to be looked for in changing GPBFT.
> If there is a fork of EC while F3 finalizes - reject F3 finalization (do not apply it to EC) and restart new F3 instance finalizing the tail of the new chain since the last applicable F3 finalization.
This said, EC must commit not to fork before the last F3 finalization point; this is in the F3 specification. Otherwise there is no point in calling F3 a finalization protocol...
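That commitment can be expressed as a simple predicate on candidate reorgs. A minimal sketch, assuming illustrative names (`forkAllowed` and the epoch parameters are not real go-f3 identifiers):

```go
package main

import "fmt"

// forkAllowed encodes the commitment above: once F3 has finalized a tipset at
// lastFinalizedEpoch, EC may only accept a candidate chain that diverges
// strictly after that epoch. forkEpoch is the epoch at which the candidate
// chain diverges from the current one.
func forkAllowed(forkEpoch, lastFinalizedEpoch int64) bool {
	return forkEpoch > lastFinalizedEpoch
}

func main() {
	fmt.Println(forkAllowed(120, 100)) // true: fork is after the finalized point
	fmt.Println(forkAllowed(90, 100))  // false: would rewrite finalized history
}
```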
> reject F3 finalization (do not apply it to EC) and restart new F3 instance finalizing the tail of the new chain since the last applicable F3 finalization.
How would that look? At that point, F3/GPBFT has produced a finality certificate that Filecoin rejects due to consensus rules, but that rejection is not observable to consumers of only the certificate chain. Thus, it would require not just a new instance but a new F3 certificate chain/network.
A side note: I ran into an issue where F3 tries to finalize tipsets that are newer than the EC head. To reproduce:
This seems to happen when certexchange receives a certificate that contains a newer EC head while the node is still catching up.
That's expected and something that needs to be handled, unfortunately. On the bright side, it makes syncing easier (you now have a guaranteed sync target).
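The "guaranteed sync target" behaviour might be handled roughly like this. The function and the returned action names are assumptions for illustration, not the actual go-f3 handling:

```go
package main

import "fmt"

// handleCertificate sketches the behaviour described above: a finality
// certificate whose finalized epoch is ahead of our EC head is not an error
// while catching up; it gives the node a guaranteed sync target.
func handleCertificate(certEpoch, headEpoch int64) string {
	if certEpoch > headEpoch {
		return "record-sync-target" // fetch the chain up to certEpoch first
	}
	return "finalize" // the certificate covers tipsets we already have
}

func main() {
	fmt.Println(handleCertificate(500, 450)) // node still catching up
	fmt.Println(handleCertificate(400, 450)) // normal finalization
}
```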
There's a theoretical issue where, if F3 takes a long time to finalize a tipset, it might cause long-range forks in EC. We're alleviating this by:
However, the issue still persists. The core of the issue is that:
To fix this, we likely need some way for F3 to discard the current proposal (if too old) and get a new one from the client. However, this is tricky to implement in the current GPBFT protocol without breaking the liveness guarantees.
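The staleness check described above could be as simple as the following sketch. `proposalTooOld` and `maxAge` are assumed names and an assumed cutoff, not real go-f3 parameters:

```go
package main

import "fmt"

// proposalTooOld reports whether the chain GPBFT is still trying to decide on
// has fallen too far behind the EC head, in which case the client should be
// asked for a fresh proposal instead.
func proposalTooOld(proposalEpoch, ecHeadEpoch, maxAge int64) bool {
	return ecHeadEpoch-proposalEpoch > maxAge
}

func main() {
	fmt.Println(proposalTooOld(100, 130, 20)) // true: 30 epochs behind, discard
	fmt.Println(proposalTooOld(100, 110, 20)) // false: still fresh enough
}
```

The tricky part, as noted above, is not this check itself but doing the swap mid-instance without breaking GPBFT's liveness guarantees.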
There are really two parts to this issue:
However, the catch is that nobody can emit two decide messages for the same instance without potentially breaking GPBFT. But there are also certain decisions that are simply unacceptable.
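The "at most one decide per instance" constraint can be pictured as a small guard on the participant side. A hedged sketch with stand-in types (none of these names come from go-f3):

```go
package main

import "fmt"

// decider sketches the constraint above: a participant may emit at most one
// decide value per GPBFT instance; a second, conflicting decide is refused.
type decider struct {
	decided map[uint64]string // instance number -> decided value
}

func newDecider() *decider {
	return &decider{decided: make(map[uint64]string)}
}

// tryDecide records value for instance, or reports false if a different
// value was already decided for that instance.
func (d *decider) tryDecide(instance uint64, value string) bool {
	if prev, ok := d.decided[instance]; ok {
		return prev == value
	}
	d.decided[instance] = value
	return true
}

func main() {
	d := newDecider()
	fmt.Println(d.tryDecide(7, "tipsetA")) // true: first decision
	fmt.Println(d.tryDecide(7, "tipsetB")) // false: conflicting second decide
	fmt.Println(d.tryDecide(7, "tipsetA")) // true: repeating the same value is harmless
}
```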