The F3 <-> EC fork issue

Stebalien commented 1 month ago

There's a theoretical issue where, if F3 takes a long time to finalize a tipset, it might cause long-range forks in EC. We're alleviating this by:

Never trying to finalize the current head (#716).
Preventing clients (e.g., lotus) from accepting finality certificates that would revert beyond EC finality (https://github.com/filecoin-project/go-f3/issues/717).
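
A minimal sketch of the second mitigation, assuming a client can compute the epoch of the newest tipset its local chain shares with the certificate's chain. The function name, its parameters, and the inclusive boundary are hypothetical and are not the lotus or go-f3 API; the 900-epoch window is the EC finality figure used elsewhere in this thread.

```go
package main

// ecFinalityEpochs is the EC finality window referred to in this thread.
const ecFinalityEpochs = 900

// acceptCertificate reports whether applying a finality certificate would
// keep the revert within EC finality. localHeadEpoch is the epoch of the
// local EC head; commonAncestorEpoch is the epoch of the newest tipset shared
// by the local chain and the certificate's chain. Both are hypothetical inputs.
func acceptCertificate(localHeadEpoch, commonAncestorEpoch int64) bool {
	if commonAncestorEpoch > localHeadEpoch {
		// Inconsistent input: an ancestor cannot be newer than the head.
		return false
	}
	revertDepth := localHeadEpoch - commonAncestorEpoch
	// Assumption: a revert of exactly ecFinalityEpochs is still acceptable.
	return revertDepth <= ecFinalityEpochs
}
```

With these assumptions, `acceptCertificate(1000, 990)` accepts a 10-epoch revert, while `acceptCertificate(2000, 1000)` rejects a 1000-epoch one.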

However, the issue still persists. The core of the issue is that:

F3 gets a chain from EC.
F3 can spend an arbitrary amount of time trying to finalize it.
In the meantime, EC may fork away from the head F3 ends up deciding on.

To fix this, we likely need some way for F3 to discard the current proposal (if too old) and get a new one from the client. However, this is tricky to implement in the current GPBFT protocol without breaking the liveness guarantees.

There are really two parts to this issue:

Reducing the likelihood of long-range (10+ epochs) forks (to avoid breaking client assumptions).
Preventing forks beyond EC finality.

However, the catch is that nobody can emit two decide messages for the same instance without potentially breaking GPBFT. But there are also certain decisions that are simply unacceptable.

Stebalien commented 1 month ago

The actual solution may have multiple phases:

In phase one, we operate normally and try to finalize any valid chain.
In phase two, we try to avoid phase 1 by somehow skewing towards the heavier chain?
In phase three, we go back to trying to finalize any valid chain. We're accepting the fact that we're likely going to have a long-range fork.

But it looks like any solution will have to involve feedback between GPBFT and EC:

GPBFT needs to know when it's taking too long. In that case, we want to decide on base ASAP so we can get a new proposal.
EC should maybe consider switching chains based on GPBFT. E.g., if we see a quorum of quality messages for some prefix, we may want to eagerly switch to that chain because it'll likely be finalized.
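
The "quorum of quality messages for some prefix" check could look roughly like the sketch below. It assumes a simplified vote type, chains represented as tipset keys ordered base first, and a quorum threshold of strictly more than 2/3 of total power. None of these names come from go-f3.

```go
package main

// vote is a hypothetical, simplified QUALITY message: the chain the sender
// proposed (tipset keys, base first) and the sender's power.
type vote struct {
	chain []string
	power int64
}

// hasPrefix reports whether chain starts with prefix.
func hasPrefix(chain, prefix []string) bool {
	if len(prefix) > len(chain) {
		return false
	}
	for i := range prefix {
		if chain[i] != prefix[i] {
			return false
		}
	}
	return true
}

// longestQuorumPrefix returns the longest prefix of candidate supported by
// strictly more than 2/3 of totalPower. A vote supports every prefix of the
// chain it carries. It returns nil if no prefix reaches the threshold.
func longestQuorumPrefix(candidate []string, votes []vote, totalPower int64) []string {
	for n := len(candidate); n > 0; n-- {
		prefix := candidate[:n]
		var support int64
		for _, v := range votes {
			if hasPrefix(v.chain, prefix) {
				support += v.power
			}
		}
		// Integer form of support > (2/3) * totalPower.
		if 3*support > 2*totalPower {
			return prefix
		}
	}
	return nil
}
```

An EC implementation that wanted to switch eagerly could compare the returned prefix against its current heaviest chain and prefer a fork that extends that prefix. Whether doing so is safe is exactly what this issue is debating.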

jennijuju commented 1 month ago

EC should maybe consider switching chains based on GPBFT.

If I understand correctly, the issue is that F3 participants need to be notified if the longest EC chain differs from what they are currently finalizing. So shouldn't this be the other way around: F3 should maybe consider switching chains based on EC?

jennijuju commented 1 month ago

What will happen today if the chain receives a finalized set of blocks that doesn't match the blocks on EC's longest chain?

Stebalien commented 1 month ago

If I understand correctly, the issue is that F3 participants need to be notified if the longest EC chain differs from what they are currently finalizing. So shouldn't this be the other way around: F3 should maybe consider switching chains based on EC?

Both.

1. GPBFT should switch if it's taking too long and trying to finalize something EC isn't building on.
2. EC should try to build on what GPBFT is likely to finalize to reduce the chances of (1) being an issue (and to increase the chances of building on the right chain).
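
As a rough illustration of the GPBFT-side condition ("taking too long" and "not what EC is building on"), here is a sketch under stated assumptions: the struct, the epoch-to-key map of EC's current heaviest chain, and the `maxAgeEpochs` threshold are all hypothetical, and null epochs are ignored. It only expresses the trigger; how GPBFT would act on it without breaking liveness is the open question.

```go
package main

// proposalState is a hypothetical snapshot of what a GPBFT instance is
// currently trying to finalize.
type proposalState struct {
	headKey   string // key of the tipset at the head of the proposal
	headEpoch int64  // epoch of that tipset
}

// shouldSeekNewProposal reports whether the proposal is both older than
// maxAgeEpochs relative to the EC head and absent from EC's current heaviest
// chain. ecCanonical maps epoch to the tipset key on that chain.
func shouldSeekNewProposal(p proposalState, ecCanonical map[int64]string, ecHeadEpoch, maxAgeEpochs int64) bool {
	tooOld := ecHeadEpoch-p.headEpoch > maxAgeEpochs
	onCanonical := ecCanonical[p.headEpoch] == p.headKey
	return tooOld && !onCanonical
}
```

A slow instance whose proposal is still on EC's heaviest chain returns false here, since finalizing it would not cause a fork.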

Stebalien commented 1 month ago

What will happen today if the chain receives a finalized set of blocks that doesn't match the blocks on EC's longest chain?

We switch to the F3 finalized chain no matter what.

vukolic commented 1 month ago

It was by design that in this case EC should win and GPBFT should simply halt. There was a long discussion and we decided to prefer EC availability over GPBFT consistency in this case. So if GPBFT does not finalize in 900 epochs then EC takes over again.

Stebalien commented 1 month ago

That's the current plan but...

Forks shorter than 900 epochs are still an issue.
If we have a network incident, we don't want to have to worry about breaking F3. This would especially be an issue once we get trustless bridges.

Stebalien commented 1 month ago

Ideas from a discussion with @anorth:

1. We can finalize F3 when we switch to the decide phase instead of waiting for the certificate. Implementing this is a bit complex because it relies on the GPBFT state and not just the certificate store, but it shouldn't be too difficult and doesn't affect the protocol.
2. We can bias base in converge if the fork is too long. This is a protocol change but it should be fine (we'll need to discuss it more). It won't save us if we get stuck in prepare/converge, but it'll reduce the chances of us forking.
3. As discussed in #716, we can increase the EC lookback. But, probably by more than 1.
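
To make the EC lookback idea concrete, the sketch below trims the newest tipsets from the chain EC hands to F3. It assumes the chain is a slice of tipset keys ordered base first and that the base is always kept. The function name is hypothetical and this is not how go-f3 implements it.

```go
package main

// applyLookback returns the chain F3 should propose: the EC chain (tipset
// keys, base first) with the newest `lookback` tipsets dropped. The base is
// always kept, so the result is never empty for a non-empty chain.
func applyLookback(chain []string, lookback int) []string {
	if len(chain) == 0 {
		return nil
	}
	if lookback < 0 {
		lookback = 0
	}
	keep := len(chain) - lookback
	if keep < 1 {
		keep = 1
	}
	return chain[:keep]
}
```

A lookback of 1 corresponds to never proposing the current head. A larger value trades finalization latency for a lower chance that EC reorganizes away from the proposed head.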

Note: I'm mostly concerned about the initial bootstrapping of the network. Once we're under way and have healthy participation, I think we'll be fine. But until then, we could get into a situation where we hover around the power cutoff to reach consensus which could cause us to get stuck in various phases, leading to these long-range forks.

vukolic commented 1 month ago

This is a good point. I suggest the following:

If EC forks while F3 is finalizing, reject the F3 finalization (do not apply it to EC) and start a new F3 instance that finalizes the tail of the new chain since the last applicable F3 finalization.

Let me know if this is clear; if not, I can describe it in more detail.

vukolic commented 1 month ago

Ideas from a discussion with @anorth:
1. We can finalize F3 when we switch to the decide phase instead of waiting for the certificate. Implementing this is a bit complex because it relies on the GPBFT state and not just the certificate store, but it shouldn't be too difficult and doesn't affect the protocol.

2. We can bias base in converge if the fork is too long. This _is_ a protocol change but it _should_ be fine (we'll need to discuss it more). It won't save us if we get stuck in prepare/converge, but it'll reduce the chances of us forking.

3. As discussed in [Set the default EC lookback to 1 #716](https://github.com/filecoin-project/go-f3/issues/716), we can increase the EC lookback. But, probably by more than 1.

Note: I'm mostly concerned about the initial bootstrapping of the network. Once we're under way and have healthy participation, I think we'll be fine. But until then, we could get into a situation where we hover around the power cutoff to reach consensus which could cause us to get stuck in various phases, leading to these long-range forks.

Please avoid protocol changes. This issue has nothing to do with GPBFT as a protocol and would appear in any finalization protocol. Hence, the solution should not be sought in changes to GPBFT.

vukolic commented 1 month ago

If EC forks while F3 is finalizing, reject the F3 finalization (do not apply it to EC) and start a new F3 instance that finalizes the tail of the new chain since the last applicable F3 finalization.

That said, EC must commit not to fork before the last F3 finalization point; this is in the F3 specification. Otherwise, there is no point in calling F3 a finalization protocol.

Kubuxu commented 1 month ago

reject the F3 finalization (do not apply it to EC) and start a new F3 instance that finalizes the tail of the new chain since the last applicable F3 finalization.

How would that look? At that point, F3/GPBFT has produced a finality certificate that Filecoin rejects due to consensus rules, but that rejection is not observable to consumers of only the certificate chain. Thus, it would require not a new instance but a new F3 certificate chain/network.

hanabi1224 commented 1 month ago

A side note: I ran into an issue where F3 tries to finalize tipsets that are newer than the EC head. To reproduce:

run Forest with F3 sidecar
sleep the machine
wake up after a few hours

This seems to happen when certexchange receives a certificate that contains a newer EC head while the node is still catching up.

Stebalien commented 1 month ago

That's expected and something that needs to be handled, unfortunately. On the bright side, it makes syncing easier (you now have a guaranteed sync target).
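
One way a client might handle this case, sketched under the assumption that it can compare the epoch of the certificate's head with the epoch of its local EC head. The type and function names are hypothetical and do not reflect Forest, lotus, or go-f3 code.

```go
package main

// certAction is a hypothetical description of what to do with a certificate.
type certAction int

const (
	// applyNow: the finalized head is at or behind the local EC head, so the
	// certificate can be applied to the local chain directly.
	applyNow certAction = iota
	// syncToTarget: the finalized head is ahead of the local EC head, so the
	// node should sync towards it before applying the certificate.
	syncToTarget
)

// onCertificate decides how to treat a certificate based only on epochs.
func onCertificate(certHeadEpoch, localHeadEpoch int64) certAction {
	if certHeadEpoch > localHeadEpoch {
		return syncToTarget
	}
	return applyNow
}
```

Treating the certificate head as a sync target rather than an error matches the point above: a finalized tipset ahead of the local head tells the node which chain it must end up on.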

filecoin-project / go-f3

The F3 <-> EC fork issue #718