Write Ahead Log & Rebroadcast

Stebalien commented 6 days ago

We've implemented limited rebroadcasting to help lagging/restarting nodes catch up. However, if 66%+ of the network crashes after starting an instance but before sending a single decide message, the network could decide on two different values for the same instance.

A simple solution here is write-ahead logging. That is:

1. Before sending any message (maybe limit it to commit messages?), log/sync the message to disk. To save space, we could just record a single message template.
2. On restart, re-load all (or maybe the last round? commits only?) messages from the last instance started but with no decision.
3. Rebroadcast those messages and resume from that point.

Of course, nothing will help if the actual disks die. But this will at least help us recover in case someone finds a way to crash the entire network all at once.

The specific attack I'm worried about is as follows:

1. An attacker listens for commit messages and waits until they see a quorum (enough to reach a "decision").
2. The attacker checks to see if it knows of a better tipset at the same height (more weight). E.g., the attacker may choose to withhold a block to make this happen.
3. The attacker then uses some previously unknown exploit to crash all lotus nodes.
4. The attacker submits a certificate for the "forgotten" decision to some bridge.
5. The network restarts/resumes.
6. The network agrees on a different value.
7. The bridge is now borked.

Stebalien commented 6 days ago

An alternative is to wait an instance. That is, always consider the latest finality certificate as "pending" until one has been built on top of it. We can do this safely due to the power table lookback. The network would have to be willing to "switch" decisions while the latest instance is still pending.

This lookback won't be completely transparent to the client, but shouldn't be that hard to implement....
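
To make the "pending" idea concrete, here is a rough sketch of the bookkeeping a client might do. The `Cert` and `CertStore` types are hypothetical and not taken from go-f3. The sketch assumes certificates arrive already verified and carry only an instance number and a value, ignoring signatures and the power table.

```go
package main

import "fmt"

// Cert is a hypothetical, stripped-down finality certificate.
type Cert struct {
	Instance uint64
	Value    string
}

// CertStore treats the newest certificate as pending until a certificate
// for the next instance arrives.
type CertStore struct {
	final   []Cert
	pending *Cert
}

// Add records a certificate. A certificate for the pending instance replaces
// the pending one (the "switch"); a certificate for the next instance
// finalizes the pending one and becomes the new pending certificate.
func (s *CertStore) Add(c Cert) error {
	if s.pending == nil {
		s.pending = &c
		return nil
	}
	switch c.Instance {
	case s.pending.Instance:
		s.pending = &c
	case s.pending.Instance + 1:
		s.final = append(s.final, *s.pending)
		s.pending = &c
	default:
		return fmt.Errorf("unexpected instance %d, pending is %d", c.Instance, s.pending.Instance)
	}
	return nil
}

// Final returns only the certificates that have been built upon.
func (s *CertStore) Final() []Cert { return s.final }

func main() {
	s := &CertStore{}
	_ = s.Add(Cert{Instance: 5, Value: "A"})
	_ = s.Add(Cert{Instance: 5, Value: "B"}) // network switched its decision
	_ = s.Add(Cert{Instance: 6, Value: "C"}) // instance 5 is now final
	fmt.Println(s.Final())
}
```

A consumer such as a bridge would read only `Final()`, so a certificate that the network later replaces never becomes visible to it.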

Kubuxu commented 3 days ago

We discussed this in person. The alternative is not a good solution because we don't have a hash link (and even the existence of that additional finality certificate has serious consequences).

filecoin-project / go-f3

Write Ahead Log & Rebroadcast #392