filecoin-project / go-f3

Golang implementation of Fast Finality for Filecoin (F3)
Apache License 2.0

Mainnet bootstrap strategy, how to get the power table? #596

Closed Kubuxu closed 1 month ago

Kubuxu commented 2 months ago

Option 1: Save the power table in a new field in the PowerTable actor during migration.

Option 2: Bootstrap from chain lookback, oh-shit-store, initial power table CID snapshots; in the first update after the upgrade, Lotus includes the initial power table CID in the binary.

### Tasks
- [ ] Enumerate the options (with pros/cons or in a decision table)
- [ ] Get decision made with stakeholders
- [ ] Update FIP with decision
- [ ] Lotus implementation work
- [ ] Forest implementation work

Stebalien commented 2 months ago

So... I still want to do option 2, but option 1 is nice because it requires no coordination and it doesn't preclude option 2. It does require a small FIP update, but I don't expect it'll be that controversial.

The issue with option 2 is that the CAR "roots" are currently expected to be a tipset. Ideally, we'd have a single root metadata object pointing to the chain and whatever else we want, but... that's not what we have right now.

Stebalien commented 2 months ago

So, I'd say go with option 1 and punt option 2 into the future.

Stebalien commented 2 months ago

Proposal:

  1. Add a new Option<Cid> field to the power actor that stores the power table, post bootstrap.
  2. When the network version bumps to v25, record the power table in the cron tick. We're using a network version so we can disable F3 by disabling a migration.
  3. Expose this power table via some F3InitialPowerTable -> Option<Cid> function.
  4. Change the bootstrap logic: at every epoch, look back by finality (900 epochs) and call F3InitialPowerTable. If it returns a power table, bootstrap F3 with that power table.
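
A minimal Go sketch of step 4, under stated assumptions: `stateReader`, `bootstrapCandidate`, and `fakeState` are hypothetical names invented for illustration, CIDs are plain strings, and `F3InitialPowerTable` is modelled only by the `Option<Cid>`-like shape described in step 3. It is not the Lotus or go-f3 API.

```go
package main

import "fmt"

// finalityEpochs is the lookback named in the proposal (900 epochs).
const finalityEpochs = 900

// stateReader is a hypothetical stand-in for the node's state API.
type stateReader interface {
	// F3InitialPowerTable mirrors the accessor proposed above: it reports the
	// power table CID recorded in the power actor as of the given epoch, if any.
	F3InitialPowerTable(epoch int64) (cid string, ok bool)
}

// bootstrapCandidate looks back by finality from the current head and asks
// whether a power table had been recorded by then. It returns ok=false until
// the recording epoch is at least finalityEpochs behind the head.
func bootstrapCandidate(s stateReader, head int64) (string, bool) {
	lookback := head - finalityEpochs
	if lookback < 0 {
		return "", false
	}
	return s.F3InitialPowerTable(lookback)
}

// fakeState pretends the cron tick recorded the power table at recordedAt.
type fakeState struct {
	recordedAt int64
	cid        string
}

func (f fakeState) F3InitialPowerTable(epoch int64) (string, bool) {
	if epoch < f.recordedAt {
		return "", false
	}
	return f.cid, true
}

func main() {
	s := fakeState{recordedAt: 1000, cid: "example-power-table-cid"}
	for _, head := range []int64{1000, 1899, 1900} {
		cid, ok := bootstrapCandidate(s, head)
		fmt.Printf("head=%d bootstrap=%v cid=%q\n", head, ok, cid)
	}
}
```

Because the check reads state from a finalized epoch, every node that runs it at the same head should reach the same decision about whether to bootstrap.
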

Stebalien commented 2 months ago

Note: the alternative is to do this in the migration itself. However, I'd like to:

  1. Do some mainnet testing with the final-final version before launching F3.
  2. Avoid 2 state migrations.

Stebalien commented 2 months ago

Ah, so, we need all the worker keys. This is best done through a migration of some form, unfortunately.

Stebalien commented 2 months ago

Ok, discussed with @jennijuju: we can do two migrations but avoid migrating the actor code in the second migration. Instead, the second migration will just create the power table and attach it to the power actor.

rjan90 commented 1 month ago

Some open questions for option 2: how do we write the migration? Is there a need to create an nv-skeleton in Lotus/GST/Filecoin-FFI? Will it be similar to the Lightning/Thunder upgrade?

We should also give Forest an early heads-up on our strategy here, so that they can prep for this migration.

BigLep commented 1 month ago

Additional 2024-09-11 conversation:

I added these tasks to the issue description:

Please update/correct where wrong or outdated.

Stebalien commented 1 month ago

We discussed the migration option in standup. Unfortunately, Forest would have to implement the migration as well, and the migration will likely be non-standard because, to keep it small, we don't want to bump the actors version. We can still do that, but we need to discuss it with them.

We also discussed some alternatives:

  1. When a user syncs from a snapshot after the F3 bootstrap epoch (syncs from a snapshot that doesn't include the power table from the bootstrap epoch), either (a) require that they use a new client version that hard-codes the bootstrap power table CID or (b) require that they pass said power table CID on the command line when importing a snapshot. The downside of this approach is that it requires user intervention and could cause issues for automated deployment setups.
  2. We could just add the CID to snapshots (e.g., stuff it in the CAR header). From what I can tell, this isn't too terrible, but it's kind of an abuse of the CAR format. This will require cooperation from ChainSafe (they produce the snapshots) but the effort should be minimal.
  3. We could preserve the F3 bootstrap state-tree (both in the datastore and in the snapshot). This will require some work, but not too much work. However, this will keep extra state around, which will grow the datastore a bit.

Stebalien commented 1 month ago

I've discussed this with the F3 team and @jennijuju and it sounds like option 1 isn't so bad after all.

We'd have two releases:

  1. Release A: Before the network upgrade.
  2. Release B: Immediately after the network upgrade.

Release A will have (a) an environment variable to specify the F3 bootstrap power table CID, (b) the ability to specify it when importing a snapshot, and (c) the ability to import snapshots without specifying the variable (?) (we'll have to assess the risk of this, as the peer won't be able to participate in F3).

Release B will be identical to release A except the bootstrap power table CID will be set.

We'll need to coordinate with Forest/Venus to make sure this works for them.
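
A hedged Go sketch of how a client might pick the bootstrap power table CID across the two releases. Everything named here is an assumption for illustration: the environment variable name `F3_BOOTSTRAP_POWER_TABLE_CID`, the idea of an import-time value taking precedence over the environment, and the `builtin` default that Release B would fill in. The thread does not specify names or precedence.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// envVar is a hypothetical variable name; the thread does not name one.
const envVar = "F3_BOOTSTRAP_POWER_TABLE_CID"

// builtinCID would be empty in "Release A" and set in "Release B".
var builtinCID = ""

// errNoPowerTable signals that the node can still import a snapshot but
// cannot participate in F3, matching case (c) in the comment above.
var errNoPowerTable = errors.New("no F3 bootstrap power table CID configured")

// resolveBootstrapCID applies an assumed precedence:
// import-time value, then environment variable, then built-in default.
func resolveBootstrapCID(importValue string, getenv func(string) string, builtin string) (string, error) {
	if importValue != "" {
		return importValue, nil
	}
	if v := getenv(envVar); v != "" {
		return v, nil
	}
	if builtin != "" {
		return builtin, nil
	}
	return "", errNoPowerTable
}

func main() {
	cid, err := resolveBootstrapCID("", os.Getenv, builtinCID)
	if err != nil {
		fmt.Println("continuing without F3:", err)
		return
	}
	fmt.Println("bootstrapping F3 with power table", cid)
}
```

Passing `getenv` as a parameter keeps the lookup testable without touching the process environment.
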

Stebalien commented 1 month ago

While writing this up, I did have another thought... technically, we can start late and our certificate store even supports this (technically). To bootstrap, we:

  1. Fetch the earliest finality certificate signed by a power table we have.
  2. Validate that finality certificate.
  3. Start from there.

ruseinov commented 1 month ago

> - Fetch the earliest finality certificate signed by a power table we have.
> - Validate that finality certificate.
> - Start from there.

Correct me if I understand this wrong: we're looking for the earliest cert signed by the current PT and then just verify all the subsequent certificates until the bootstrap is finished.

Stebalien commented 1 month ago

> Correct me if I understand this wrong: we're looking for the earliest cert signed by the current PT and then just verify all the subsequent certificates until the bootstrap is finished.

Basically?

By bootstrap, I mean the F3 bootstrap (network-wide, not local to the current client). The issue here is that the F3 bootstrap epoch may be far enough in the past such that we no longer have a power table.

We have a bit of a lookback for the power table so it's a little more complex but the lookback is at most 990 epochs (+/-). So:

  1. I fetch the latest certificate (no verification yet). Call it cert A.
  2. I fetch the certificate 10 before that (still no verification). Call it cert B. The power table committed in the "head" of the chain finalized in this certificate should be the power table used to verify cert A.
  3. I load the power table from the head tipset referenced by cert B from my state (snapshot).
  4. Then I validate cert A with this power table.

All this tells me is that some 2/3rds of the power within the last 990 epochs claim that cert A is correct. But that should be good enough for our purposes here.
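
The four steps above can be sketched in Go as follows. This is an illustration under loud assumptions, not go-f3's certificate code: `cert`, `powerTable`, `hasQuorum`, and `lateBootstrap` are hypothetical, certificates and local state are plain maps, signature verification is elided entirely, and "more than two thirds of total power" is an assumed threshold standing in for the real quorum rule.

```go
package main

import (
	"errors"
	"fmt"
)

// powerTableLookback is the "10 certificates back" distance from the steps above.
const powerTableLookback = 10

// cert is a simplified finality certificate (hypothetical shape).
type cert struct {
	Instance uint64
	Head     string   // identifier of the head tipset finalized by this certificate
	Signers  []string // participants whose signatures the certificate carries
}

// powerTable maps a participant ID to its power.
type powerTable map[string]int64

// hasQuorum reports whether the distinct signers hold more than 2/3 of the
// total power. Real validation would also verify the signatures themselves.
func hasQuorum(pt powerTable, signers []string) bool {
	var total, signed int64
	for _, p := range pt {
		total += p
	}
	seen := make(map[string]bool)
	for _, s := range signers {
		if !seen[s] {
			seen[s] = true
			signed += pt[s] // unknown signers contribute zero power
		}
	}
	return total > 0 && 3*signed > 2*total
}

// lateBootstrap returns the latest certificate if it validates against the
// power table committed at the head of the certificate 10 instances earlier.
func lateBootstrap(certs map[uint64]cert, latest uint64, tables map[string]powerTable) (cert, error) {
	a, ok := certs[latest] // step 1: cert A, not yet verified
	if !ok {
		return cert{}, errors.New("latest certificate not found")
	}
	if latest < powerTableLookback {
		return cert{}, errors.New("not enough certificates for the lookback")
	}
	b, ok := certs[latest-powerTableLookback] // step 2: cert B, not yet verified
	if !ok {
		return cert{}, errors.New("lookback certificate not found")
	}
	pt, ok := tables[b.Head] // step 3: power table from local state (snapshot)
	if !ok {
		return cert{}, errors.New("power table for cert B's head is not in local state")
	}
	if !hasQuorum(pt, a.Signers) { // step 4: validate cert A
		return cert{}, errors.New("cert A is not backed by a quorum of the power table")
	}
	return a, nil
}

func main() {
	certs := map[uint64]cert{
		20: {Instance: 20, Head: "tipset-20"},
		30: {Instance: 30, Head: "tipset-30", Signers: []string{"alice", "bob"}},
	}
	tables := map[string]powerTable{
		"tipset-20": {"alice": 40, "bob": 30, "carol": 30},
	}
	a, err := lateBootstrap(certs, 30, tables)
	if err != nil {
		fmt.Println("cannot start late:", err)
		return
	}
	fmt.Println("starting F3 from instance", a.Instance)
}
```

Note that cert B is only used as a pointer into local state and is never itself validated, which is why the result is a statement about recent power rather than a proof back to the bootstrap epoch.
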

ruseinov commented 1 month ago

Right, that makes sense. And an alternative would be to store the power table as part of the first migration to then be able to do the above without lookbacks, correct? The lookback approach seems good to me as long as it goes smoothly. And if it does not, there's always another try, assuming that if F3 is somehow broken, we just fall back to normal EC.

Stebalien commented 1 month ago

> Right, that makes sense. And an alternative would be to store the power table as part of the first migration to then be able to do the above without lookbacks, correct?

Yes. The issue is getting that power table when restoring from snapshot without messing with the snapshot format this late in the game.

We considered snapshotting the power-table on-chain and storing it there, but we'd rather not touch the chain (that and it would have been two migrations in a row).

ruseinov commented 1 month ago

> Yes. The issue is getting that power table when restoring from snapshot without messing with the snapshot format this late in the game.

Well, afaik we will still need to accommodate certificates when it comes to snapshots, but indeed if it's possible to do without - much better.

Stebalien commented 1 month ago

> Well, afaik we will still need to accommodate certificates when it comes to snapshots, but indeed if it's possible to do without - much better.

Yeah, I'd like to eventually ship certificates in snapshots but it's a bit late to try to ship that before the release.