Non traumatic major XS upgrades

mhofman commented 1 year ago

What is the Problem Being Solved?

https://github.com/Agoric/agoric-sdk/issues/6361 describes the conditions under which we can upgrade XS in a chain upgrade and have all vats use that version of XS going forward. The main expectation is that snapshots are at least compatible: the new version of XS can load a snapshot produced by the old version of XS, and keep executing as previously recorded.

The main problem is incompatible snapshots, such as when a major or minor version update of XS occurs, or when new globals are implemented by XS. All the other requirements are believed to hold already: the execution as seen by the transcript in newer versions of XS will be the same as what was recorded in the previous version.

https://github.com/Agoric/agoric-sdk/issues/6361 references using multiple versions of XS (further defined in https://github.com/Agoric/agoric-sdk/issues/6596) and performing vat upgrades to switch vats to the newer version. This issue explores an alternative that introduces no upgrade trauma and does not require distributing multiple versions of XS.

Pre-requisite knowledge on the current implementation

While liveslots's implementation is still revealing organic gc, in https://github.com/Agoric/agoric-sdk/pull/7498 (and its follow-up https://github.com/Agoric/agoric-sdk/pull/7552), we've basically hidden organic gc from liveslots. In https://github.com/Agoric/agoric-sdk/pull/7558 we make sure that the effects of snapshots (which perform a full forced gc) are not observable in transcripts after the snapshot is taken. We believe that together, these changes make our vat transcripts fully independent of any differences in engine allocation behavior.

In https://github.com/Agoric/agoric-sdk/pull/7484 we introduced transcript entries that capture snapshot information (hashes) in the transcript. This makes the transcript somewhat dependent on the version of XS, but these are not actual deliveries, so they can be handled.

There is still the possibility that metering limits would cause a single crank to fail where it previously succeeded, but that is currently unlikely.

With the introduction of state sync (https://github.com/Agoric/agoric-sdk/pull/7225), validators may not have the full transcript content of previous spans (between the latest incarnation start and the latest snapshot taken). However, the hashes of previous spans are kept in the swing-store to support repopulating these historical transcript entries.
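To illustrate how a historical span retrieved out of band could be checked against the hash kept in the swing-store, here is a minimal sketch. The rolling-hash construction (SHA-256 over the previous digest plus the next entry) and the names `rolling_hash` and `verify_span` are assumptions made for illustration; the actual swing-store hashing scheme may differ.

```python
import hashlib


def rolling_hash(entries, initial=""):
    """Fold transcript entries into one hex digest (hypothetical scheme)."""
    current = initial
    for entry in entries:
        h = hashlib.sha256()
        h.update(current.encode("utf-8"))  # digest of everything seen so far
        h.update(b"\n")
        h.update(entry.encode("utf-8"))    # the next serialized entry
        current = h.hexdigest()
    return current


def verify_span(entries, expected_hash):
    """Return True when the retrieved entries match the stored span hash."""
    return rolling_hash(entries) == expected_hash
```

A validator that only kept the span hash could then accept or reject entries fetched from an archive node by calling `verify_span(entries, stored_hash)`.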

Newer versions of XS may introduce new intrinsics. In general these new intrinsics should not impact code execution, however our current SES version is sensitive to new well-known symbols (https://github.com/endojs/endo/issues/1577), and thus would fail on new XS versions that add any unsupported symbols.

Description of the Design

The general idea is to rely on vat transcript replays to regenerate the snapshots and transcript span hashes.

We believe that validators are OK with an upgrade taking some reasonable amount of time to complete (on the order of multiple minutes, likely less than an hour). As such we may be able to perform at least part of this vat transcript replay during the upgrade, but we likely want to streamline the process by making it possible to preprocess some of the replay task.

Replay and regeneration of transcript

The regeneration process would be roughly as follows:

- Remove snapStore entries for the latest vat incarnation (optional)
- Empty the hashes for the transcript spans of the latest vat incarnation
- Start replaying using the new XS version from the first transcript span with an empty span hash
  - an empty hash indicates a span that has not yet been replayed in the newer version
- When reaching a snapshot save entry, regenerate new snapshot
  - update snapStore hash and save snapshot
  - remove the previous span's snapshot data in the snapStore
  - update save transcript entry with new snapshot hash
  - update next span's transcript load entry with new snapshot hash
- When reaching end of span, save new rolling transcript hash in the span table
  - this span has now successfully completed its replay
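The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Span` record, the entry shapes (`deliver`, `save-snapshot`, `load-snapshot`), the `ToyWorker` stand-in for a worker on the new XS version, and the hashing scheme are all hypothetical and do not reflect the real swing-store or xsnap APIs.

```python
import hashlib
import json
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Span:
    entries: list   # dicts with a "type" key; the entry shapes are hypothetical
    hash: str = ""  # "" marks a span not yet replayed on the new XS version


class ToyWorker:
    """Stand-in for a vat worker running the new XS version (hypothetical)."""

    def __init__(self):
        self.state = []

    def load(self, data: bytes):
        self.state = json.loads(data.decode("utf-8"))

    def deliver(self, entry):
        self.state.append(entry["body"])

    def snapshot(self) -> bytes:
        return json.dumps(self.state).encode("utf-8")


def span_hash(entries) -> str:
    """Rolling hash over serialized entries (assumed scheme)."""
    current = ""
    for entry in entries:
        line = json.dumps(entry, sort_keys=True)
        current = sha256_hex((current + "\n" + line).encode("utf-8"))
    return current


def regenerate(spans, snap_store, make_worker):
    """Replay every span whose hash is empty, rewriting snapshot hashes in place."""
    worker = None
    previous_snapshot = None
    for index, span in enumerate(spans):
        if span.hash:
            continue  # already replayed on the new version
        if worker is None:
            worker = make_worker()
            if span.entries and span.entries[0]["type"] == "load-snapshot":
                # Resume from the snapshot regenerated by the previous span.
                previous_snapshot = span.entries[0]["hash"]
                worker.load(snap_store[previous_snapshot])
        for entry in span.entries:
            if entry["type"] == "deliver":
                worker.deliver(entry)
            elif entry["type"] == "save-snapshot":
                data = worker.snapshot()
                new_hash = sha256_hex(data)
                snap_store[new_hash] = data
                if previous_snapshot not in (None, new_hash):
                    snap_store.pop(previous_snapshot, None)
                previous_snapshot = new_hash
                entry["hash"] = new_hash
                if index + 1 < len(spans) and spans[index + 1].entries:
                    next_first = spans[index + 1].entries[0]
                    if next_first["type"] == "load-snapshot":
                        next_first["hash"] = new_hash
        # Only a span ending with a snapshot save is complete; the current
        # (still growing) span keeps an empty hash.
        if span.entries and span.entries[-1]["type"] == "save-snapshot":
            span.hash = span_hash(span.entries)
```

Because a completed span is marked by its non-empty hash, calling `regenerate` again after an interruption skips the spans already replayed and resumes from the most recently regenerated snapshot.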

Offline pre-processing

This rough process allows partial replays of transcripts that can later be resumed. If applied as a pseudo-diff, it also allows the transcript to keep growing after being exported for offline processing:

- An export of the swingstore (like the one used for state-sync or a future genesis export) is used to capture the artifacts and export data related to the latest incarnation of every vat.
  - If artifacts for historical transcript spans are missing, they can be retrieved from an archive node out of band
  - They are verifiable through the exported transcript span hashes
  - The snapstore (or bundle) artifacts do not need to be exported
  - The "export data" not related to validation of transcript data is not needed
- The offline tool keeps track of:
  - updates to the transcript entries (namely load and save snapshot entries)
  - new span hashes being generated
  - in the offline tool, a new span hash must not be generated/recorded for the last/"current" span
  - new snapshot hashes and data
- At upgrade, we perform the following:
  - Empty the hashes for the transcript spans of the latest vat incarnation
  - Look up whether any offline data exists for that vat incarnation, and apply the pseudo-diff
  - Proceed with the regeneration replay process, starting at the first span with an empty hash
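The upgrade-time steps above could look roughly like the following sketch. The `OfflineDiff` shape, the dict-based span records, and the function names are hypothetical and chosen only to illustrate the pseudo-diff idea; they are not an existing agoric-sdk API.

```python
from dataclasses import dataclass, field


@dataclass
class OfflineDiff:
    """Offline replay results for one vat incarnation (hypothetical shape)."""
    incarnation: int
    # (span index, entry index) -> regenerated snapshot hash
    entry_updates: dict = field(default_factory=dict)
    # span index -> regenerated span hash (never includes the "current" span)
    span_hashes: dict = field(default_factory=dict)
    # snapshot hash -> snapshot bytes
    snapshots: dict = field(default_factory=dict)


def first_unreplayed(spans):
    """Index of the first span with an empty hash, or None if all are done."""
    for index, span in enumerate(spans):
        if not span["hash"]:
            return index
    return None


def prepare_upgrade(spans, snap_store, incarnation, diff=None):
    """Empty span hashes, apply offline results if any, return the resume point."""
    for span in spans:
        span["hash"] = ""
    if diff is not None and diff.incarnation == incarnation:
        for (span_index, entry_index), new_hash in diff.entry_updates.items():
            spans[span_index]["entries"][entry_index]["hash"] = new_hash
        for span_index, new_hash in diff.span_hashes.items():
            spans[span_index]["hash"] = new_hash
        snap_store.update(diff.snapshots)
    return first_unreplayed(spans)
```

Since the offline tool never records a hash for the span that was current at export time, any entries appended to that span after the export are replayed at upgrade, along with any spans created later.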

Other replay considerations

To mitigate XS changes that impact execution, it may be possible to change the lockdown or supervisor bundles used when replaying the vat (see #6929 for validation of new XS versions).

- These new bundles are meant to fix compatibility, not to introduce new features. They must not cause behavior that diverges from the recorded transcript.
- These new bundles should be reflected in the vat transcript.

Security Considerations

All validators should perform these steps independently. If they share the "offline" data with each other, the chain is vulnerable to corruption. This is not too much of a concern as this process is verifiable.

Since the recomputed hashes would be captured in the swingstore export to the cosmos DB, a supermajority of validators must arrive at identical replay results for the upgrade to succeed.
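As a toy illustration of that agreement requirement, the sketch below tallies voting power per resulting hash and checks for strictly more than two thirds of the total. In practice this agreement emerges from normal block consensus rather than from a separate tally; the function name and the input shapes are assumptions for illustration only.

```python
from collections import defaultdict


def agreed_result(results, voting_power):
    """Return the hash backed by more than 2/3 of total voting power, else None.

    results: validator id -> hash computed by that validator's replay
    voting_power: validator id -> integer voting power (all validators)
    """
    total = sum(voting_power.values())
    tally = defaultdict(int)
    for validator, result_hash in results.items():
        tally[result_hash] += voting_power.get(validator, 0)
    for result_hash, power in tally.items():
        # Integer comparison avoids floating point rounding at the threshold.
        if 3 * power > 2 * total:
            return result_hash
    return None
```

Validators that have not reported a result still count toward the total, so a replay that is too slow on too many validators blocks the upgrade just as a divergent result would.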

Scaling Considerations

The replay of multiple vats can be performed in parallel to speed up the restart process.

The offline partial pre-processing reduces the time needed to replay during the actual upgrade.
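A minimal sketch of replaying vats in parallel, assuming each replay is independent and spends most of its time waiting on an external worker process, so a thread pool is enough. `replay_all` and the `replay_vat` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def replay_all(vat_ids, replay_vat, max_workers=4):
    """Run replay_vat(vat_id) for every vat concurrently.

    Returns {vat_id: result}. An exception raised by any replay propagates
    to the caller, so a single failed vat fails the whole restart.
    """
    vat_ids = list(vat_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as vat_ids.
        results = list(pool.map(replay_vat, vat_ids))
    return dict(zip(vat_ids, results))
```

Propagating failures matches the all-or-nothing nature of the upgrade: every vat has to replay successfully before the node can proceed.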

Test Plan

TBD, but likely using the Docker-based upgrade testing framework, verifying various scenarios such as offline processing capturing partial (older) vat transcripts, or a vat being upgraded after the capture is made.

FUDCo commented 1 year ago

Minor nit:

we've basically hidden organic gc from liveslots

That's not quite right. Liveslots sees organic gc but then does various things to hide it from user code.

mhofman commented 1 year ago

Nope, liveslots no longer sees organic GC because we couldn't trust liveslots to correctly hide organic gc impacts from the kernel (in which syscalls are made). We have always trusted liveslots to hide all gc (organic or forced) from user code.

FUDCo commented 1 year ago

Then what are those uses of WeakRef and FinalizationRegistry in the liveslots package doing?

mhofman commented 1 year ago

They are only cleared out during forced gc (bringOutYourDead and snapshots). See https://github.com/Agoric/agoric-sdk/issues/6784#issuecomment-1428041762

Edit: I updated the issue here to hopefully clarify the gc revealing story.

FUDCo commented 1 year ago

Thought: if a majority of a quorum of validators approves the results of a replay, the others could get the results via state sync rather than replaying themselves. If replays of different vats can be executed independently, you might be able to get some additional scaling by farming out different vats to different subsets of the validator population.

mhofman commented 1 year ago

you might be able to get some additional scaling by farming out different vats to different subsets of the validator population.

Unfortunately for consensus, we're in an all-or-nothing situation. A single validator needs to come up with all the right answers. There is no way to vote partially on the result.

FUDCo commented 1 year ago

Yeah, this would be something like a mainnet 4 thing, when we start branching off interweaving sub chains and whatnot for scaling. I could imagine entities bidding for which vats should get priority in upgrade much as we anticipate bidding for priority in message delivery.

mhofman commented 11 months ago

A note that some changes to XS may end up having spec-mandated execution differences, and thus be directly observable by the program. While unlikely, this highlights that a replay-based upgrade is not 100% foolproof, and that only an XS upgrade requiring a restart/upgrade of the vat is safe (see #8405). More details in https://github.com/Agoric/agoric-sdk/issues/6929#issuecomment-1744147408
