Agoric / agoric-sdk

monorepo for the Agoric JavaScript smart contract platform
Apache License 2.0

define worker-v1 package #6596

Open warner opened 1 year ago

warner commented 1 year ago

What is the Problem Being Solved?

To give us a coherent solution for #6361 , I think we need to define a new package that includes all the behavior-influencing parts of our XS-based workers. This package will have -v1 in the name, not the version, and in fact we will only ever publish a single version of this package: all subsequent changes will go into a new package, with -v2 in the name, which will also only get one release, etc.

Then we change SwingSet to depend upon this -v1 worker package, and add a note to the kernelDB per-vat metadata to record the fact that the incarnation is using v1, rather than some earlier or later version. Any version of SwingSet can host this worker and its history (especially its XS heap snapshots) without fear of behavior changes.

The nominal name of the new package is @agoric/worker-v1, but we might want something more swingset-specific.

Description of the Design

@agoric/worker-v1 will need to contain or import the contents of:

The kernel-facing API of this package should let it launch a worker and get back control facets to send deliveries down and process syscalls that come back.

Security Considerations

Test Plan

warner commented 1 year ago

Perhaps we should make the bundle files for this package during the build process (and publish them to NPM), so their contents are well-specified. Every yarn build is expected to create exactly the same contents. Endo's bundle-source includes version numbers in the generated bundle (as part of the component map), so the following pieces must be stable:

Having a yarn.lock should enforce that. There will be a very limited set of changes that could possibly be made (between the original worker-v1@1 and a subsequent worker-v1@2 release) that can still maintain this property, and none of them can affect the behavior of the generated bundle. E.g. we might find a new Endo version of bundle-source that bundles twice as fast as before: we can use that, as long as the bundles it creates are identical to the version used by @1. But a new bundle-source that creates bundles half the size as @1 could not be used.
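To make the intent concrete, a build step along these lines could generate and hash the bundle at yarn build time. This is a minimal sketch; the entry-point path, output file names, and hash encoding are illustrative assumptions, not the actual build script:

// build-bundle.js: sketch of a build-time bundling step (assumed paths).
import fs from 'fs';
import { createHash } from 'crypto';
import { fileURLToPath } from 'url';
import bundleSource from '@endo/bundle-source';

const entry = fileURLToPath(new URL('./src/entry.js', import.meta.url));
const bundle = await bundleSource(entry);

// Any change in dependency versions, file paths, or bundle-source output
// shows up as a different hash.
const serialized = JSON.stringify(bundle);
const bundleSHA = createHash('sha256').update(serialized).digest('hex');

fs.mkdirSync('dist', { recursive: true });
fs.writeFileSync('dist/lockdown-bundle.json', serialized);
fs.writeFileSync('dist/bundle-info.json', JSON.stringify({ bundleSHA }));
console.log(`lockdown bundle hash: ${bundleSHA}`);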

kriskowal commented 1 year ago

We cannot rely on yarn.lock to pin exact versions of the transitive dependencies of an *-v* package in a workspace where more than one version shares a yarn.lock, since yarn is free to solve the dependency graph in a way that satisfies the most recent version. We either need these versioned worker packages to contain a checked-in bundle and exact xsnap binary, or we need each of these versions to have an independent yarn.lock. We should jam on this.

kriskowal commented 1 year ago

(That is, if yarn.lock is free to change after worker-v1 gets captured, worker-v1 behavior will change.)

warner commented 1 year ago

(hm, ok that may or may not torpedo the following plan)

I've spent more time thinking about this. The constraints are:

We have two bundles used to start an xsnap worker. The first is named "the lockdown bundle", and currently the entry point is SwingSet/src/supervisors/subprocess-xsnap/lockdown-subprocess-xsnap.js , which is a one-line library that imports @agoric/xsnap/lib/ses-boot.js, which is a 5-line library that does:

import { setObjectInspector } from './console-shim.js';
import '@endo/init';
import confinedObjectInspect from './confined-object-inspect.js';
setObjectInspector(confinedObjectInspect);
harden(console);

The second is named "the supervisor bundle", and the entry point is supervisor-subprocess-xsnap.js, which imports liveslots, creates an instance of liveslots, and prepares the worker to receive netstring commands.

Our bundles include the package name, package version, and relative pathnames of every file in the bundle, including the entry point. To generate identical bundles, it is not sufficient for the entrypoint file to have the same contents: it must also live in the same package, of the same package version, and at the same relative path.

That means swingset-worker-xs-v1 cannot hold the entrypoint file, since we want two separate versions of this package to be capable of using the exact same bundles (controlled differences). As @kriskowal pointed out, this really means the bundles should be generated as part of a build process, and published to NPM (with of course some simple process to regenerate them locally to compare against the published versions).

The Plan

So that makes me think the following plan could work:

[diagram: worker-v1 #6596 - Frame 1]

The light blue boxes represent packages, at various versions. The blue connecting lines are cross-package dependencies. The white boxes are JS files or JS-level exports of the package, and the black connecting lines are import statements.

The @agoric/swingset-kernel package (originally swingset-vat) is the anchor. The first version we publish only supports "worker-v1", but a later version could support both -v1 and -v2.

The "worker-v1" support is provided by a package named @agoric/swingset-worker-xs-v1. This depends upon three other packages.

The first is xsnap. This lives in a package named @agoric/xsnap at some specific version, which provides both the xsnap binary and the JS libraries which talk to it.

The second is a package named @agoric/xsnap-lockdown. This provides the lockdown bundle. In particular, it contains ses-boot.js, which imports Endo, and configures the console (to accommodate the XS print function that provides the sole access to stdout). The lockdown package depends upon some particular version of Endo, by virtue of a package.json dependency on the @endo/init package.

The lockdown package exports two API functions. The first is getBundleURL(), which returns a URL that points to a pre-generated bundle file. (maybe we make it async for future flexibility). The second is a getVersions() that returns an object with at least { version, bundleSHA } strings. The bundle would be generated by calling bundleSource() on an entry-point file that lives inside the lockdown package, at npm build time, so the bundle would be included in the tarball that gets uploaded to NPM. At the same npm build time, we hash the bundle and record it in a generated file that is consulted by getVersions(). Clients should, at the very least, record the expected bundleSHA in their own source code, and compare it against the retrieved value, at each application startup. More cautious clients should also independently hash the file at getBundleURL() and assert that it matches their expectation. These are to prevent compatibility/determinism surprises in case something in the lockdown package (or further downstream) gets updated somehow, like if yarn computes a different dependency solution. When a client decides to switch to a new version of the lockdown package, they should update their expectations accordingly.

Alternatively, we might choose to have getBundleURL() perform the bundling process in that moment, writing the contents to some cache directory. However the hash should remain the same. Using this approach would make it more obvious that the @agoric/xsnap-lockdown version also depends upon the @endo/bundle-source version used to make the bundle; however, it is common that a newer bundle-source could still produce the same bundles, whereas it is exceedingly unlikely that bundling a newer @endo/init package could do the same.
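Either way, a client-side check against the getBundleURL()/getVersions() API sketched above might look roughly like this (the expected hash constant is made up, and the API itself is still only a proposal):

import { readFileSync } from 'fs';
import { createHash } from 'crypto';
import { fileURLToPath } from 'url';
import { strict as assert } from 'assert';
import { getBundleURL, getVersions } from '@agoric/xsnap-lockdown';

// The hash this client recorded when it last deliberately upgraded its
// lockdown dependency (illustrative value).
const EXPECTED_LOCKDOWN_SHA = '9f86d081...';

const { bundleSHA } = getVersions();
assert.equal(bundleSHA, EXPECTED_LOCKDOWN_SHA, 'unexpected lockdown bundle version');

// A more cautious client also re-hashes the published bundle file itself.
const bytes = readFileSync(fileURLToPath(getBundleURL()));
const actualSHA = createHash('sha256').update(bytes).digest('hex');
assert.equal(actualSHA, EXPECTED_LOCKDOWN_SHA, 'lockdown bundle contents changed');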

The third package is the "supervisor package", which will live in @agoric/swingset-xs-supervisor at some particular version. This uses the same build/bundling process as the lockdown bundle, but its entrypoint is a supervisor.js, which has considerably more code, and is quite swingset-specific. This supervisor will import code from the @agoric/swingset-liveslots package (#6863).

The @agoric/swingset-worker-xs-v1 package exports a makeStartXSnap() function. This creates a new xsnap worker and prepares it for vat deliveries. If the options bundle provides a heap snapshot to start from, the new worker is loaded from that snapshot. If not, the new worker is empty, and makeStartXSnap() instructs it to evaluate the lockdown and supervisor bundles.
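Roughly, a kernel-side caller might use it like this (the option names and helper shape are guesses at an API, not a settled design):

import { makeStartXSnap } from '@agoric/swingset-worker-xs-v1';

// A hypothetical kernel-side helper; `snapStore` and `spawn` stand in for
// the kernel's own host services, and the option names are illustrative.
async function startVatWorker({ snapStore, spawn }, vatID, snapshotID) {
  const startXSnap = makeStartXSnap({ snapStore, spawn });
  if (snapshotID) {
    // Resuming from a heap snapshot skips bundle evaluation entirely.
    return startXSnap(vatID, { snapshot: snapshotID });
  }
  // Fresh worker: makeStartXSnap has it evaluate the lockdown and
  // supervisor bundles before accepting deliveries.
  return startXSnap(vatID, { metered: true });
}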

XS Compatibility Levels

When considering moving from one version of XS to another, we must think about how similar the new version behaves, compared to state or actions created under the previous version. There are two things for the new behavior to be compatible with:

A failure of new-delivery consistency would cause a validator to fall out of consensus. A failure of replay consistency might be caught by the individual validator as it noticed a discrepancy between the transcript-recorded syscalls (or delivery results) and the replay.

Now we can classify potential changes in XS behavior into 4-ish categories:

Changing XS

The compatibility level of XS determines how we might deploy it to the chain.

A-level compatibility

Since A-level changes to XS do not cause observable behavior differences, we should be able to deploy them at any time. We would publish a new version of @agoric/xsnap with a micro version number bump (e.g. 1.0.0 to 1.0.1), which points to a different git-submodule commit hash of agoric-labs/moddable (for XS) or agoric-labs/xsnap-pub (for xsnap itself). Then we'd publish an @agoric/swingset-worker-xs-v1 package with a micro version bump, and we'd change @agoric/swingset-kernel to reference the new worker package. @agoric/swingset-worker-xs-v1 would continue to reference the same lockdown and supervisor packages as before.

Validators could switch to the new kernel at their leisure, without requiring a coordinated chain upgrade event.

B-level compatibility

If the XS changes are classified as "B-level", then the upgrade process requires conformance testing (and a little bit of luck, given that it might take a week from the time testing must stop to the time a chain-upgrade vote can complete). We'd produce the same packages as above, but we'd bump the minor version (e.g. 1.0.0 to 1.1.0), to indicate the lower level of compatibility. All validators would need to move to the new @agoric/swingset-kernel version at the same time.

(note: maybe the minor version isn't the right component to change; we might need four levels of version number to capture all of this).

C-level compatibility

For C-level changes, the kernel will need to run both old and new versions for some period of time (since not all vats will reach their BOYD+heap-snapshot point at the same moment, and that moment certainly doesn't happen at the moment of chain upgrade). Therefore we must allocate a new @agoric/swingset-worker-xs-vNNN number (e.g. adding a package named @agoric/swingset-worker-xs-v2). This new package would depend upon a new xsnap version that can have a major(??) version bump.

The new kernel version would depend upon both -v1 and -v2. The kernel would know (somehow? maybe the getVersions() return values??) that this upgrade must be performed at a heap-snapshot boundary, and would schedule a change from -v1 to -v2 at that point.
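As a sketch of the bookkeeping (the kvStore key name, version strings, and decision point are all assumptions):

// Called when a vat reaches a BOYD + heap-snapshot boundary; only then is it
// safe to switch that vat's recorded worker from -v1 to -v2.
function maybeSwitchWorker(kvStore, vatID, targetWorker) {
  const key = `${vatID}.workerVersion`;
  const current = kvStore.get(key) || 'xs-v1';
  if (current !== targetWorker) {
    // The next worker start loads from the fresh snapshot, so no transcript
    // made under the old worker needs to be replayed by the new one.
    kvStore.set(key, targetWorker);
  }
}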

D-level compatibility

For D-level changes, we definitely need separate -v1 and -v2 packages. The kernel would wait until the next baggage-style vat-upgrade to swap in the new version. Only a change at this level is allowed to change the lockdown (SES/Endo) or liveslots bundles.

It seems likely that the kernel might not automatically upgrade vats across this boundary. Instead, the kernel upgrade changes the default "what worker should new vats use?" value, which can also be the value used when upgrading an existing vat. Then, userspace (e.g. contract governance committee action) causes a vat upgrade for each vat, and the bundles change during that upgrade.

Changing Endo/SES, or Liveslots

We create a new version of @agoric/swingset-liveslots or Endo, then we make a new version of @agoric/swingset-xs-supervisor or @agoric/xsnap-lockdown. Then we make a new package like @agoric/swingset-worker-xs-v2, just like in the "D-level" change above. The kernel depends upon both -v1 and -v2, and a userspace-triggered upgrade takes advantage of the new versions.

warner commented 1 year ago

So here's a plan of attack:

mhofman commented 1 year ago
  • make a copy of supervisor-helper.js in there too

It would be great if during this process every file ended up as a git move so we can keep history somewhat traceable.

warner commented 1 year ago
  • make a copy of supervisor-helper.js in there too

It would be great if during this process every file ended up as a git move so we can keep history somewhat traceable.

Agreed, although I don't know how to make copies trigger git's heuristic. OTOH, if you're following it with git log --follow, you'll probably see it "created from nothing" at the commit that actually copied it, and I'll make sure the comment on that commit cites the place it came from, so at least the human running git log will know what second command to run to follow it further back.

warner commented 1 year ago

discussing this with @kriskowal , here are some notes:

We agreed that rearranging the packages as drawn above (except stick with a single @agoric/swingset-worker-xs for now, rather than distinctly-named variants) is a good starting point, as it is work that needs to be done in any of our sketched-out plans. So I'll start there.

warner commented 1 year ago

I'm thinking about how the repos and packages and workspaces ought to work, and how a "test against current versions of everything" (#7056) should work.

Tools at our disposal:

The new package dependency graph is:

graph TD;
  swingset-kernel --> swingset-worker-xsnap-v1;
  swingset-worker-xsnap-v1 --> xsnap;
  swingset-worker-xsnap-v1 --> xsnap-lockdown;
  xsnap-lockdown --> endo/init;
  swingset-worker-xsnap-v1 --> swingset-xsnap-supervisor;
  swingset-xsnap-supervisor --> swingset-liveslots;

and the goal is to allow swingset-kernel to depend on multiple versions of swingset-worker-xsnap-vNN at the same time, each of which will need very specific versions of its dependencies.

"thorough but painful" approach

One potential approach, with obvious drawbacks:

This would achieve our stability goals, but development would be horrible: every commit would have to go through NPM before the parent package could use it. We'd be publishing broken versions because we couldn't actually test the combination of new-child + new-parent before committing to the new-child patch.

"special kernel development workspace" approach

We might create a separate repo for use by kernel developers. It would define a yarn-v1 workspace at the top, and use git submodules to link in the five child packages so they'd all sit next to each other inside the workspace. If we made the cross-package dependency version constraints more lenient, then yarn install would satisfy them with symlinks. That would let us make matching changes in both child and parent package without going through NPM for each commit.

One problem is that git submodule leaves each embedded git repo checked out on a "detached HEAD": it records the commit ID (hash), but not a branch name. So when a developer commits a change in one repo, it isn't associated with any branch (like master or a feature branch to submit in a PR). We'd have to go through some ugly hoops: after the initial git submodule update clones everything and switches to the recorded commit ID, we'd need a script to change the submodules to a named branch. We'd also need it to change the git remote specification to allow developers to push to GitHub from the submodule (normally the .gitmodules file refers to a read-only HTTPS URL).

The non-kernel agoric-sdk code would continue to use a kernel from NPM, making a boundary between swingset and its clients that would have to go through NPM for each change (same problem as above, but at a different boundary, hopefully one which changes more slowly). Devs who want to test both sides could use yarn link to associate the two repos.

"yarn link all the things"

Another approach would be to use yarn link to associate multiple git repos within a single developer's working environment:

Depending on the versions and constraints, this would replace e.g. swingset-xsnap-supervisor's node_modules/@agoric/swingset-liveslots/ subdirectory (as installed from NPM) with a symlink to the nearby repo. We could then make changes in liveslots, cd to swingset-xsnap-supervisor, re-run yarn build to get a new bundle, etc.

When it comes time to publish a version to NPM, we'd use yarn unlink PACKAGENAME to remove all the symlinks, then do npm publish of each package, from the bottom of the graph and upwards.

"one repo but mess with workspaces a lot"

In this approach, we continue to use a single monorepo, but we use a release process that temporarily changes the definition of the workspace.

Development can continue in the tree, and running tests with WORKERTYPE=dev (or however we spell the environment variable that tells the kernel what managerType/workerType to use by default) will get the local versions (assuming you remembered to yarn build to update the bundles). Running tests without the environment variable will get worker-xsnap-v1 (or whatever the latest published version is).
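(Something as simple as the following, where the variable name and fallback value are placeholders rather than the final spelling:)

// Sketch: tests pick the default worker from the environment, falling back
// to the most recently published worker package.
const defaultManagerType = process.env.WORKERTYPE || 'worker-xsnap-v1';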

When it comes time to release the swingset packages, the sequence will look like:

Note that all of the swingset package release tags will point to commits on the side branch, and each will point to a distinct commit.

We must also change the agoric-sdk release/publish process:

The swingset packages will spend most of their time at .dev0 versions. In particular, the swingset-worker-xsnap package will be at .dev0, which means the kernel package's worker-xsnap-v1 alias dependency will not be satisfied by the version in the workspace, and can only be satisfied by the NPM-published version (with its stable hashed pre-bundled contents).

"something using a newer Yarn"

We're currently using yarn-v1, and support for workspaces has improved significantly in the later versions (currently yarn-v3.5.0). One improvement is the ability to use nested workspaces, called "worktrees". We could use this to nest a set of swingset packages in one workspace, under the general agoric-sdk workspace.

Another improvement is a dependency constraint syntax that says "only use something from my same workspace", spelled like workspace:*. This would let e.g. swingset-worker-xsnap depend on "swingset-supervisor-xsnap": "workspace:*", so the local development would strictly use code from within the tree. In addition, when publishing packages, yarn can dynamically replace those constraints with versions being exported by the dependencies. This would remove the need to manually change package.json files back and forth during the release process.
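For example, a sketch of what the worker package's manifest might contain under a newer yarn (package names follow the graph above; this is not an existing file):

{
  "name": "@agoric/swingset-worker-xsnap",
  "dependencies": {
    "@agoric/xsnap": "workspace:*",
    "@agoric/xsnap-lockdown": "workspace:*",
    "@agoric/swingset-xsnap-supervisor": "workspace:*"
  }
}

At publish time, yarn rewrites each workspace:* range to the exact version of the workspace package it resolved against, so the published tarball carries ordinary pinned dependencies.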

which one?

I'm still trying to figure out a good plan here.

warner commented 1 year ago

Oh, before I forget, any change which creates new repos or workspaces will need to consider the patches/ directory, which (in conjunction with patch-package) we use to modify some of our dependency packages before use. We patch a lot of Endo packages, and those will be incorporated into the xsnap-lockdown bundle.

FUDCo commented 1 year ago

I haven't completely absorbed this, but here are a couple of somewhat scattered thoughts which may not reflect the strongest possible understanding of the yarn/npm/git/blahblah toolchain:

mhofman commented 1 year ago

This is the first time I'm actually thinking about this problem, and I may be missing something, but I don't understand the direction taken right now, or the process taken to get there. In particular, a single "worker-v1" package amalgamating xsnap, endo, liveslots, and various dependencies.

I'm having an easier time thinking of this problem in terms of API surface and upgradable units, and starting from the program itself. I also want to think about what the right conceptual approach should be, even though it may not be feasible right now without other work throughout our stack.

Code running in a worker

Vat program

If we look at a JavaScript program executing in a vat, it has dependencies on the environment created by liveslots and endo, and possibly a dependency on a certain set of JS features. The program may also have dependencies on remote objects or devices, which are not part of the environment, but may be part of the initial execution of the program (currently defined as buildRootObject for a liveslots program). What it doesn't have is a direct dependency on xsnap, the kernel itself, or cosmic-swingset.

While these are dependencies in the conceptual sense, from the perspective of the program's package.json, all these are part of the execution environment, and not part of the package's "dependencies". There are two other package.json features that are a better fit for expressing these dependencies: "peerDependencies" and "engines". "peerDependencies" is used for packages that are relied on by the program (possibly imported), but which are expected to be installed / provided by the surrounding environment. "engines" roughly corresponds to the set of features available ambiently (globals, but also "built-in" packages).

One approach would be to have an "endo" engine version which describes the set of hardened JS features available in the locked down execution environment (as a combination of JS features implemented by the XS engine, and the modifications brought by @endo/init). To express how the program is executed, we'd need to use a second "engines" field indicating this program is meant to be invoked through "liveslots"'s buildRootObject. These "engines" dependencies would not result in code being included in the bundle of the program. However we'd need these "engines dependencies" to be documented by the bundle so they can be at least verified for compatibility by the kernel. I believe this information is currently lost when the bundle is created.
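For illustration, a vat program's package.json under this scheme might declare something like the following. The "endo" and "liveslots" engine entries are the hypothetical ones proposed here, not fields any current tool interprets, and the peer-dependency name and all versions are placeholders:

{
  "name": "my-vat-program",
  "engines": {
    "endo": "^1.0.0",
    "liveslots": "^1.0.0"
  },
  "peerDependencies": {
    "@agoric/some-imported-peer": "^1.0.0"
  }
}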

Zoe contracts

But looking at liveslots programs is only part of the picture. Most of the programs running in a vat are actually Zoe contracts, which rely on a specific version of the Zoe Contract Facet to invoke them (through an exported start or prepare) with a ZCF object of the right shape passed as argument. In that sense, the "engines" field of these programs should actually be describing ZCF, not liveslots. However Zoe contracts still likely rely on features provided by "liveslots", and an "endo" flavored hardened JS execution environment. This raises the question of how these "external" dependencies should be expressed, and which part of the system is responsible for enforcing them.

Let's take an example. Let's say ZCF itself relies on the basic "liveslots@^1.0" feature set and documents it as such in its "engines". However we introduce a new liveslots feature in a backwards compatible way (e.g. a new global), and bump the liveslots version to 1.1.0. The kernel/liveslots deals with executing ZCF, and could set up or verify the version of liveslots compatible with ^1.0. However the kernel/liveslots doesn't know anything about the contract or its requirements. ZCF would need to be the one responsible for checking that the current version of liveslots is compatible with the contract's requirements; however, ZCF itself cannot set up the environment, as it's already executing in it. Only the vat creator has that opportunity.

There is also the question of how the Zoe contract expresses its reliance on liveslots@^1.1.0. Since it's not executed directly by liveslots, the "engines" field may not be appropriate anymore, and "peerDependencies" may be a better fit. However it seems that endo's bundler may currently attempt to bundle peerDependencies the same as regular dependencies (see https://github.com/endojs/endo/pull/1355).

Supervisor bundle

On the other side, we have the supervisor bundle. This currently evaluates as a script, but it may not stay like that (for example it could transition to an endo worker which I believe expects a main export). It relies on some JS features and a hardened JS environment with Compartment for example, as well as some globals to interact with the runtime and kernel (e.g. issueCommand and metering for xsnap workers). It also contains and sets up "liveslots" which will be used by the program.

For an xsnap supervisor, an "xsnap" entry for the "engines" would make sense to document the ambient APIs required. Similarly, if we introduce sqlite directly in the worker process, having a way to document this external dependency would be needed. However what may not make sense as an "engines" requirement is the protocol version that liveslots speaks with the kernel over issueCommand/handleCommand, as this is something that fits the definition of "peerDependencies" much more closely. Regardless, this "protocol dependency" is a requirement that liveslots and thus the supervisor bundle would need to express, and it needs to be fulfilled by the kernel when creating the vat. At some point we'll likely need the kernel to support multiple versions of this liveslots protocol. However this protocol version may be incremented independently of the liveslots version itself as required by the vat.

For example, we may move vatstore inside the worker, which would likely be a breaking change in the liveslots protocol between the kernel and the supervisor, as well as possibly in the protocol between the kernel and the worker process. However this may be transparent to the vat program itself, and only necessitate a minor version update of "liveslots" from its point of view.

Worker

This is the final and lowest-level component in a vat, the one providing a hardened JS execution environment for the supervisor bundle, as well as the low-level transport with the kernel (some of which is exposed to the supervisor, like the commands interface, and some which is only exposed to the kernel, like snapshots). As such it may provide two "engines" versions to the supervisor (e.g. "endo" + a worker-specific one like "xsnap"), but it also needs some independent versioning with the kernel.

For example, a breaking change in an xsnap worker would be to change the snapshot format, but that would have no bearing on the supervisor or vat bundles executing in the vat (beyond the fact that changing the xsnap worker version requires going through the "upgrade" process for the vat program since we cannot restore the program's heap).

On the other side, a change to the metering API may be a breaking change between the xsnap worker and the supervisor, but may not impact the transport between the xsnap-worker process and the kernel.

Version requirements and checks

All this suggests that when creating or upgrading a vat, the "endo" and "liveslots" versions should be specified explicitly or derived from the bundle. However the xsnap or liveslots protocol versions are in fact transitive dependencies that should not be expressed or checked at that layer. The major version of the worker would not be part of the options either: a null upgrade would implicitly use the latest major version of the worker compatible with the required "endo" and "liveslots" versions, and it then becomes a question of who triggers the null upgrade.

Swingset would need to handle multiple supervisor bundles and worker major versions. However these two don't have to be merged together (but can be, to simplify things). More importantly, their versions are not directly equivalent to the "endo" and "liveslots" versions that the vat bundle requires.

This points to the need for the following mechanisms:

For example we could have the following layering:

At first, only the version 1.1.3 of xsnap-worker exists.

In a chain upgrade, version xsnap-worker@1.1.4 is introduced. The version is fully compatible, and is automatically used to restart existing workers.

In another chain upgrade, xsnap-worker@2.0.0 is introduced, which switches to native compartments. Given the change, this is a breaking change for both snapshot content and the lockdown bundle. A new version of xsnap-supervisor is introduced which only bumps the endo dependency version after verifying compatibility. A null upgrade of the vats is required. A new vat not compatible with the new endo version would still use the old worker, but new or upgraded vats which are compatible can switch.

Handling multiple versions

This requires the kernel to be aware of multiple versions of xsnap-worker / supervisor. Right now those cannot be introduced dynamically, so we can rely on source code changes in the kernel every time the xsnap-worker or supervisor changes.

One approach could be to always rely on NPM-installed versions for the non-latest package, and some dependency aliasing so that the kernel can depend on multiple versions of these packages. This does make it complicated to do patch or minor updates to the non-latest version, and does not prevent the supervisor from changing due to repo effects, unless the supervisor is published as a bundle.

However since we've disconnected the xsnap package and supervisor versions from the endo and liveslots versions they provide, I'm wondering if we actually need publishable node packages for these.

On the xsnap side, we really have a binary and a JS spawner. The binary is mostly in an external repo already, and we could keep the above scheme, aka install a binary version published by an xsnap repo which just builds xsnap-pub with the right moddable SDK. For "latest" work, we could keep a way to build the binary from a submodule instead. Then on the JS spawner side, we could have separate non-published packages, or even more simply source files, that correspond to the various "xsnap-worker" versions, which would internally reference the required binary. These JS spawners could share some source code where applicable.

Similarly, for the supervisor, if we're careful to not have dependencies influenced by the rest of the repo, we could keep the different versions as sibling folders and files in the repo. To ensure we don't break our requirements of no changes to previous versions, we could either check-in the resulting bundle, or a hash of the expected bundle, for each version.

warner commented 1 year ago

I had a good brainstorming session with @michaelfig and @mhofman this morning, and I think we have a better plan than before:

That will get us stability of the two bundles for the lifetime of a vat incarnation. The stability is low-level (precise bundle contents, as opposed to a version string), which is good. We won't directly have a semver/human-meaningful name for that behavior: we'll talk about "the liveslots that shipped with agoric-sdk@1234" instead of "liveslots@456", but that's probably ok.

When we perform the agoric-sdk release process, we'll get a new semver version string for each bundle-creating package, and we'll create (and hash) a bundle as well. We won't learn the bundle's hash until that moment. The bundles ought to be reproducible: clone the SDK at that commit (which includes a particular yarn.lock), run yarn install to get all the same packages/versions into node_modules, and then yarn build should generate a bundle from all the same components.

Updating something in the agoric-sdk (e.g. with yarn add, modifying the yarn.lock) may or may not change what goes into the bundles. We believe that yarn's dependency resolver tries hard to retain the original versions: consider a package that lives in the transitive dependency set of both xsnap-lockdown and e.g. zoe; if zoe is changed to require a newer version (one which would still be accepted by the xsnap-lockdown constraint), we think yarn will include both versions rather than de-duplicate them (by increasing the version that xsnap-lockdown gets to one which meets both packages' constraints). However we may want to deduplicate, just to clean things up, and that certainly would change the files that go into the bundle. We should be able to inspect the bundle hash (printed to stdout during yarn build) to tell whether something changed or not.

TODO: make sure the generated bundles are included in the NPM package (run yarn pack and inspect the tarball), and change the package.json "files" section if necessary.

After the vaults release, we'll need to decide how we get the second version of the bundles into place. One option is a controller.updateBundles() call, which would:

Other options/variants:

Note: a top-level yarn build will build each package in alphabetical order. We don't have bundles that include other bundles, but if we ever did, we'd need to make sure the leaves got built before the higher nodes of the dependency/inclusion graph, so e.g. xsnap-lockdown and swingset-xsnap-supervisor need to be built early.

xsnap

For now, we won't record any per-vat metadata to describe which version of xsnap (and thus which version of XS) is used. All workers will use the same xsnap, regardless of whether they were created weeks ago or just now.

This will constrain our ability to change XS/xsnap: any change we make must be sufficiently compatible to not cause problems when replaying a transcript (the old XS and the new XS must match transcripts) (this may be easier once #7142 is fixed and we're more confident that liveslots ignores GC timing variations).

We'll pretend that there's a per-vat metadata key that says which version of xsnap to use, and that it's currently set to "0" (i.e. lack of a key means "0"). Later, when we want to introduce a not-so-compatible version of XS/xsnap, we'll start using this key. At that point, we'll either need separately named xsnap packages (@agoric/xsnap-v1 vs @agoric/xsnap-v2), or we'll use the package.json > dependencies: aliasing feature to let the kernel code import multiple package names simultaneously, even though the package.json maps those onto different versions of the same @agoric/xsnap package name.
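The aliasing feature would look roughly like this in the kernel's package.json (versions are illustrative):

{
  "dependencies": {
    "@agoric/xsnap-v1": "npm:@agoric/xsnap@^1.0.0",
    "@agoric/xsnap-v2": "npm:@agoric/xsnap@^2.0.0"
  }
}

so the kernel can import '@agoric/xsnap-v1' and '@agoric/xsnap-v2' and get two different installed versions of the one published package.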

That is the point at which we'll figure out the semver attributes of these different xsnap packages, what "minor version has changed" means w.r.t. compatibility, etc. We can avoid committing ourselves to some of those details until we need the second version/package.

At that time, we'll update the kernel to record "current xsnap version identifier" into the DB, and use it when launching the worker. My original plan was to change the managerType into a worker field, giving it values of local and xsnap-v1, and in that approach, this is the point where we'd introduce xsnap-v2 (which would encompass all of XS, xsnap, lockdown, supervisor/liveslots).

In our new thinking, xsnap-v2 could refer to just the xsnap binary and the embedded XS library, letting you control both lockdown and supervisor as independent knobs.

Currently, packages/xsnap/ hosts two git submodules, one named moddable (pointing at https://github.com/agoric-labs/moddable.git, our fork of the XS SDK), and a second named xsnap-native (pointing at https://github.com/agoric-labs/xsnap-pub, our fork of Moddable's xsnap repo). We're thinking it would be a good idea to make xsnap-pub host the moddable git-submodule, rather than packages/xsnap/ (changing the git-submodule graph from (A->B, A->C) to (A->B->C)). packages/xsnap/ knows how to build the xsnap binary, and knows the path to that binary, and has the API code which spawns a child process from that binary, plus the API code which speaks the netstring pipe to it.

Then, the git commit ID of the xsnap-pub submodule can serve as a name for the xsnap+XS behavior, and the commit ID of the lower-layer moddable submodule is a narrower name for the XS behavior, either of which might wind up in the per-vat metadata some day.

mhofman commented 1 year ago
  • when the kernel is initialized, we'll sample the lockdown and supervisor bundles, store them in the bundleStore under their hash IDs, and add a pair of kvStore keys to remind the kernel of what the "current lockdown/supervisor bundle ID" are

I believe @michaelfig suggested we could do this lazily and just load the bundle and hash from disk when needed by a new vat.

a top-level yarn build will build each package in alphabetical order

Is that true for all packages or only the ones with circular dependencies? I thought yarn correctly handled the order for packages that have no circular deps.

warner commented 1 year ago
  • when the kernel is initialized, we'll sample the lockdown and supervisor bundles, store them in the bundleStore under their hash IDs, and add a pair of kvStore keys to remind the kernel of what the "current lockdown/supervisor bundle ID" are

I believe @michaelfig suggested we could do this lazily and just load the bundle and hash from disk when needed by a new vat.

Yeah, as I wrote up my notes, I got worried about when exactly the DB writes would occur, and how we explain this to host-application authors. Let's see, I can think of three approaches:

I need to be able to write documentation aimed at host-app authors to explain when state changes might happen, and what their hostStorage.commit() responsibilities are. We need well-defined windows of time during which the kernel might make changes to the swingstore, after which the host app must do their commit.

"A" is the one that most closely matches the current pattern. It would always sample and stash the bundles during initializeSwingset(), at the same time it populates a bunch of other swingstore data (config.vats, bootstrap message on otherwise-empty run-queue, etc), and the host app already knows it needs to commit after this initialization phase, no surprises there. And requiring an explicit upgrade call would introduce another such phase, but only when the host app wants to do an upgrade. The host app would look more like:

const { hostStorage, kernelStorage } = openSwingStore(dbPath);
if (timeToUpgrade) {
  updateSwingset(kernelStorage);
  hostStorage.commit();
}
const c = buildVatController(kernelStorage);
// .. go run blocks

DB writes are a deterministic function of the package upgrade schedule and the host app calling the "upgrade" function.

"B" might be convenient, but the host app would need to do a commit at every launch, just in case this was the time that a new bundle was found. DB writes would be a function of package upgrade schedule and the controller incarnation schedule (process start to process termination). buildVatController() does not currently make any state changes, and I'm not sure I want it to.

The host startup would look something like:

const { hostStorage, kernelStorage } = openSwingStore(dbPath);
const c = buildVatController(kernelStorage);
hostStorage.commit(); // in case bundle upgrades happened
// .. go run blocks

"C" puts the changes in the middle of a controller.run(), so no problems with knowing when to commit. It would mean that the kernel must be given the ability to read the bundle files (encapsulated in some form of the startXSnap endowment, which also needs access to the bundleStore so it can store the sampled bundle, and read them back again later). The DB write will happen kinda haphazardly, during the first vat creation after each controller restart that happens after the on-disk bundle-supplying packages are changed. If a given controller incarnation doesn't create any new vats, the sampling will not happen, even if the package versions have changed. So the DB writes are a deterministic function of both the package upgrade schedule and userspace actions (creating new vats), which is kind of complicated.

It also adds a new failure mode: if someone launches the kernel, then modifies their source tree (e.g. checks out a different version, maybe temporarily), the DB state will depend on a race between their interference and a vat being created. If we used A or B, any potential sampling would be complete by the time kernel startup finished, so the window of vulnerability (to this admittedly "don't do that" case) is a lot smaller.

My Python t-shirt says "explicit is better than implicit":

% python3
Python 3.9.6 (default, Oct 18 2022, 12:41:40)
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
>>>

But the annoying thing about "A" (explicit) is that we have to add a cosmic-swingset ActionType.UPGRADE_SWINGSET "action" to trigger the upgradeSwingset() call, invoked from the Golang-side upgrade handler function, sent to the JS side, and awaited, before the remainder of the cosmos upgrade can continue. With "B" or "C" all you have to do is change the JS packages on disk. And we removed the bundle-the-kernel-and-store-it step last year because we liked the simplification of "just git checkout a different version and restart".

Hrm. I know I don't want to do "B". I prefer to be explicit. I think I'm ok with e.g. swing-store either spontaneously upgrading itself (when newer code is launched against an older DB), or refusing to run and requiring a separate program be run to do the upgrade, which suggests that maybe I'm ok with the kernel spontaneously upgrading the bundles without an explicit upgrade() call, which means that maybe explicit is not so important to me after all. I'm uncomfortable with deferring the bundle reads until semi-random times long after kernel startup. We're not saving any significant space or time by loading the bundles lazily. But I don't want to implement the extra cosmic-swingset actions or coordination code to trigger the swingset upgrade, nor do I really want to add "call upgrade()" to the instructions for host-app authors (feels like a burden).

Ok, I'll go with "C". At least with these reasons written down, if I change my mind later, I should be able to think more clearly about it.

michaelfig commented 1 year ago

A corollary to "C": do the sampling once on controller initialization, but don't write the results to the DB until the first time they're needed.

mhofman commented 1 year ago

I don't fully follow the timing issues: The packages providing the bundles would be loaded when loading swingset, but their methods to load the bundle and check the bundle hash would be invoked every time a vat is created. The authority is already encapsulated, no need to provide new raw authority to read the filesystem.

If someone modifies their source tree in a way that will cause the existing methods to return new values, then they will just fall out of consensus. In other words: don't change your source tree while executing; any lazy loading will have similar constraints.

Edit: to summarize, "C", but I don't share the concerns.

warner commented 1 year ago

Now that we have a new approach, I'm moving this ticket out of the milestone in favor of the (smaller) #7208 . We might choose to use this approach in the future, but we no longer think we need a distinct worker-v1 package for the Vaults release.

kriskowal commented 5 months ago

I move to close this issue since it is no longer on the plan of record. We can revive the idea if it eventually becomes necessary to have more than one worker version.