Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform

share Zoe/ERTP libraries among contracts #2391

Open · warner opened this issue 3 years ago

warner commented 3 years ago

What is the Problem Being Solved?

@dtribble pointed out that we'd really like to share the Zoe/ERTP libraries between different contracts, so that each one doesn't need to bundle its own copy. This feeds into a fairly large (and really cool) feature, in which vats/programs/contracts are defined as a graph of modules, some of which are shared, some of which are unique to the vat/program/contract, and we amortize space/time/auditing/trust by taking advantage of that commonality.

Currently, each contract is defined by a starting file (in JS module syntax) which exports a well-known start function. Typically this starting file will import a bunch of other modules: some written by the contract author, but many coming from Zoe and ERTP (math helper libraries, Issuer, etc.). The deployment process feeds this starting file to our bundleSource function, which is responsible for gathering all the necessary code into a single serializable "bundle" object. This bundle is then transmitted to the chain, where it can be used to create new contract instances. The bundle is stored (as a big data object) in the Zoe vat when the contract is first registered (#46 is basically about storing this somewhere more efficient). On its way to Zoe, the bundle appears as a message argument in several intermediate steps: the vat that executes the deploy script, several comms/vattp vats, and some cosmos/tendermint transaction messages. Once on Zoe, each time the contract is instantiated, Zoe must send the bundle to a newly-created dynamic vat, which creates several more copies of the bundle data.
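For concreteness, here is a minimal sketch of that flow, assuming the @endo/bundle-source package (the package name and bundle format have shifted across SDK versions, and myContract.js is a hypothetical entry point):

```js
// Minimal sketch of the current bundling step; the exact package name and bundle
// format vary by SDK version, and ./src/myContract.js is a hypothetical path.
import bundleSource from '@endo/bundle-source';

const bundle = await bundleSource(
  new URL('./src/myContract.js', import.meta.url).pathname,
);
// `bundle` is a single serializable object containing the contract's own code
// plus every module it imports (Zoe/ERTP helpers included). This whole object
// is what travels to the chain, is stored by Zoe when the contract is
// installed, and is copied again for each instantiation.
```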

The starting file lives in a node_modules/-style package directory, in which some particular version of Zoe and ERTP has been installed (e.g. node_modules/@agoric/ERTP/ contains those library files). bundleSource follows the usual Node.js rules to consult package.json and find the files to satisfy each import statement. ERTP depends upon several other Agoric-authored modules, and those files get included too. We should build some tools to measure this, but I could easily believe that only 10-20% of the bundle contents come from the contract definition, while the rest comes from common libraries.

The problem is that the resulting resource consumption is the product of three multiplicands: the size of the bundle (which includes both the unique top-level code plus all the supporting libraries), the number of times it appears during the installation/instantiation of a contract (i.e. showing up as arguments in several messages, through several vats), and the number of times it gets installed and/or instantiated.

The #46 blobstore work may reduce the times the data appears (by moving it out of messages and into some deeper kernel-managed data store), but it's still interesting to reduce the effective size of the bundle. The subtask of this ticket is to accomplish that by not storing multiple copies of data that is shared between multiple bundles, so we're only paying the cost of the unique components of each contract.

Simplifying the Solution

I love this topic because it overlaps with the Jetpack work I did many years ago. There are fascinating questions of early- vs. late-linking, programmer intent expressed by import statements as petnames that are mapped through a community-fed (but locally-approved) registry table into hashes of code/behavior, auditing opportunities at various levels of the aggregation/build/delivery/evaluation/execution process, and dozens of other ideas that I've been itching to implement for a decade.

But, to get something done sometime in the foreseeable future, I should narrow the scope somewhat. I'm thinking of a solution with the following pieces:

Starting Points

We can achieve intermediate savings by implementing just portions of this plan. The most important piece is Endo, which lets us supply a module graph as a collection of pieces rather than only as a single monolithic bundle object. When ZCF does an importBundle, it supplies a bunch of objects (maybe a graph and a table of blobs). We still deliver this collection of blobs everywhere (no message-size savings), but Zoe can deduplicate them for storage, so Zoe doesn't consume extra RAM or secondary storage for the redundant copies of the libraries. That'd be the first win.
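A purely illustrative shape for such a split bundle (the format tag, key names, and hashes below are made up for this sketch; none of them are an existing Endo format):

```js
// Illustrative only: a module graph plus a table of module sources keyed by hash.
const splitBundle = {
  moduleFormat: 'splitModuleGraph', // hypothetical format tag
  moduleGraph: {
    // which module specifier resolves to which content hash
    './start.js': 'sha512-aaa...',
    '@agoric/ertp': 'sha512-bbb...',
  },
  moduleBlobs: {
    'sha512-aaa...': '/* source text of the contract start module */',
    'sha512-bbb...': '/* source text of the shared ERTP module */',
  },
};
// Zoe can store each entry of moduleBlobs exactly once, no matter how many
// installed contracts reference the same hash.
```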

The second win comes if the deployment script can send a list of hashes to Zoe and receive back the list of hashes Zoe doesn't already know about. The deployment script then sends only that subset of the component module sources; Zoe hashes them and stores them indexed by their hash. Finally, the deployment script sends the module-graph piece (which references everything else by hash), instead of sending the redundant library data.
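A hypothetical sketch of that exchange (none of these Zoe methods exist today, and computeSha512 stands in for whatever hash function the blobstore ends up using):

```js
import { E } from '@endo/eventual-send';

// moduleSources: { moduleSpecifier: sourceText }, produced by the bundling tools.
const entries = Object.entries(moduleSources);
const hashes = await Promise.all(entries.map(([_spec, src]) => computeSha512(src)));

// 1. Ask Zoe which module blobs it is missing.
const missing = new Set(await E(zoe).whichModuleHashesAreMissing(hashes));

// 2. Upload only the missing sources; Zoe re-hashes and stores them by hash.
for (const [i, [_spec, src]] of entries.entries()) {
  if (missing.has(hashes[i])) {
    await E(zoe).installModule(src);
  }
}

// 3. Send the small module-graph piece, which names every module by hash.
const installation = await E(zoe).installModuleGraph(moduleGraph);
```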

The third win will be to move this storage into the kernel, so Zoe can send blobcaps to ZCF instead of the full sources. We have to figure out the right API for mapping between hashes and blobcaps (it would be nice if userspace code didn't know about hashes, only opaque blobcaps). Once the prerequisites are in place, one approach would be for Zoe to send a big table of blobcaps to ZCF; ZCF would use syscalls to retrieve the bytes for each blobcap into RAM, reconstruct the module-contents table, and then feed the module graph and the contents table to an Endo-based importBundle. A second (better) approach would be for ZCF to give a single module-graph blobcap to vatPowers.importBundle or syscall.importBundle, and have something outside of userspace do the blobcap lookups to find all the components that Endo needs to load the contract module graph.
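A sketch of what the second approach might look like from inside ZCF (a vatPowers.importBundle that accepts a blobcap is hypothetical; today's importBundle takes the full bundle object):

```js
// Hypothetical: hand a single module-graph blobcap to an importBundle power and
// let something below userspace resolve the component blobcaps that Endo needs.
const namespace = await vatPowers.importBundle(moduleGraphBlobcap, {
  endowments: { console },
});
const { start } = namespace; // the contract's well-known entry point
```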

Security Considerations

When we share modules between separate contracts, we are of course only sharing "static module records" (basically the source code of each module). We do not share instances, so that e.g. a Map defined at the top level of some module cannot be used as a communication channel between unrelated contracts. We also want to prevent this sharing between multiple imports of the same module graph within a single vat/contract. We know that more sophisticated tools could make this sharing safe (and perhaps save us some RAM) by checking the module for the "DeepFrozen" property, but that involves a lot of static-analysis language work that we're not going to do in the near future.

To achieve savings, we'll be removing source code from the bundle and replacing it with references to source code that arrive via a different path. This must not enable someone to swap out source code. The use of hash-based identifiers should prevent this, but we must implement it properly: use a suitably secure hash function, and make sure nothing outside the hashed content can influence the resulting behavior. The API for adding blobs to the blobstore should accept data, not a hash, so there is no temptation for the blobstore to merely accept the word of the submitter (instead, the blobstore should compute its own hash, obviously, store the data under that computed hash, and then return the hash so the caller can confirm it matches their expectations).
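A small sketch of that rule (blobstore and computeSha512 are placeholders, not existing kernel APIs):

```js
// The store computes its own hash and returns it; the caller compares the
// returned hash against the one it expected.
async function addBlob(blobstore, bytes) {
  const hash = await computeSha512(bytes); // never trust a submitter-supplied hash
  await blobstore.set(hash, bytes);
  return hash;
}

// Caller side:
//   const hash = await addBlob(blobstore, moduleBytes);
//   assert(hash === expectedHash, 'blobstore returned an unexpected hash');
```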

For now, I think the deployer-submitted module graph should identify all modules by the hash of their contents. A later extension might replace this with e.g. a well-known library name and version identifier, or even a looser constraint like "the best version of libfoo that is compatible with API version 12". This would change the authority model: rather than the developer choosing exactly the source code to use, they would leave that choice up to something on the chain. On the plus side, this gives some later authority (perhaps managed by a governance vote) an opportunity to fix bugs and improve performance without the involvement of the original author. On the other hand, it enables interference by third parties, and expands the end-user's TCB to include those upgraders.

warner commented 9 months ago

https://github.com/Agoric/agoric-sdk/discussions/8416 explores current bundle usage, and determines that we could save 90% if we had this sort of sharing.

warner commented 9 months ago

Some updates in the 2.5 years since we first established this trajectory:

The reduction in cost is a good thing, but it also removes a deterrent against spam and abuse. I'd prefer that we apply some sort of format-check/filtering to the installModule and installCompartmentMap handlers before we expose them.

installCompartmentMap should check that the body is JSON, with only the expected keys and value shapes. We might add more keys in the future, but only if Endo supports them, which will require a new liveslots to hold the new Endo, which will in turn require a chain-software upgrade to deliver the new liveslots, so I think it's safe to be strict about the shape.

installModule could at least check that the body is well-formed UTF-8, but it would be nice if we could also check that it is parseable as a JS module (can we run some small portion of Endo's bundle-source tools on it at install time?). We could further limit abuse by imposing a moderate size limit on each installModule (maybe 100kB), or by applying exponential fees to large ones, to encourage authors to split up their code into smaller (and more-likely-sharable) pieces.
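As a rough illustration of those checks (the size limit, allowed keys, and function names follow the suggestions above and are not an existing API):

```js
const MAX_MODULE_BYTES = 100_000; // ~100kB per installModule, as suggested above

function checkModuleSource(bytes) {
  if (bytes.length > MAX_MODULE_BYTES) {
    throw Error(`module too large: ${bytes.length} bytes`);
  }
  // Must decode as well-formed UTF-8; `fatal: true` makes bad bytes throw.
  // Ideally we'd also confirm it parses as a JS module before accepting it.
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

function checkCompartmentMap(text, allowedKeys) {
  const map = JSON.parse(text); // body must be JSON
  for (const key of Object.keys(map)) {
    if (!allowedKeys.includes(key)) {
      throw Error(`unexpected compartment-map key: ${key}`);
    }
  }
  return map;
}
```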