Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform

share Zoe/ERTP libraries among contracts #2391

Open · warner opened this issue 3 years ago

warner commented 3 years ago

What is the Problem Being Solved?

@dtribble pointed out that we'd really like to share the Zoe/ERTP libraries between different contracts, so that each one doesn't need to bundle its own copy. This feeds into a fairly large (and really cool) feature, in which vats/programs/contracts are defined as a graph of modules, some of which are shared, some of which are unique to the vat/program/contract, and we amortize space/time/auditing/trust by taking advantage of that commonality.

Currently, each contract is defined by a starting file (in JS module syntax) which exports a well-known start function. Typically this starting file will import a bunch of other modules: some written by the contract author, but many coming from Zoe and ERTP (math helper libraries, Issuer, etc.). The deployment process feeds this starting file to our bundleSource function, which is responsible for gathering all the necessary code into a single serializable "bundle" object. This bundle is then transmitted to the chain, where it can be used to create new contract instances. The bundle is stored (as a big data object) in the Zoe vat when the contract is first registered (#46 is basically about storing this somewhere more efficient). On its way to Zoe, the bundle appears as a message argument in several intermediate steps: the vat that executes the deploy script, several comms/vattp vats, and some cosmos/tendermint transaction messages. Once on Zoe, each time the contract is instantiated, Zoe must send the bundle to a newly-created dynamic vat, which creates several more copies of the bundle data.
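For concreteness, here is a minimal sketch of that flow, assuming the @endo/bundle-source package (the package name and bundle format have shifted across SDK versions, and myContract.js is a hypothetical entry point):

```js
// Minimal sketch of the current bundling step; the exact package name and bundle
// format vary by SDK version, and ./src/myContract.js is a hypothetical path.
import bundleSource from '@endo/bundle-source';

const bundle = await bundleSource(
  new URL('./src/myContract.js', import.meta.url).pathname,
);
// `bundle` is a single serializable object containing the contract's own code
// plus every module it imports (Zoe/ERTP helpers included). This whole object
// is what travels to the chain, is stored by Zoe when the contract is
// installed, and is copied again for each instantiation.
```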

The starting file lives in a node_modules/-style package directory, in which some particular version of Zoe and ERTP has been installed (e.g. node_modules/@agoric/ERTP/ contains those library files). bundleSource follows the usual Node.js rules to consult package.json and find the files to satisfy each import statement. ERTP depends upon several other Agoric-authored modules, and those files get included too. We should build some tools to measure this, but I could easily believe that only 10-20% of the bundle contents come from the contract definition, while the rest comes from common libraries.

The problem is that the resulting resource consumption is the product of three multiplicands: the size of the bundle (which includes both the unique top-level code plus all the supporting libraries), the number of times it appears during the installation/instantiation of a contract (i.e. showing up as arguments in several messages, through several vats), and the number of times it gets installed and/or instantiated.

The #46 blobstore work may reduce the times the data appears (by moving it out of messages and into some deeper kernel-managed data store), but it's still interesting to reduce the effective size of the bundle. The subtask of this ticket is to accomplish that by not storing multiple copies of data that is shared between multiple bundles, so we're only paying the cost of the unique components of each contract.

Simplifying the Solution

I love this topic because it overlaps with the Jetpack work I did many years ago. There are fascinating questions of early- vs. late-linking, programmer intent expressed by import statements as petnames that are mapped through a community-fed (but locally-approved) registry table into hashes of code/behavior, auditing opportunities at various levels of the aggregation/build/delivery/evaluation/execution process, and dozens of other ideas that I've been itching to implement for a decade.

But, to get something done sometime in the foreseeable future, I should narrow the scope somewhat. I'm thinking of a solution with the following pieces:

Starting Points

We can achieve intermediate savings by implementing just portions of this plan. The most important piece is Endo, which lets us supply a module graph as a collection of pieces rather than only as a single monolithic bundle object. When ZCF does an importBundle, it supplies a bunch of objects (maybe a graph and a table of blobs). We still deliver this collection of blobs everywhere (no message-size savings), but Zoe can deduplicate them for storage, so Zoe doesn't consume extra RAM or secondary storage for the redundant copies of the libraries. That'd be the first win.
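A purely illustrative shape for such a split bundle (the format tag, key names, and hashes below are made up for this sketch; none of them are an existing Endo format):

```js
// Illustrative only: a module graph plus a table of module sources keyed by hash.
const splitBundle = {
  moduleFormat: 'splitModuleGraph', // hypothetical format tag
  moduleGraph: {
    // which module specifier resolves to which content hash
    './start.js': 'sha512-aaa...',
    '@agoric/ertp': 'sha512-bbb...',
  },
  moduleBlobs: {
    'sha512-aaa...': '/* source text of the contract start module */',
    'sha512-bbb...': '/* source text of the shared ERTP module */',
  },
};
// Zoe can store each entry of moduleBlobs exactly once, no matter how many
// installed contracts reference the same hash.
```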

The second win comes if the deployment script can send a list of hashes to Zoe and receive back the list of hashes Zoe doesn't already know about. The deployment script then sends only that subset of the component module sources; Zoe hashes them and stores them indexed by their hash. Finally, the deployment script sends the module-graph piece (which references everything else by hash), instead of sending the redundant library data.
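A hypothetical sketch of that exchange (none of these Zoe methods exist today, and computeSha512 stands in for whatever hash function the blobstore ends up using):

```js
import { E } from '@endo/eventual-send';

// moduleSources: { moduleSpecifier: sourceText }, produced by the bundling tools.
const entries = Object.entries(moduleSources);
const hashes = await Promise.all(entries.map(([_spec, src]) => computeSha512(src)));

// 1. Ask Zoe which module blobs it is missing.
const missing = new Set(await E(zoe).whichModuleHashesAreMissing(hashes));

// 2. Upload only the missing sources; Zoe re-hashes and stores them by hash.
for (const [i, [_spec, src]] of entries.entries()) {
  if (missing.has(hashes[i])) {
    await E(zoe).installModule(src);
  }
}

// 3. Send the small module-graph piece, which names every module by hash.
const installation = await E(zoe).installModuleGraph(moduleGraph);
```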

The third win will be to move this storage into the kernel, so Zoe can send blobcaps to ZCF instead of the full sources. We have to figure out the right API for mapping between hashes and blobcaps (it would be nice if userspace code didn't know about hashes, only opaque blobcaps). Once the prerequisites are in place, one approach would be for Zoe to send a big table of blobcaps to ZCF; ZCF would use syscalls to retrieve the bytes for each blobcap into RAM, reconstruct the module-contents table, and then feed the module graph and the contents table to an Endo-based importBundle. A second (better) approach would be for ZCF to give a single module-graph blobcap to vatPowers.importBundle or syscall.importBundle, and have something outside of userspace do the blobcap lookups to find all the components that Endo needs to load the contract module graph.
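A sketch of what the second approach might look like from inside ZCF (a vatPowers.importBundle that accepts a blobcap is hypothetical; today's importBundle takes the full bundle object):

```js
// Hypothetical: hand a single module-graph blobcap to an importBundle power and
// let something below userspace resolve the component blobcaps that Endo needs.
const namespace = await vatPowers.importBundle(moduleGraphBlobcap, {
  endowments: { console },
});
const { start } = namespace; // the contract's well-known entry point
```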

Security Considerations

When we share modules between separate contracts, we are of course only sharing "static module records" (basically the source code of each module). We do not share instances, so that e.g. a Map defined at the top level of some module cannot be used as a communication channel between unrelated contracts. We also want to prevent this sharing between multiple imports of the same module graph within a single vat/contract. We know that more sophisticated tools could make this sharing safe (and perhaps save us some RAM) by checking the module for the "DeepFrozen" property, but that involves a lot of static-analysis language work that we're not going to do in the near future.

To achieve savings, we'll be removing source code from the bundle and replacing it with references to source code that arrive via a different path. This must not enable someone to swap out source code. The use of hash-based identifiers should prevent this, but we must implement it properly: use a suitably secure hash function, and make sure nothing outside the hashed content can influence the resulting behavior. The API for adding blobs to the blobstore should accept data, not a hash, so there is no temptation for the blobstore to merely accept the word of the submitter (instead, the blobstore should compute its own hash, obviously, store the data under that computed hash, and then return the hash so the caller can confirm it matches their expectations).
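A small sketch of that rule (blobstore and computeSha512 are placeholders, not existing kernel APIs):

```js
// The store computes its own hash and returns it; the caller compares the
// returned hash against the one it expected.
async function addBlob(blobstore, bytes) {
  const hash = await computeSha512(bytes); // never trust a submitter-supplied hash
  await blobstore.set(hash, bytes);
  return hash;
}

// Caller side:
//   const hash = await addBlob(blobstore, moduleBytes);
//   assert(hash === expectedHash, 'blobstore returned an unexpected hash');
```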

For now, I think the deployer-submitted module graph should identify all modules by the hash of their contents. A later extension might replace this with e.g. a well-known library name and version identifier, or even a looser constraint like "the best version of libfoo that is compatible with API version 12". This would change the authority model: rather than the developer choosing exactly the source code to use, they would leave that choice up to something on the chain. On the plus side, this gives some later authority (perhaps managed by a governance vote) an opportunity to fix bugs and improve performance without the involvement of the original author. On the other hand, it enables interference by third parties, and expands the end-user's TCB to include those upgraders.

warner commented 9 months ago

https://github.com/Agoric/agoric-sdk/discussions/8416 explores current bundle usage, and determines that we could save 90% if we had this sort of sharing.

warner commented 9 months ago

Some updates in the 2.5 years since we first established this trajectory:

The reduction in cost is a good thing, but it also removes a deterrent against spam and abuse. I'd prefer that we apply some sort of format-check/filtering to the installModule and installCompartmentMap handlers before we expose them.

installCompartmentMap should check that the body is JSON, with only the expected keys and value shapes. We might add more keys in the future, but only if Endo supports them, which will require a new liveslots to hold the new Endo, which will in turn require a chain-software upgrade to deliver the new liveslots, so I think it's safe to be strict about the shape.

installModule could at least check that the body is well-formed UTF-8, but it would be nice if we could also check that it is parseable as a JS module (can we run some small portion of Endo's bundle-source tools on it at install time?). We could further limit abuse by imposing a moderate size limit on each installModule (maybe 100kB), or by applying exponential fees to large ones, to encourage authors to split up their code into smaller (and more-likely-sharable) pieces.
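As a rough illustration of those checks (the size limit, allowed keys, and function names follow the suggestions above and are not an existing API):

```js
const MAX_MODULE_BYTES = 100_000; // ~100kB per installModule, as suggested above

function checkModuleSource(bytes) {
  if (bytes.length > MAX_MODULE_BYTES) {
    throw Error(`module too large: ${bytes.length} bytes`);
  }
  // Must decode as well-formed UTF-8; `fatal: true` makes bad bytes throw.
  // Ideally we'd also confirm it parses as a JS module before accepting it.
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

function checkCompartmentMap(text, allowedKeys) {
  const map = JSON.parse(text); // body must be JSON
  for (const key of Object.keys(map)) {
    if (!allowedKeys.includes(key)) {
      throw Error(`unexpected compartment-map key: ${key}`);
    }
  }
  return map;
}
```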