per-run metering limits to terminate contracts causing fast cross-vat infinite loops

What is the Problem Being Solved?

We need to protect the chain against runaway contracts. For now, we've broken the problem into three categories:

infinite loops within a single delivery
a fast cross-vat loop, where two vats ping each other as fast as possible
- in a chain environment, this would fill all blocks with messages between the two vats
- in our chain, our "drain the run-queue before injecting new host work" policy means this will never inject new host work
- cosmos transactions would continue, but no new swingset work would happen
a slow host+vat loop, where a vat does something that requires the host to react (e.g. a timer wakeup, which won't proceed until the host does timer.poll()), but then the host reaction triggers the same vat call
- in our chain environment, this will consume a few cranks in each block, forever, but it won't prevent injection of new host work

To address the second category ("fast cross-vat loops"), we want to kill the misbehaving vat when it's used too much time.

We originally designed a Meter scheme where each vat gets a limited number of computrons, is terminated if the meter underflows, and relied upon something (Zoe, acting on behalf of a contract owner) to refill the meter periodically, requiring payment in a fee token to limit the consumption it enables. We implemented all of this, but backed away from it because the "contracts as small businesses" model didn't feel mature enough, or sufficiently easy to use.

Today, @mhofman had an idea: as a short-term fix, rather than doing pay-for-usage, or even a basic computrons-per-time rate limit, we can use a computrons-per-"run" limit. Contract owners wouldn't need to do anything special, but contracts which don't finish their work within a few blocks would be terminated.

Description of the Design

First, a description of our current tools:

The Swingset vatAdminService.createVat() API accepts a meter option, taking objects created by vatAdminService.createMeter(). Each meter has a current capacity, which can be increased or decreased by the Meter holder. Any vat associated with a Meter (there can be multiple) will cause the capacity to be reduced as it executes: at the end of the delivery, however many computrons were consumed by the xsnap worker will be deducted from the Meter. If that brings the capacity below zero, the vat is terminated and the last delivery is unwound. The parent vat will observe the done promise get rejected with a "vat terminated" message. If we had meter notifiers enabled (we ripped them out during the durablization effort), the subscriber would be told that the meter was exhausted.

Swingset exposes controller.run(runPolicy) to the host application. The runPolicy basically limits execution to a maximum number of computrons: the kernel will perform deliveries until either the run-queue is empty, or the total computrons executed exceeds the policy limit. We (cosmic-swingset, in a governable parameter) picks a computron limit intended to reflect maybe 5 seconds of execution.

For every block, cosmic-swingset performs the following sequence:

call controller.run(), to try and finish any "old business" still on the run-queue from the previous block
- if that exhausts the computron budget, exit, leaving work for the next block
(label A, referenced below)
process actions from the high-priority action queue, one at a time
- for each one, submit it to the kernel, call controller.run()
- if that exhausts the computron budget, exit, leaving work for the next block
if we drain the high-priority queue before exhausting the budget, call timer.poll() (wake up timers), which might push work onto the run-queue, and then do another controller.run()
- if that exhausts the computron budget, exit, leaving work for the next block
now process actions from the low-priority action queue, one at a time
- for each one, submit it to the kernel, call controller.run()
- if that exhausts the computron budget, exit, leaving work for the next block
if we get this far, finish the block, leaving no work for the next block (the run-queue is empty)

Our proposal is to:

add a controller.resetAllMeters(computrons) API to the kernel, which resets all meters to the given limit
change cosmic-swingset to call controller.resetAllMeters(10 * perBlockComputronLimit) at the point marked label A, just after draining the run-queue of old business, just before pulling any new work from the high-priority action queue
change Zoe to create all new contracts with a Meter
- not sure what the initial capacity should be, maybe 10*perBlockComputronLimit, maybe use an UnlimitedMeter, but the capacity will be reset at the next block, so it doesn't matter too much. However, see below about changing UnlimitedMeter values after creation
- note that core/critical contracts will not get a Meter (and they'll be marked as critical), nor will Zoe or other non-contract vats
- by adding a Meter of any kind, we trigger the default per-delivery computron limit of 100M computrons, which protects against the first category of runaway: a single-vat infinite loop
- (we may need to re-evaluate the choice of 100Mc. Note that this value can be overridden by swingset, without changing xsnap.js, but it is set per-worker, not per-delivery, making it difficult to change for pre-existing vats)
- we have to change Zoe anyways to stop adding the CriticalVat flag on all contracts

With this in place, normal execution won't notice anything different: they'll reduce their meter value a little bit during execution, their meter gets reset a block later. Zoe doesn't need to do anything with the Meter after creation (zoe is not responsible for refilling it), nor does Zoe need a notification that it has fallen below a threshold (which is good because we ripped them out).

But imagine a contract that's engaged in a fast cross-vat loop with Zoe or vat-timer. For concreteness, let's say that the per-block limit is 1M computrons, and assume both vat-zoe and the contract vat are spending 100k computrons per cycle.

On the first block that involves this contract, we'll see 5 iterations of this loop before the controller.run() will hit the runPolicy's 1M per-block computron limit, leaving some messages on the run-queue, and ending the block. The contract vat will have consumed 500k computrons, so it's meter will have 10M-500k = 9.5M left. vat-zoe will have consumed the same, but it isn't metered, so there's nothing to deduct.

Now, on the next block, we'll start with the "try to finish old business" phase, and it will exhaust the runPolicy too, finishing the block just like before, leaving the contract vat with 9.0Mc left on its meter. This will go on for 20-ish blocks, at which point the contract vat's Meter will be exhausted. The contract vat will be terminated, zoe will do it's normal cleanup (returning escrowed funds). The run-queue can drain, leaving some per-block budget left, so cosmic-swingset can pull the next piece of new work from the action queue, and useful activity resumes.

If we call resetAllMeters with A times the per-block limit, and each cycle of the loop sees the contract vat use a fraction B of the total computrons involved (e.g. B = 0.5 for a loop that involves only vat-timer and the contract vat, and where both vats do equal amounts of work), then this will kill the contract vat after A / B blocks. On the other side of the scale, a contract vat which just needs a lot of time to finish its job will get at least A blocks worth of CPU without the threat of termination.

When we move to a more sophisticated metering scheme, we'll remove the resetAllMeters call, and find some other way to provide computrons to the contracts that were relying upon that source.

rejected plans

Some approaches that came up but were too complicated:

renewing only a subset of meters, to create different classes of contract vats
having zoe apply a label to each meter, to indicate which subset should be enrolled in this host-driven renewal
using a new device to let the host decide which meters to reset

Security Considerations

This increases the consumption ability of new contracts, since previously all contract vats were unmetered, so we did not enforce a per-crank delivery limit. However it also removes the critical-vat flag from new contracts vats, removing their ability to panic the kernel (and thus halt the chain).

The design described above assumes that we're no longer creation any critical/unmetered contract vats. If we want to retain that ability, we'll need some option passed into zoe.startInstance(), and some authority mechanism to gate access to it (similar to the criticalVatFlag used to enable the creation of a vat that will panic the kernel if it terminates).

Deployment Considerations

This requires changes to:

kernel (adding the controller.resetAllMeters() API)
cosmic-swingset (calling resetAllMeters())
Zoe (adding Meters to each contract vat, removing the critical-vat flag, possibly adding a new authority mechanism)

We can deploy the kernel and cosmic-swingset changes at the same time, in a chain-software upgrade (governance proposal, vote, chain halt, software upgrade, chain resumption).

To deploy the Zoe change, we'll need a core-eval proposal which uses bootstrap's retained vat-zoe vatAdminFacet to perform a .upgrade() to a new Zoe bundle (which must be pre-installed on the chain with a validateAndInstallBundle txn). We're still figuring out how to do this best: we might arrange for the chain software upgrade to inject this core-eval action as the first event after the upgrade. Note that if the run-queue was not empty before the chain upgrade, this zoe upgrade may not happen right away (especially if the chain is already in a fast-loop situation, which would require special surgery to deal with).

There are no problems if the kernel/cosmic-swingset change happens first, and the zoe change doesn't happen for a while. In that situation, cosmic-swingset would reset all meters on every block, but there wouldn't be any meters to change, so nothing special would happen.

If somehow the zoe upgrade happens first, then we'd have metered contract vats with nothing to refill their meters. These vats would run for a while, until whatever initial capacity was used up, then they would terminate. It may be best to have zoe create UnlimitedMeters, instead of meters with a fixed capacity. These contracts vats would remain unlimited until the chain-upgrade happened, after which the "unlimited" value would be reduced to the 10*block value (also, if we do that, then we don't have to hard-code a particular value into Zoe, nor add a mechanism to change it later).

Unfortunately UnlimitedMeters are not currently prepared to have their capacity changed. To fix that, we need to upgrade vat-vat-admin to change the implementation. That will require a controller.upgradeStaticVat(), which requires new code in cosmic-swingset (because core-eval, running inside a vat, cannot reach the controller object that controls the kernel itself).

So the total upgrade sequence might be:

use txn to install new zoe bundle to the chain
use txn to install new vat-vat-admin bundle to the chain
use chain-upgrade to deploy new cosmic-swingset and kernel code
- the upgrade handler instructs cosmic-swingset to call controller.upgradeStaticVat(), to change vat-vat-admin
- then it delivers a core-eval message to call E(vatInfo['zoe'].adminFacet).upgrade() with the just-installed bundle

Scaling Considerations

resetAllMeters would iterate through all kvStore keys that match m\d+, using a prefix/range-based getNextKey loop. In the long run, we would probably want a faster way to find the appropriate subset of meters to change. But, we'll probably remove this feature (in favor of something more sophisticated) before that point.

Test Plan

Kernel unit tests for controller.resetAllMeters().

I don't know how to test the upgrade handlers.

We should add a new test (not sure where it should live) of trunk code surviving a contract vat that goes into a fast cross-vat loop, e.g. after the third offer it executes.

Oh, note that we can use a regular meter (instead of an UnlimitedMeter) with a very large capacity, and it'd be almost as good as an UnlimitedMeter, and wouldn't require a vat-vat-admin upgrade. The capacity we pick just needs to be large enough to handle the startVat and contract prepare() crank (which can be pretty big).

I was thinking of using regular meters in the first place.

For the prepare call, the biggest issue is the prepare that is executed by Zoe on first start, as it's not really possible for swingset to differentiate this one and make it special. The startVat delivery in a restart/upgrade can easily be not accounted against the meter object (just the single delivery computron limit).

I looked at the mainnet chain today, and found that all contracts vats are already non-critical. @dckc pointed out that we don't even give criticalVatKey to zoe. The only critical vats are:

v1-bootstrap
v6-agoricNames
v7-board
v8-priceAuthority
v9-zoe
v10-bridge
v11-provisioning
v14-bank

Also, we're currently only launching new contracts via a CORE_EVAL proposal, which evaluates code snippets in a scope that includes access to Zoe (and optionally vatAdminService). So we could have each proposal choose an initial meter amount, create the Meter with a call to vatAdminService.createMeter(), and pass it to Zoe. That keeps the choice of initial value out of Zoe (so we don't need a new parameter/knob to adjust it).

We still need the Zoe API to be able to pass through a Meter, and we should probably hold on to the new Meter somehow so we can do something with it later (once we stop auto-refilling them).

I looked at the mainnet slogfiles to find the largest deliveries (computron count), to see if we'll have any problems with a 100Mc per-delivery limit. I found 55 deliveries with >50Mc, 45 with >90Mc, and 14 with >100Mc (the largest was 128Mc). If any metered contract vat were to spend that much in the new world this ticket creates, that vat would be terminated.

Most of these happened during the bootstrap block, but the exceptions are a startVat of a new voteCounter contract vat (not the create-vat, which evaluates ZCF, but the buildRootObject therein, which also does importBundle() of the vote counter contract). There were a dozen of these, and they all needed exactly 96_095_115c to launch.

The largest, 128_182_169c, was the startVat of the vaultFactory contract, which is probably our most complicated.

So, we need to make sure the per-delivery limit is larger than 128Mc:

add a vatKeeper.setSourceAndOptions() property to record the limit
default it to 200Mc
pull the limit from vatKeeper.getOptions() when creating the worker, pass it to options.meteringLimit() when calling xsnap()
never use the xsnap.js DEFAULT_CRANK_METERING_LIMIT
- it depends on the version of xsnap.js we're using
- we want it to be consistent for each vat
- the only time we should update it is during vat upgrade, so that reboot-time replays always use the same value
- also DEFAULT_CRANK_METERING_LIMIT is only 100Mc now anyways

Our per-block limit (the runPolicy that cosmic-swingset supplies) is 65Mc. We haven't done a calibration for a while, but I think we picked that value to target a 5-ish second block execution time. If computrons were linear, this would suggest that voteCounter contract startup could take about 10s to execute, however on my follower node, it was more like 670ms.

My plan is to augment E(vatAdminService).createMeter() from taking only (remaining, threshold) to taking (remaining, threshold, deliveryLimit). This limit will be applied when creating the vat worker (and, like all metering, is only relevant when the worker is xsnap). The caller must decide what value to use (and undefined means "do not limit each delivery", just as if there were no meter), but given the data above, I'd recommend 200Mc.

To use this argument, vat-vat-admin must be upgraded, since the old version (in swingset 0.32.2 and earlier) does not pass it through to the vat. Providing a deliveryLimit to a stale VVA will be silently ignored. But I'll set up the backwards compatibility such that the ignored argument should be the only negative consequence of a stale VVA.

I'm confused, why do we need a new argument? Wouldn't the remaining value at meter creation be the one that applies to the vat creation?

Oh, interesting. That would certainly be a smaller change: store deliveryLimit, but don't give createMeter() a way to control it, instead always setting it equal to the initial remaining value. Basically telling the vat "don't spend it all in one place".

(we need to know deliveryLimit every time we launch a worker, because it is passed as the -l $LIMIT command-line argument to xsnap, and it cannot change during a single incarnation, to be consistent when replaying the current transcript span. #2980 would let us change the limit more dynamically, and would probably add deliveryLimit to the transcript entries to achieve that consistency, but will require more interesting changes to the C code)

There's certainly no utility in setting deliveryLimit higher than the initial remaining: a long delivery which survived the -l $LIMIT would then terminate the vat a moment later once the kernel decrements the meter. But I'd assumed that callers would usually set it significantly lower than remaining, like maybe remaining / 10. or / 100.

The original idea around Meters was that they'd suffice for weeks of normal operation, and contract instance owners would get notified if remaining dropped below a threshold that gave them a few days to "top off" their Meter (and if they set their prices correctly, the incoming fees would constantly increment their Meter, and the only manual intervention would be to drain off the profit every once in a while).

But.. if meter.remaining exists to protect other contracts against an overly-busy one, deliveryLimit exists to protect the chain itself (specifically the block time and consensus dynamics) against a contract's overly-busy single delivery. So they're serving slightly different purposes, and their values represent slightly different authorities. We're ok with contract owners buying more meter.remaining (up to some limit, as we figure out fairness of scheduling). but we are not ok with them buying more deliveryLimit, at least not very much, because a long single delivery will inflate a block time and can cause slower validators to miss blocks in a way that we don't want. meter.remaining is contract vs contract, deliveryLimit is contract vs validators.

So whatever contract-owner -facing controls we're going to put on Meters, we aren't going to let them pick their own deliveryLimit. That value will be picked by something in charge of protecting the chain, the same thing choosing the runPolicy limit, which is trying to avoid more than 5-10s of execution per block.

If we were being strict, the runPolicy block limit would be the same as deliveryLimit, which would basically limit blocks to at most 2 * runPolicy (assume the last crank is issued just before runPolicy is met, and it spends its whole deliveryLimit). But the data above shows that the initial startVat for a complex contract like vaultFactory costs a lot, 128Mc, and our current runPolicy block limit is just 65Mc.

Which means we have to set deliveryLimit much higher than runPolicy just to allow contracts to start at all, and we have to tolerate contract creation inflating the block time (it doesn't happen very often, at least for now). (of course, computrons are regrettably not linear, which confuses everything)

If that were also the initial value of remaining, we'd need a pattern in which instance owners immediately added more capacity to their Meter, to give them more time to react.

Effectively, :

async function createMeter(remaining, threshold, deliveryLimit) {
  return E(vatAdminService).createMeter(remaining, threshold, deliveryLimit);
}

would be implemented as:

async function createMeter(remaining, threshold, deliveryLimit) {
  assert(remaining > deliveryLimit);
  const meter = await E(vatAdminService).createMeter(remaining, threshold);
  await E(meter).addRemaining(remaining - deliveryLimit);
  return meter;
}

In the long run, contract owners will not be creating their own Meters, they'll have a meter imposed upon them by the contract-launching service, so this pattern could be hidden by that service.

I guess it comes down to this: we could avoid an API change (and a consequential need to upgrade vat-vat-admin) by making the use pattern slightly weird. The simplest way to call it, calling just createMeter() and not immediately addRemaining, would give you a vat that would just barely survive the initial startVat. So we'd be imposing complexity upon the caller for the sake of avoiding imposing complexity upon the operator (the cost of deploying a VVA upgrade).

Not the worst tradeoff to make, but probably one to be deliberate about.

To be sure I follow, today the deliveryLimit is always DEFAULT_CRANK_METERING_LIMIT, right? There is no API to control this value. For the problem at hand, we don't need any vat side API for this limit, just something set and recorded per vat, at vat creation time, right?

Yes in a perfect world, this limit would be dynamic, so that we could have a high limit for the initial delivery (and maybe for some deliveries like BOYD), and a more reasonable limit for normal deliveries, as you say likely of the same magnitude as the per block limit. However I don't believe we need this yet.

As far as I can tell, the per delivery limit can remain predefined (but recorded in the vat options), and we don't need to introduce any changes that would require a vat vatAdmin upgrade for this feature. The only vat that should require an upgrade is Zoe to accept a meter when instantiating a vat.

Correct, in current trunk, we either tell xsnap() to use { meteringLimit: 0 } or { meteringLimit: undefined }. The former means "each delivery is unlimited", and happens when there is no MeterID associated with the vat. The latter means "use DEFAULT_CRANK_METERING_LIMIT", and happens when there is a MeterID associated with the vat. We have no way to choose a non-default limit.

The biggest win of a dynamic limit would be to set it equal to min(CRANK_METERING_LIMIT, meter.remaining), so the worker gets terminated slightly earlier (the kernel would never observe a meter being decremented into underflow, it would only observe a worker self-terminating as it hits the newly-reduced delivery limit). Not particularly important. But yeah we might be nice and allow startVat more time than usual, especially given how many computrons are being spent doing lockdown and creating liveslots, neither of which is under the control of the contract code.

The main reason I wanted the createMeter() call to decide what deliveryLimit should be is so that the kernel (and VVA) don't need to own that policy themselves. If we pick a number today, and embed it in the kernel, we need a way to adjust that parameter later, when we discover that it isn't enough, or is too much. To be sure, baking it into the kernel is slightly better than the current situation, where it's effectively baked into xsnap.js. But making each CORE_EVAL snippet choose a value seems best, and that means putting it on the Meter object, which means either 1: adding an argument to createMeter() to set it (requires an upgrade), 2: adding a method to Meter to set it (requires an upgrade and implies that you can change it after creation, which would imply a dynamic limit, which we don't have), or 3: somehow computing it from existing arguments (no upgrade, but a slightly weird API).

Having the delivery limit from from the Meter felt the most natural, since limited-vs-not-limited currently depends on the presence/absence of a meter, and applying a long-term Meter but not limiting each delivery doesn't make a lot of sense. Having it live in vatOptions is a possibility, but then we either need the kernel to pick the value, or add a property to DynamicVatOptions (via createVat() and the convertOptions() function), which means upgrades of VVA and device-vat-admin to pass the option through.

So the core options are:

1: kernel chooses the value, via some default we bake in, probably adjustable via controller.changeKernelOptions()
- does not require VVA upgrade
2: Meter contains the value, computed from initial remaining value
- does not require VVA upgrade, but API is kinda weird looking
3: Meter contains the value, passed via new (positional) deliveryLimit argument to createMeter()
- requires VVA upgrade, API looks better
4: same but deliveryLimit is optional, if not provided then compute from initial remaining
- if VVA is upgraded, userspace can use a better API
- if not, userspace can still control the value, but must do more work
- but: userspace cannot tell whether its third argument will be accepted or ignored, which is its own kind of weird

My vote is for 1) as I believe it satisfies the requirements for this issue and does not prevent switching to something like 3) in the future.

@mhofman : https://github.com/Agoric/agoric-sdk/tree/7938-reset-all-meters has a start on controller.resetAllMeters()

From my pismo replay slog (using the version of XS in pismoC + patch to treat weakref strong), I have the following delivery results:

compute crankNum vatID
106568007 1769798 v47
106568007 1769796 v46
106568007 1186 v27
106568007 1184 v26
106568007 1182 v25
106568007 1180 v24
106234480 1266 v28
98738573 821 v18
98737422 307 v15
94086567 203 v14
92650555 981 v22
92650555 979 v21
92650555 977 v20
92650555 975 v19
92650555 1752847 v45
92650555 1752845 v44
92650555 1178 v23
91707677 5774425 v64
91707677 5774247 v63
91707677 5774063 v62

So clearly 200M computrons is enough for a deliveryLimit.

I think we've settled on a plan:

the kernel tracks a deliveryLimit value somewhere in kernelKeeper
- the initial value is set by config.deliveryLimit, which defaults to undefined (which means "unlimited")
- controller.changeKernelOption({ deliveryLimit }) can change it later
each vat gets a vat.options.deliveryLimit, copied from the kernel's current value during createVat and upgradeVat, but only if vat.options.meterID indicates that the vat is metered
- if the vat is unmetered, vat.options.deliveryLimit = undefined, so the vat is allowed unlimited computrons for each delivery
- if the vat is metered, but the kernel's deliveryLimit is undefined, the deliveries will not be limited
the vat worker gets a -l LIMIT argument derived from vat.options.deliveryLimit, with suitable values to not impose a limit on deliveries if deliveryLimit === undefined
- (I think xsnap.js wants meterLimit: 0 to signal this, it's something surprising)

This means:

all existing mainnet vats will remain unmetered and unlimited
I'm not worried about vat-zoe or vat-wallet or vat-vaultFactory being terminated due to surprisingly large deliveries
the kernel does not need a hard-coded deliveryLimit: the default is unlimited
- (I want the kernel to enforce policy, but not decide it, and any guess we make for deliveryLimit will probably be wrong)

As @mhofman recommended, we can avoid tying Meters and deliveryLimit too closely: presence of a Meter will control whether a deliveryLimit is applied or not, but we aren't adding Meter.deliveryLimit or makeMeter(.., deliveryLimit), at least not now. Hopefully the right approach will become more obvious as this feature continues to mature.

The contract vat will be terminated, zoe will do it's normal cleanup (returning escrowed funds).

We really need to think whether that is sufficient. The terminated contract may have generated assets of its own, which would no longer be reachable and this become worthless. We likely need to implement #3528 to be safe against all runaway prevention actions.

Agoric / agoric-sdk