metering support in vat workers (XS regex, ...)

warner commented 3 years ago

What is the Problem Being Solved?

As discussed in #2319, to figure out when we should end a block (ideally just before we reach the wallclock time budget), we need our vat workers to provide two facilities to the kernel:

when the kernel instructs a worker to execute a crank, the kernel will provide a limit (a number)
- as soon as the worker sees that the computation has exceeded this limit, the vat should be killed
- the worker should return a VatDeliveryResults object that indicates the vat died due to a meter overrun
- the kernel might later rebuild this vat from an earlier state (to resubmit the crank), or it might just be dead forever
- the kernel will generally provide a fairly generous limit: we really want to avoid retries, because they're expensive
when the worker finishes executing a crank, it will return the actual amount consumed (as a number)
- the VatDeliveryResults will include this number

The kernel will be able to add this number to the usage of all other cranks it has executed, and compare the result against a per-block limit.

We're hoping that this number will be roughly proportional to wallclock time. It is, of course, influenced by all sorts of other factors (CPU speed, complexity of the code being run, overhead, interference by other processes on that system, etc). We don't require the relationship to be deterministic, however the more stable/consistent it is, the higher we can raise the limit (improving utilization) while still avoiding occasional block timeouts on most validators.

We do require that the limits and amount-consumed results are deterministic. All workers (on all nodes) must yield the same values for a vat in the same state, processing the same crank. Any overhead that is not part of the deterministic input to the vat behavior must be excluded from the usage numbers.

Design

The VatDeliveryObject should be expanded to include a meter limit. The VatDeliveryResults should be expanded to include a consumption amount.

For now, the kernel should choose a hard-coded meter limit based upon the worker type. Workers that do not enforce metering (which is all of them right now) should ignore it. Workers which do not measure usage should return a consumption of undefined.

We'll need to think about the relationship between the kernel-process -side VatManager and whatever worker it manages, to figure out the right place to decide/deduct the usage numbers. This will eventually expand to involve Meters, stacked/backup meters, etc, all of which happens much closer to the kernel than to the worker.

Worker Types

Our main target is XS, where we expect the JS engine to support this feature. The xsnap supervisor will need to be told what the limit is, configure it into the engine somehow, run the crank to completion (or early termination), determine which case actually happened, retrieve the consumed (or remaining) value from the engine, and report the results back to the kernel.

For a local (Node.js in the kernel process) worker, we might adapt the existing Metering Transform to give us a similar number, and augment the local worker code to have a similar API. The actual units would be drastically different, of course (the same vat running in a local worker will consume different "gas" than if it were running on XS), but the basic principle would be the same. Or we might decide that it isn't worth the effort to try and limit local workers and put our energy solely into XS.

Security Considerations

If the metering is non-deterministic (slightly different versions of XS, perhaps), one validator might decide that it is done with a block earlier than others. This might be deterministically different: they get the same answer each time, just not an answer that matches the other validators. I'm not sure how Tendermint would react to this, but I think it could be bad: the minority faction could never replicate the proposed block, and would remain disjoint forever. They would eventually get slashed for not contributing, and the total validator set would grow smaller.

If the metering is random (e.g. it winds up depending upon uninitialized memory, or filesystem responses, or some platform-level thing that changes from one run to the next), it's probably worse: a validator that re-runs the disputed block could produce a different vote than before, which is equivocation and subject to more serious slashing. Validators who find themselves disagreeing with the proposed block because of a metering difference might wish to refrain from voting, and instead try restarting the process, to re-run the block from a different platform state, in the hopes of achieving a match (i.e. compare the number of cranks performed locally against something in the block, before deciding to vote no).

Small differences in the measured meter value might not cause the number of executed cranks to change: the probability depends upon the average crank size vs the amount of variation. The statistics would be similar to what you get by rounding a slightly random number and comparing the integer result across trials. This might reduce the effect of nondeterministic metering. If validators had a way to survive a low rate of disagreement (perhaps by taking a performance hit as they re-run the process and rebuild all their vats), this might still work.

Hmm, if we had such a technique to survive disagreement, we might want to intentionally introduce a bit of randomness, to avoid getting stuck in a "no matter how many times I retry, I keep getting the same (minority) result" hole. Curious.

Test Plan

We need test cases that populate a medium-sized vat, send it instructions to run a bunch of cranks, and then compare how many computrons each one used against a golden master. (We have no theory as to how to predict this value; we must simply run it once to find out the number, and then assert that all subsequent trials yield the same result).

cc @michaelfig @dckc @erights

dckc commented 3 years ago

how crazy is it to think we could use XS on cosmwasm for metering? https://github.com/Agoric/agoric-sdk/issues/2224#issuecomment-817320580

dckc commented 3 years ago

@michaelfig wrote up a proposal based on discussion with @phoddie .

Elaborating on the test plan, I gather we should test:

JSON.parse, stringify
regex matching (esp. REDOS exponential explosion)
Array.forEach, Map.forEach, Set.forEach, Array.map
- builtins calling builtins: [1,2,3].forEach(JSON.stringify)

I wonder if we should split this into smaller issue. Perhaps engine-level metering should be its own issue, separate from swingset?

for reference: @warner sent out an Agoric's Metering Needs for XS document March 5.

dckc commented 3 years ago

brainstorming more things to test with @michaelfig , prompted by May 8 XS metering updates ...

new Array(1e7) vs. new Array(1e7).forEach(Object.create) JSON.stringify(new Array(1e20))

long string compare more expensive than short string compare?

stringify circular data:

> a = []
[]
> a.push(a)
1
> JSON.stringify(a)
Uncaught TypeError: Converting circular structure to JSON
--> starting at object with constructor 'Array'
--- index 0 closes the circle
at JSON.stringify (<anonymous>)

TOCTOU: proxy vs. array... length changes

dckc commented 3 years ago

XS metering has a limit parameter; we have it set to 1, i.e. check every time whether the meter has tripped. I wonder what the performance impact is and whether we should set it to something higher.

escalated to #3381

note from a May 13 discussion with warner

Agoric / agoric-sdk