Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform

block scheduling algorithm #2319

Closed warner closed 2 years ago

warner commented 3 years ago

#2299 was to impose a basic limit (any limit) on the number of cranks done in a single block, so that a simple two-vat eternal ping-pong doesn't kill the entire machine. That is easy to implement, but nowhere near sophisticated enough for the real world. This ticket is about building something better.

Definition of the problem

Firstly, let's make a distinction between scheduling, escalators, and metering:

In a chain-based SwingSet machine, the host application (our cosmic-swingset layer) will do two things to the SwingSet instance it contains: deliver input calls (injecting messages from the outside world), and tell it to execute cranks.

The input calls are generally triggered by signed transactions arriving from the outside world. The crank execution must happen autonomously, even when no transactions have been received (to trigger timed events, or process backlogged work).

Cosmos-SDK and Tendermint provide the "application" (cosmic-swingset, in our case) with a sequence of calls: BeginBlock, DeliverTx (repeated zero or more times), and EndBlock. The application can do whatever it wants during each of these calls. However, it has a limited amount of time to work: if the total of BeginBlock + (sum of DeliverTxs) + EndBlock takes too long, the block proposer will be late with its proposal, and/or the validators will be late with their votes, and the other chain nodes may think the tardy node is misbehaving (and either slash it or elect a new leader, when really they should patiently wait a bit longer). If we're using a 5-second block time, the nodes have perhaps 4 seconds to do computation, but the actual value will depend a lot on network conditions, validator coordination time, the number of staking nodes, etc.

Our initial plan is to have BeginBlock invoke the timer device (causing any timer callback messages to get queued), DeliverTx invoke the CapTP or IBC bridge device (causing more messages to get queued), and then process as many cranks as we can during EndBlock. We might rearrange this to perform most of the crank work during BeginBlock, and only accept work-adding DeliverTx transactions if we can afford the additional load, following the load-shedding principle of "finish your work before committing to more". It seems unlikely, but we could even decide to do some amount of crank work during each DeliverTx, if we wanted to achieve some measure of fairness between new work and existing work. We'll have to model this under different loading situations to figure out the best approach.
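A rough sketch of that initial ordering, with placeholder function names (none of these are the real cosmic-swingset or device APIs), might look like:

```js
// Illustrative sketch only: timerPoll, bridgeInbound, and
// runUntilBudgetExhausted are placeholders, not real APIs.
function beginBlock(blockTime) {
  // poll the timer device, queueing any timer callback messages that are due
  timerPoll(blockTime);
}

function deliverTx(tx) {
  // route the signed transaction into the CapTP/IBC bridge device; this only
  // queues more cross-machine messages, it does not execute any cranks
  bridgeInbound(tx.sender, tx.body);
}

async function endBlock() {
  // process as many cranks as the (deterministic) work budget allows
  await runUntilBudgetExhausted();
}
```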

We currently believe (but need to verify with Tendermint/Cosmos-SDK experts) that the basic cycle of each staking node is:

This allows transaction processing to be pipelined, which gives everybody twice as much time to work with, and doubles the utilization of the nodes, at the expense of doubling latency. It also means that the application doesn't get much say in what transactions are included in each block: Tendermint makes that decision based upon the claimed gas usage and the max-gas-per-block limit. If Tendermint decides to include a transaction, but the application later decides that it can't afford to process it, then the application's only recourse is to reject it, and hope that the sender will try again later. That is slightly more expensive/traumatic than if we could simply leave the txn in the mempool until the system got caught up enough to accept it.

Since these transactions are merely introducing cross-machine messages, not actual crank-processing work, this is a much smaller concern than the question of how many cranks to execute during each block.

How many cranks to run

Swingset executes atomic cranks, so a limited block processing time means we need to limit the number of cranks we execute in any given block. Ideally we would process cranks until the wallclock time reached the notional 4-second limit, but to maintain deterministic execution, the application is prohibited from using wallclock time.

However the number is decided, all nodes must run the same number of cranks. We don't have to pick the number ahead of time: each node could keep running cranks until some (deterministic) value is depleted below some shared threshold, and we'd still maintain the "all nodes behave identically" property. We don't need to decide ahead of time exactly which messages we will process: we can rely upon the (deterministic) escalator algorithm to tell us what to run next, and keep looping until that other (deterministic) thing tells us to stop. Finally, it's probably ok if we run over our budget slightly: we don't need to predict the exact amount of time a crank will take before we commit to executing it, as long as it can't take so long that validators get slashed.
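A minimal sketch of that loop, assuming a hypothetical `runCrank()` that returns a deterministic cost and a `runQueueHasWork()` predicate:

```js
// Deterministic crank loop: every node sees the same budget, the same queue,
// and the same per-crank costs, so every node stops at the same point.
async function runBlock(computronBudget) {
  let remaining = computronBudget;
  while (remaining > 0 && runQueueHasWork()) {
    // We commit to the crank before knowing its exact cost, so we may
    // overshoot slightly; that is acceptable as long as the per-crank meter
    // bounds the overshoot.
    const { computrons } = await runCrank();
    remaining -= computrons;
  }
  return { remaining, moreWork: runQueueHasWork() };
}
```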

Some cranks will finish in an instant. Some could take a long amount of time. We can use metering limits to put a rough bound on the time a crank will take, but it won't be measured in seconds. We may hope for a vaguely linear-ish relationship between metering units and seconds, but there will be all sorts of exceptions, and adversarial code will be able to provoke many of them.

If we get partway through a crank and then decide we need to abandon it (to meet our time budget), it may be pretty expensive to restart that crank. (This will get cheaper when we're using XS snapshots for most vats, but we're still talking about several milliseconds). It will be preferable to avoid starting the crank in the first place.

The consequence of running the wrong number of cranks

If we guess too low, we'll finish processing our selected set of work before the time budget is used up. This leaves CPU time on the table, unutilized, and leaves more work in the run-queue than strictly necessary. The overall work rate of the chain will be lower than it could be.

If we guess too high, we'll be late to produce a block, or to approve a proposed block. If we're only a little late, it's no big deal: blocks will appear slightly slower than the target rate (perhaps once every 6 seconds instead of once every 5). If we're too late, validators will give up on the proposer, or on other validators, and eventually nodes will get slashed because it looks like they aren't doing the job they agreed to perform.

What we can react to

We're limited in what is allowed to influence the application behavior: the apphash must be a deterministic function of the previous application state and the list of transactions for each block. The only degree of freedom is in the transactions selected by the block proposer, and from what we know, this is decided by the Tendermint code, not Cosmos-SDK. If it were easier to influence this, we could maybe have the swingset code submit a special extra transaction which says how many cranks to run (based on some less-deterministic simulation or duplicate execution), which then provides a deterministic input to all other nodes.

I think the per-block gas limit is a configuration parameter that might be influenceable by validators. The corresponding mechanism in Ethereum is a proposed gas limit value in each mined block which the miner gets to choose, and some sort of averaging over the last few hundred blocks to produce a new enforced gas limit. The idea is to let this be adaptive, but not allow any one miner to commit the entire chain to doing too much work. I don't know exactly how Cosmos does this, but I wouldn't be surprised if there is room for something similar.

A vague idea

If our vat workers can return a deterministic "computrons used" count for each crank (measuring basic blocks entered, branches taken, number of objects allocated, etc), and can enforce a limit on this count (terminating computation soon after the count goes above a previously-declared value), then we can use that as a proxy for wallclock time. If this counter is vaguely proportional to time spent, then an adaptive mechanism might allow us to increase the allowed amount until we reach our target processing time and CPU utilization goals.

I'll use "computrons" to distinguish this from the "gas" unit as used by Tendermint and Cosmos-SDK. It serves the same purpose, and is defining the same kind of resource usage, but I want to avoid confusion. When Tendermint decides to include a signed transaction in a given block, that transaction has a claimed gas usage, and Tendermint is comparing that against a configured per-block total gas limit. But in our system, the execution of that transaction may trigger an entirely unrelated amount of computation, at some hard-to-predict time in the future.

The transaction might be merely acknowledging some earlier message, with nearly zero cost. Or, it might be a price update from a remote chain, which could trigger a great deal of activity that the sender should not be asked to anticipate or pay for. The signer of that transaction may have to pay some gas fees for the right to have their transaction included in a block. Those fees are unrelated to the Escalators and Meters used to decide which cranks get run and how to manage their execution costs.

So the vague idea would be:

When swingset returns control to cosmic-swingset, report the "remaining work budget" value, as well as a flag that says whether there is still work to be done on the escalators (or if they're empty). Cosmic-swingset looks at these values, the actual wallclock time elapsed, and a configured time budget, and decides whether it thinks we could afford to do more work next time, or if we need to do less. If we drained the work queue and have time left over, we should increase the limit. If we exceeded the time budget, we should decrease the limit. The amount of remaining work budget (computrons) and the amount of remaining time might factor into how much we should raise or lower the limit. Some sort of PID control-theory math might be applicable.
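A minimal sketch of that feedback step follows; the thresholds and the proportional rule are illustrative placeholders, not a worked-out controller:

```js
// Non-deterministic, node-local proposal for the next block's limit,
// based on how much wallclock time this block actually used.
function proposeNewLimit({ currentLimit, elapsedMs, budgetMs, queueDrained }) {
  if (elapsedMs > budgetMs) {
    // Over the time budget: shrink the limit proportionally to the overrun.
    return Math.floor(currentLimit * (budgetMs / elapsedMs));
  }
  if (queueDrained && elapsedMs < 0.8 * budgetMs) {
    // Under budget with the queue drained: grow the limit a little.
    return Math.floor(currentLimit * 1.05);
  }
  return currentLimit; // close enough, leave it alone
}
```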

This (seriously non-deterministic) decision is fed into some cosmos-sdk module we write which has access to the node's signing key. The node creates a signed transaction which says "node X votes to change the computron limit to Y". Maybe we only have the block proposer emit this transaction. Maybe we only emit this transaction every N blocks, to limit the overhead.

These voting transactions are eventually collected into some block and executed. When they are executed, they're routed to a special cosmos-sdk module that merges the votes, finds some median, and decides (deterministically) on a new limit. This limit is then communicated into cosmic-swingset, which uses it for subsequent controller.run(workLimit) calls.
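A sketch of the deterministic merge, assuming each vote is simply a proposed computron limit:

```js
// Deterministic vote merging: take the median of the proposed limits so no
// single validator can drag the limit arbitrarily far in one step.
function mergeLimitVotes(votes, currentLimit) {
  if (votes.length === 0) {
    return currentLimit;
  }
  const sorted = [...votes].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}
```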

Maybe every vote changes the limit by a small amount, and we rely upon long-term averages. This has the same goals as the Ethereum gas limit process: no single miner should be able to commit the rest of the chain to egregious amounts of work, but there should be some way for the chain to adapt to the actual relationship between "gas" and wallclock time, to increase CPU utilization up to a comfortable margin below the block time.

(having finally written all of that up, I think it overlaps a lot with @michaelfig's comment from the earlier ticket, copied below, with perhaps more examination of how to make it adaptive)

warner commented 3 years ago

Incidentally, I asked Zaki about how Cosmos-SDK/Tendermint chooses which txns from the mempool to include in each block. He said each block proposer has a different mempool (depending upon what it happens to have heard from the gossip protocol and when). The proposer pulls txns in FIFO order (arrival time) from that mempool, adding up each txn's claimed gas usage value, until the total gas usage reaches the per-block limit, then stops. (We weren't sure about the fencepost question: does it come in just under or just over the gas limit?).
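As an illustration of that selection rule (choosing the "just under" answer to the fencepost question, which is an assumption):

```js
// FIFO selection by claimed gas: pull txns in arrival order, accumulating
// claimed gas until adding the next txn would exceed the per-block limit.
function selectTxns(mempool, maxBlockGas) {
  const selected = [];
  let gasTotal = 0;
  for (const tx of mempool) { // mempool is in arrival (FIFO) order
    if (gasTotal + tx.claimedGas > maxBlockGas) {
      break; // stop just under the limit (fencepost assumption)
    }
    gasTotal += tx.claimedGas;
    selected.push(tx);
  }
  return selected;
}
```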

This doesn't affect our scheduler question much, if at all, since most txns are IO messages that just add work to the run-queue/escalators, which is basically free until the queue is so full that we should reject all txns until we get a chance to clear the backlog.

warner commented 3 years ago

In #2299, @michaelfig said:

My suggestion for a scheduling algorithm would be:

  1. Start with a block meter budget of (n + 1) * m, where m is our maximum meter per crank and n is the minimum number of cranks per block. m and n must be chosen so that the block meter budget limits the worst-case wallclock time to our tolerance.
  2. Run a crank with a vat meter of m. The vat is terminated if it consumes its entire meter.
  3. Deduct the actual meter amount that the crank took from the block budget.
  4. If the kernel run queue is nonempty and the remaining block meter budget is more than m, go back to step 2
  5. Commit the block

Originally posted by @michaelfig in https://github.com/Agoric/agoric-sdk/issues/2299#issuecomment-770268347
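A sketch of that algorithm as a loop, assuming a hypothetical `runCrank(m)` that enforces the per-crank meter (terminating the vat if it is exhausted) and reports actual usage:

```js
// Sketch of the quoted algorithm. `m` is the per-crank meter, `n` the
// minimum number of cranks per block; runQueueIsNonEmpty, runCrank, and
// commitBlock are assumed kernel/host hooks.
async function runBlockWithMeter(n, m) {
  let blockBudget = (n + 1) * m;                    // step 1
  // For n >= 1 the first iteration always runs, matching steps 2-4.
  while (runQueueIsNonEmpty() && blockBudget > m) { // step 4 condition
    const { meterUsed } = await runCrank(m);        // step 2: vat meter of m
    blockBudget -= meterUsed;                       // step 3: deduct actual usage
  }
  await commitBlock();                              // step 5
}
```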

warner commented 3 years ago

Right, a good starting point would be to configure a static computrons-per-block value, and just figure out how to stop doing cranks early enough to not exceed that limit. If that limit is a chain-level config parameter, then maybe there's an easy-ish way for a governance vote to raise it, without any fancier automatic adaptive adjustments.

So what we need from our vat workers is a deterministic "computrons used" count reported for each crank, along with the ability to terminate a computation that exceeds its declared limit.

I've added #2322 to track that task.

warner commented 3 years ago

@mhofman once we get some data from the existing testnet logs (https://github.com/Agoric/testnet-postmortem-data) and the #3107 daily perf test, we'll be looking to choose a threshold value (in computrons) that gives us a decent chance of using up most of our block budget without too high a chance of overruns. Then the next step will be to change our main block loop to stop pulling cranks off the run-queue once we've reached this threshold.

Tartuffo commented 2 years ago

@warner After various meetings, this ended up with a MN-1 tag AND in the Product Backlog pipeline. Please modify appropriately for what it should actually be.

warner commented 2 years ago

cosmic-swingset currently provides a "run policy" object which tells the kernel to stop processing cranks after the combined computron usage grows above a particular threshold. This satisfies the needs of this ticket.

(The threshold is actually denominated in "beans", which are what the kernel uses to track costs. The run-policy is configured with a "bean price table" that defines a linear combination of computron count and vat creation, but should include more in the future, like memory allocation and message delivery size.)
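A sketch of a run policy in that spirit; the method names and shape here are assumptions for illustration, not the actual cosmic-swingset interface:

```js
// Bean-denominated run policy: accumulate beans per crank and per vat
// creation, and tell the kernel to stop once the block's bean budget is spent.
function makeBeanRunPolicy({ beansPerBlock, beansPerComputron, beansPerVatCreation }) {
  let totalBeans = 0n;
  const underLimit = () => totalBeans < beansPerBlock;
  return {
    // assumed hook: called after each delivery with its computron count
    crankComplete({ computrons = 0n }) {
      totalBeans += computrons * beansPerComputron;
      return underLimit(); // false tells the kernel to stop pulling cranks
    },
    // assumed hook: vat creation has its own bean price
    vatCreated() {
      totalBeans += beansPerVatCreation;
      return underLimit();
    },
  };
}
```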

The beans-per-block limit is configurable by governance, as is the beans-per-computron price. Each END_BLOCK call into the JS side of cosmic-swingset includes the currently-recorded price table, so anything that modifies that table will cause the following block to use the new thresholds. The default/initial setting is 8M computrons per block.