Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform

block scheduling algorithm #2319

Closed warner closed 2 years ago

warner commented 3 years ago

#2299 was to impose a basic limit (any limit) on the number of cranks done in a single block, so that a simple two-vat eternal ping-pong doesn't kill the entire machine. That is easy to implement, but nowhere near sophisticated enough for the real world. This ticket is about building something better.

Definition of the problem

Firstly, let's make a distinction between scheduling, escalators, and metering:

In a chain-based SwingSet machine, the host application (our cosmic-swingset layer) will do two things to the SwingSet instance it contains: deliver input calls (injecting messages from the outside world), and tell it to execute cranks.

The input calls are generally triggered by signed transactions arriving from the outside world. The crank execution must happen autonomously, even when no transactions have been received (to trigger timed events, or process backlogged work).

Cosmos-SDK and Tendermint provide the "application" (cosmic-swingset, in our case) with a sequence of calls: BeginBlock, DeliverTx (repeated zero or more times), and EndBlock. The application can do whatever it wants during each of these calls. However, it has a limited amount of time to work: if the total of BeginBlock + (sum of DeliverTxs) + EndBlock takes too long, the block proposer will be late with its proposal, and/or the validators will be late with their votes, and the other chain nodes may think the tardy node is misbehaving (and either slash it or elect a new leader, when really they should patiently wait a bit longer). If we're using a 5-second block time, the nodes have perhaps 4 seconds to do computation, but the actual value will depend a lot on network conditions, validator coordination time, the number of staking nodes, etc.

Our initial plan is to have BeginBlock invoke the timer device (causing any timer callback messages to get queued), DeliverTx invoke the CapTP or IBC bridge device (causing more messages to get queued), and then process as many cranks as we can during EndBlock. We might rearrange this to perform most of the crank work during BeginBlock, and only accept work-adding DeliverTx transactions if we can afford the additional load, following the load-shedding principle of "finish your work before committing to more". It seems unlikely, but we could even decide to do some amount of crank work during each DeliverTx, if we wanted to achieve some measure of fairness between new work and existing work. We'll have to model this under different loading situations to figure out the best approach.
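A rough sketch of that initial ordering, with placeholder function names (none of these are the real cosmic-swingset or device APIs), might look like:

```js
// Illustrative sketch only: timerPoll, bridgeInbound, and
// runUntilBudgetExhausted are placeholders, not real APIs.
function beginBlock(blockTime) {
  // poll the timer device, queueing any timer callback messages that are due
  timerPoll(blockTime);
}

function deliverTx(tx) {
  // route the signed transaction into the CapTP/IBC bridge device; this only
  // queues more cross-machine messages, it does not execute any cranks
  bridgeInbound(tx.sender, tx.body);
}

async function endBlock() {
  // process as many cranks as the (deterministic) work budget allows
  await runUntilBudgetExhausted();
}
```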

We currently believe (but need to verify with Tendermint/Cosmos-SDK experts) that the basic cycle of each staking node is:

This allows transaction processing to be pipelined, which gives everybody twice as much time to work with, and doubles the utilization of the nodes, at the expense of doubling latency. It also means that the application doesn't get much say in what transactions are included in each block: Tendermint makes that decision based upon the claimed gas usage and the max-gas-per-block limit. If Tendermint decides to include a transaction, but the application later decides that it can't afford to process it, then the application's only recourse is to reject it, and hope that the sender will try again later. That is slightly more expensive/traumatic than if we could simply leave the txn in the mempool until the system got caught up enough to accept it.

Since these transactions are merely introducing cross-machine messages, not actual crank-processing work, this is a much smaller concern than the question of how many cranks to execute during each block.

How many cranks to run

Swingset executes atomic cranks, so a limited block processing time means we need to limit the number of cranks we execute in any given block. Ideally we would process cranks until the wallclock time reached the notional 4-second limit, but to maintain deterministic execution, the application is prohibited from using wallclock time.

However the number is decided, all nodes must run the same number of cranks. We don't have to pick the number ahead of time: each node could keep running cranks until some (deterministic) value is depleted below some shared threshold, and we'd still maintain the "all nodes behave identically" property. We don't need to decide ahead of time exactly which messages we will process: we can rely upon the (deterministic) escalator algorithm to tell us what to run next, and keep looping until that other (deterministic) thing tells us to stop. Finally, it's probably ok if we run over our budget slightly: we don't need to predict the exact amount of time a crank will take before we commit to executing it, as long as it can't take so long that validators get slashed.
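A minimal sketch of that loop, assuming a hypothetical `runCrank()` that returns a deterministic cost and a `runQueueHasWork()` predicate:

```js
// Deterministic crank loop: every node sees the same budget, the same queue,
// and the same per-crank costs, so every node stops at the same point.
async function runBlock(computronBudget) {
  let remaining = computronBudget;
  while (remaining > 0 && runQueueHasWork()) {
    // We commit to the crank before knowing its exact cost, so we may
    // overshoot slightly; that is acceptable as long as the per-crank meter
    // bounds the overshoot.
    const { computrons } = await runCrank();
    remaining -= computrons;
  }
  return { remaining, moreWork: runQueueHasWork() };
}
```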

Some cranks will finish in an instant. Some could take a long amount of time. We can use metering limits to put a rough bound on the time a crank will take, but it won't be measured in seconds. We may hope for a vaguely linear-ish relationship between metering units and seconds, but there will be all sorts of exceptions, and adversarial code will be able to provoke many of them.

If we get partway through a crank and then decide we need to abandon it (to meet our time budget), it may be pretty expensive to restart that crank. (This will get cheaper when we're using XS snapshots for most vats, but we're still talking about several milliseconds). It will be preferable to avoid starting the crank in the first place.

The consequence of running the wrong number of cranks

If we guess too low, we'll finish processing our selected set of work before the time budget is used up. This leaves CPU time on the table, unutilized, and leaves more work in the run-queue than strictly necessary. The overall work rate of the chain will be lower than it could be.

If we guess too high, we'll be late to produce a block, or to approve a proposed block. If we're only a little late, it's no big deal: blocks will appear slightly slower than the target rate (perhaps once every 6 seconds instead of once every 5). If we're too late, validators will give up on the proposer, or on other validators, and eventually nodes will get slashed because it looks like they aren't doing the job they agreed to perform.

What we can react to

We're limited in what is allowed to influence the application behavior: the apphash must be a deterministic function of the previous application state and the list of transactions for each block. The only degree of freedom is in the transactions selected by the block proposer, and from what we know, this is decided by the Tendermint code, not Cosmos-SDK. If it were easier to influence this, we could maybe have the swingset code submit a special extra transaction which says how many cranks to run (based on some less-deterministic simulation or duplicate execution), which then provides a deterministic input to all other nodes.

I think the per-block gas limit is a configuration parameter that might be influenceable by validators. The corresponding mechanism in Ethereum is a proposed gas limit value in each mined block which the miner gets to choose, and some sort of averaging over the last few hundred blocks to produce a new enforced gas limit. The idea is to let this be adaptive, but not allow any one miner to commit the entire chain to doing too much work. I don't know exactly how Cosmos does this, but I wouldn't be surprised if there is room for something similar.

A vague idea

If our vat workers can return a deterministic "computrons used" count for each crank (measuring basic blocks entered, branches taken, number of objects allocated, etc), and can enforce a limit on this count (terminating computation soon after the count goes above a previously-declared value), then we can use that as a proxy for wallclock time. If this counter is vaguely proportional to time spent, then an adaptive mechanism might allow us to increase the allowed amount until we reach our target processing time and CPU utilization goals.

I'll use "computrons" to distinguish this from the "gas" unit as used by Tendermint and Cosmos-SDK. It serves the same purpose, and is defining the same kind of resource usage, but I want to avoid confusion. When Tendermint decides to include a signed transaction in a given block, that transaction has a claimed gas usage, and Tendermint is comparing that against a configured per-block total gas limit. But in our system, the execution of that transaction may trigger an entirely unrelated amount of computation, at some hard-to-predict time in the future.

The transaction might be merely acknowledging some earlier message, with nearly zero cost. Or, it might be a price update from a remote chain, which could trigger a great deal of activity that the sender should not be asked to anticipate or pay for. The signer of that transaction may have to pay some gas fees for the right to have their transaction included in a block. Those fees are unrelated to the Escalators and Meters used to decide which cranks get run and how to manage their execution costs.

So the vague idea would be:

When swingset returns control to cosmic-swingset, report the "remaining work budget" value, as well as a flag that says whether there is still work to be done on the escalators (or if they're empty). Cosmic-swingset looks at these values, the actual wallclock time elapsed, and a configured time budget, and decides whether it thinks we could afford to do more work next time, or if we need to do less. If we drained the work queue and have time left over, we should increase the limit. If we exceeded the time budget, we should decrease the limit. The amount of remaining work budget (computrons) and the amount of remaining time might factor into how much we should raise or lower the limit. Some sort of PID control-theory math might be applicable.
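A minimal sketch of that feedback step follows; the thresholds and the proportional rule are illustrative placeholders, not a worked-out controller:

```js
// Non-deterministic, node-local proposal for the next block's limit,
// based on how much wallclock time this block actually used.
function proposeNewLimit({ currentLimit, elapsedMs, budgetMs, queueDrained }) {
  if (elapsedMs > budgetMs) {
    // Over the time budget: shrink the limit proportionally to the overrun.
    return Math.floor(currentLimit * (budgetMs / elapsedMs));
  }
  if (queueDrained && elapsedMs < 0.8 * budgetMs) {
    // Under budget with the queue drained: grow the limit a little.
    return Math.floor(currentLimit * 1.05);
  }
  return currentLimit; // close enough, leave it alone
}
```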

This (seriously non-deterministic) decision is fed into some cosmos-sdk module we write which has access to the node's signing key. The node creates a signed transaction which says "node X votes to change the computron limit to Y". Maybe we only have the block proposer emit this transaction. Maybe we only emit this transaction every N blocks, to limit the overhead.

These voting transactions are eventually collected into some block and executed. When they are executed, they're routed to a special cosmos-sdk module that merges the votes, finds some median, and decides (deterministically) on a new limit. This limit is then communicated into cosmic-swingset, which uses it for subsequent controller.run(workLimit) calls.
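A sketch of the deterministic merge, assuming each vote is simply a proposed computron limit:

```js
// Deterministic vote merging: take the median of the proposed limits so no
// single validator can drag the limit arbitrarily far in one step.
function mergeLimitVotes(votes, currentLimit) {
  if (votes.length === 0) {
    return currentLimit;
  }
  const sorted = [...votes].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}
```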

Maybe every vote changes the limit by a small amount, and we rely upon long-term averages. This has the same goals as the Ethereum gas limit process: no single miner should be able to commit the rest of the chain to egregious amounts of work, but there should be some way for the chain to adapt to the actual relationship between "gas" and wallclock time, to increase CPU utilization up to a comfortable margin below the block time.

(having finally written all of that up, I think it overlaps a lot with @michaelfig's comment from the earlier ticket, copied below, with perhaps more examination of how to make it adaptive)

warner commented 3 years ago

Incidentally, I asked Zaki about how Cosmos-SDK/Tendermint chooses which txns from the mempool to include in each block. He said each block proposer has a different mempool (depending upon what it happens to have heard from the gossip protocol and when). The proposer pulls txns in FIFO order (arrival time) from that mempool, adding up each txn's claimed gas usage value, until the total gas usage reaches the per-block limit, then stops. (We weren't sure about the fencepost question: does it come in just under or just over the gas limit?).
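As an illustration of that selection rule (choosing the "just under" answer to the fencepost question, which is an assumption):

```js
// FIFO selection by claimed gas: pull txns in arrival order, accumulating
// claimed gas until adding the next txn would exceed the per-block limit.
function selectTxns(mempool, maxBlockGas) {
  const selected = [];
  let gasTotal = 0;
  for (const tx of mempool) { // mempool is in arrival (FIFO) order
    if (gasTotal + tx.claimedGas > maxBlockGas) {
      break; // stop just under the limit (fencepost assumption)
    }
    gasTotal += tx.claimedGas;
    selected.push(tx);
  }
  return selected;
}
```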

This doesn't affect our scheduler question much, if at all, since most txns are IO messages that just add work to the run-queue/escalators, which is basically free until the queue is so full that we should reject all txns until we get a chance to clear the backlog.

warner commented 3 years ago

In #2299, @michaelfig said:

My suggestion for a scheduling algorithm would be:

  1. Start with a block meter budget of (n + 1) * m, where m is our maximum meter per crank and n is the minimum number of cranks per block. m and n must be chosen so that the block meter budget limits the worst-case wallclock time to our tolerance.
  2. Run a crank with a vat meter of m. The vat is terminated if it consumes its entire meter.
  3. Deduct the actual meter amount that the crank took from the block budget.
  4. If the kernel run queue is nonempty and the remaining block meter budget is more than m, go back to step 2
  5. Commit the block

Originally posted by @michaelfig in https://github.com/Agoric/agoric-sdk/issues/2299#issuecomment-770268347
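A sketch of that algorithm as a loop, assuming a hypothetical `runCrank(m)` that enforces the per-crank meter (terminating the vat if it is exhausted) and reports actual usage:

```js
// Sketch of the quoted algorithm. `m` is the per-crank meter, `n` the
// minimum number of cranks per block; runQueueIsNonEmpty, runCrank, and
// commitBlock are assumed kernel/host hooks.
async function runBlockWithMeter(n, m) {
  let blockBudget = (n + 1) * m;                    // step 1
  // For n >= 1 the first iteration always runs, matching steps 2-4.
  while (runQueueIsNonEmpty() && blockBudget > m) { // step 4 condition
    const { meterUsed } = await runCrank(m);        // step 2: vat meter of m
    blockBudget -= meterUsed;                       // step 3: deduct actual usage
  }
  await commitBlock();                              // step 5
}
```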

warner commented 3 years ago

Right, a good starting point would be to configure a static computrons-per-block value, and just figure out how to stop doing cranks early enough to not exceed that limit. If that limit is a chain-level config parameter, then maybe there's an easy-ish way for a governance vote to raise it, without any fancier automatic adaptive adjustments.

So what we need from our vat workers is a deterministic "computrons used" count reported for each crank, along with the ability to terminate a computation that exceeds its declared limit.

I've added #2322 to track that task.

warner commented 3 years ago

@mhofman once we get some data from the existing testnet logs (https://github.com/Agoric/testnet-postmortem-data) and the #3107 daily perf test, we'll be looking to choose a threshold value (in computrons) that gives us a decent chance of using up most of our block budget without too high a chance of overruns. Then the next step will be to change our main block loop to stop pulling cranks off the run-queue once we've reached this threshold.

Tartuffo commented 2 years ago

@warner After various meetings, this ended up with a MN-1 tag AND in the Product Backlog pipeline. Please modify appropriately for what it should actually be.

warner commented 2 years ago

cosmic-swingset currently provides a "run policy" object which tells the kernel to stop processing cranks after the combined computron usage grows above a particular threshold. This satisfies the needs of this ticket.

(The threshold is actually denominated in "beans", which are what the kernel uses to track costs. The run-policy is configured with a "bean price table" that defines a linear combination of computron count and vat creation, but should include more in the future, like memory allocation and message delivery size.)
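A sketch of a run policy in that spirit; the method names and shape here are assumptions for illustration, not the actual cosmic-swingset interface:

```js
// Bean-denominated run policy: accumulate beans per crank and per vat
// creation, and tell the kernel to stop once the block's bean budget is spent.
function makeBeanRunPolicy({ beansPerBlock, beansPerComputron, beansPerVatCreation }) {
  let totalBeans = 0n;
  const underLimit = () => totalBeans < beansPerBlock;
  return {
    // assumed hook: called after each delivery with its computron count
    crankComplete({ computrons = 0n }) {
      totalBeans += computrons * beansPerComputron;
      return underLimit(); // false tells the kernel to stop pulling cranks
    },
    // assumed hook: vat creation has its own bean price
    vatCreated() {
      totalBeans += beansPerVatCreation;
      return underLimit();
    },
  };
}
```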

The beans-per-block limit is configurable by governance, as is the beans-per-computron price. Each END_BLOCK call into the JS side of cosmic-swingset includes the currently-recorded price table, so anything that modifies that table will cause the following block to use the new thresholds. The default/initial setting is 8M computrons per block.