tools to pause chain/kernel/vats before security upgrades

warner commented 2 years ago

What is the Problem Being Solved?

Imagine we (Agoric) have just received disclosure of a significant security bug in some component of the running chain. How can we safely deploy a fix, without giving attackers time to exploit the problem?

The vulnerability might already be known to the attackers, and they've just been waiting for it to become worth exploiting (e.g. waiting for a liquidity pool to grow to a juicy size), so they may execute their attack as soon as they see/suspect a fix coming. Or they don't already know the problem, but can reverse-engineer it from the fix, and then perform the attack before the fix is fully deployed.

The core issue is the non-zero times between the defender's sequence (learning about a problem, fixing it, deploying the fix) and the attacker's sequence (learning about a problem, developing an exploit, executing the exploit). This problem exists in distributed systems of all shapes and sizes, but it's particularly exciting for decentralized systems, where there is no one party with the authority to make a change. The fix may involve changing some parameter within a contract, or upgrading the contract, or upgrading the entire chain. Some fixes require transactions sent into the chain (which must make their way through various public queues before execution, giving attackers an opportunity to front-run them, and exposing them to MEV threats from observant validators). Deeper fixes require coordinating the validator community to upgrade their software. And both kinds of fixes might be telegraphed by commits to an open-source code repository before they are ready to be deployed. All of these reveal significant information to the attackers, who may then be able to act before the fix is fully implemented.

A powerful tool to address this is the "snooze button". A small group can have the power to pause some or all of the chain's activity, giving a larger group time to develop and deploy a fix. Then, after the fix is deployed, the chain is resumed. The pause event can reveal the existence of a problem, but not the details, reducing the attacker's advantage. Only the attacker who already knew about a problem and was ready to execute their attack (and can race ahead) can react to the pause event.

Once paused, the defenders can work on the fix in public, or at least they can safely involve a larger group to test the fix and coordinate deployment. This reveals the details to the attackers, but by that point it is too late for them to exploit.

Users of our system care about liveness: knowing that their transactions can't be blocked forever (at least not without the approval of some larger governance committee). They care that this "snooze button" has a limited duration, perhaps a few days or a few weeks. But we can imagine various "sizes" of snooze buttons, with larger governance requirements over the longer-duration delays.

Categories of Attack, Categories of Fixes

We're imagining problems that affect components at various scales:

- a single contract has a problem, which could be addressed by changing some parameter
  - Pause: pause the contract vat, causing all inbound messages to be queued off to the side
  - Fix: allow a high-priority non-paused message to change the parameter
  - Resume: resume delivery from the side queue, then allow main-queue messages to arrive
- a single contract has a problem, which requires a complete vat/contract upgrade
  - Pause: pause the contract vat, queue all inbound messages off to the side
  - Fix: perform an upgrade of the vat (#3272)
  - Resume: resume delivery from the side queue, then allow main-queue messages to arrive
- a collection of contracts have a problem
  - Pause: the kernel stops servicing the low-priority queues (#3465), but allows high-priority messages so e.g. liquidation continues but new vault creation is paused
  - Fix: vat upgrade, parameter change
  - Resume: the kernel resumes servicing the low-priority queues
- the entire swingset kernel has a problem
  - Pause: the kernel stops servicing all queues
  - Fix: the kernel is upgraded
  - Resume: the kernel resumes servicing all queues
- one or more Cosmos-SDK modules have a problem
  - Pause: a governance/emergency-pause module tells those modules to reject all txns
  - Fix: a governance module modifies some parameter, or the validation software is upgraded
  - Resume: the governance/emergency-pause module tells those modules to start accepting txns again
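
To make the single-vat pause/fix/resume cycle above concrete, here is a minimal sketch of the side-queue idea. It is a toy model, not the swingset API: `makePausableVat`, `sendHighPriority`, and `queuedCount` are hypothetical names invented for illustration, and delivery is modelled as a plain synchronous callback.

```javascript
// Hypothetical model of one vat's inbound delivery; not the swingset API.
function makePausableVat(deliver) {
  let paused = false;
  const sideQueue = []; // messages held while paused, in arrival order

  return {
    pause() {
      paused = true;
    },
    // Normal (main-queue) delivery: diverted to the side queue while paused.
    // The length check also covers re-entrant sends made during resume().
    send(message) {
      if (paused || sideQueue.length > 0) {
        sideQueue.push(message);
      } else {
        deliver(message);
      }
    },
    // High-priority delivery bypasses the pause, e.g. a parameter change.
    sendHighPriority(message) {
      deliver(message);
    },
    // Resume: drain the side queue first so the original ordering is kept,
    // after which main-queue messages flow normally again.
    resume() {
      paused = false;
      while (sideQueue.length > 0) {
        deliver(sideQueue.shift());
      }
    },
    queuedCount: () => sideQueue.length,
  };
}
```

In this model the fix arrives through `sendHighPriority` while ordinary traffic waits, which matches the "high-priority non-paused message" in the first scenario.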

We also imagine fine-grained contract pauses, in which the contract consults a table of what activity should and should not be allowed at any given moment. The contract might reject method invocations when paused, or it might queue them internally. We can imagine contracts registering to hear about updates to the "emergency pause table", via high-priority update messages. In this approach:

Pause: use the bridge-device mechanism to send an update, wait for it to be delivered to the contract vat
Fix: send a message to the contract to change a parameter, or perhaps upgrade the vat entirely
Resume: update the table, wait for the vat to hear about the update

A similar "pause table" could be used at the Cosmos-SDK level, between Go modules, without using the bridge device.
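
A rough sketch of the contract-side pause table might look like the following. Everything here is an assumption made for illustration: the `makePausableContract` wrapper, the `'reject'` / `'queue'` policy names, and the `updatePauseTable` entry point (which stands in for the high-priority update message) are not part of any existing Agoric API.

```javascript
// Hypothetical "emergency pause table": maps a method name to a policy.
// 'reject' refuses the call, 'queue' holds it until the entry is cleared.
function makePausableContract(methods) {
  let pauseTable = new Map(); // e.g. openVault -> 'reject'
  const held = []; // calls queued internally while their method is paused

  return {
    // Stands in for the high-priority update message from the pause module.
    updatePauseTable(newTable) {
      pauseTable = new Map(Object.entries(newTable));
      // Replay held calls whose method is no longer paused, in order.
      const stillHeld = [];
      for (const call of held.splice(0)) {
        if (pauseTable.has(call.method)) {
          stillHeld.push(call);
        } else {
          methods[call.method](...call.args);
        }
      }
      held.push(...stillHeld);
    },
    invoke(method, ...args) {
      const policy = pauseTable.get(method);
      if (policy === 'reject') {
        throw new Error(`${method} is paused`);
      }
      if (policy === 'queue') {
        held.push({ method, args });
        return undefined;
      }
      return methods[method](...args);
    },
  };
}
```

Whether a paused method should reject or queue is a per-contract decision; rejecting gives callers immediate feedback, while queueing preserves their place in line at the cost of the contract holding state for them.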

Most of these pauses would be initiated by a Cosmos-SDK module, which reacts to a quorum of signed transactions from a small "security committee". This module would then change parameters, send bridge-device updates, and tell the Swingset module how/whether to interact with the kernel. For example, the Swingset module currently calls the swingset controller.run(runPolicy) method during END_BLOCK to perform a bounded amount of work (pulling from all queues in priority order). If the pause type was "stop servicing low-priority queues", this module would be instructed to instead call controller.run(runPolicy, { onlyServiceQueue: 'high' }) or similar. Timer and mailbox events would still be pushed onto the run-queue, but the low-priority consequences would not happen until the setting was changed.
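
As an illustration of the "only service the high-priority queue" idea, here is a toy two-queue controller. The `onlyServiceQueue` option is the one floated above and does not exist today; `makeToyController`, `makeCountingPolicy`, and the `allowMore()` hook are likewise invented for this sketch, and the real swingset run-policy interface is different.

```javascript
// Toy model only: the real swingset controller and run policies differ.
function makeToyController() {
  const queues = { high: [], low: [] };
  const delivered = [];

  return {
    push(priority, message) {
      queues[priority].push(message);
    },
    // runPolicy.allowMore() is a hypothetical hook bounding work per block.
    // options.onlyServiceQueue, if set, limits which queue is drained.
    run(runPolicy, options = {}) {
      const order = options.onlyServiceQueue
        ? [options.onlyServiceQueue]
        : ['high', 'low'];
      for (const name of order) {
        while (queues[name].length > 0 && runPolicy.allowMore()) {
          delivered.push(queues[name].shift());
        }
      }
    },
    delivered: () => [...delivered],
    pending: name => queues[name].length,
  };
}

// A policy that permits at most `limit` deliveries per run() call.
function makeCountingPolicy(limit) {
  let used = 0;
  return { allowMore: () => used++ < limit };
}

// While paused, each block would call:
//   controller.run(makeCountingPolicy(n), { onlyServiceQueue: 'high' });
// Low-priority work keeps accumulating but is not delivered.
```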

To maintain liveness, each of these pauses needs to be clearly time-bounded. The Cosmos-SDK module that receives the security committee txn needs to watch the block height and unpause everything when the pause expires. Additional votes (with a larger quorum requirement) might extend the pause if more time is necessary to develop/test/deploy the fix.
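
The expiry bookkeeping could be as simple as the sketch below, which auto-unpauses when the block height reaches a stored deadline. `makePauseTracker` and its method names are hypothetical, and a real Cosmos-SDK module would be written in Go and keep this in consensus state; this is only meant to show the shape of the logic.

```javascript
// Hypothetical bookkeeping for time-bounded pauses, keyed by block height.
function makePauseTracker() {
  const active = new Map(); // target -> expiry block height

  return {
    // Pause `target` for `durationBlocks` starting at `currentHeight`.
    pause(target, currentHeight, durationBlocks) {
      active.set(target, currentHeight + durationBlocks);
    },
    // Called once per block; returns the targets whose pause just expired
    // so the caller can unpause them.
    endBlock(currentHeight) {
      const expired = [];
      for (const [target, expiry] of active) {
        if (currentHeight >= expiry) {
          active.delete(target);
          expired.push(target);
        }
      }
      return expired;
    },
    isPaused: target => active.has(target),
  };
}
```

Because expiry is checked on every block, the pause ends even if the security committee never submits an explicit unpause txn, which is the liveness property users care about.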

Disclosure Timeline

We imagine a sequence like the following:

- security researcher notifies a member of the security team about a potential problem
- security team quietly investigates, concludes the problem is severe enough to warrant the snooze button
- security committee is quietly informed, convinced to snooze, signs the txn, submits the txn
  - prepared attacker learns about the upcoming pause, might try to race ahead and deploy attack
  - all attackers become aware of the service that is vulnerable, but not the nature of the vuln
- pause txn gets accepted into a block, activity is now paused
  - prepared attacker's race window ends
- security team develops the fix
  - might reveal the details by involving more people
  - might reveal the details by publishing a fix to version control
- security team tests the fix
- security team publishes the fix
  - definitely reveals the details
- for fixes that replace validator software:
  - validators examine/consider/test the fix
  - somebody submits a governance vote to implement the fix
  - vote passes
  - validators upgrade software, restart
  - activation block height arrives, fix deployed
- for fixes that don't:
  - governance/upgrade committee submits the fix txn to the chain
  - txn gets accepted into block, executed
  - fix deployed
- security committee decides fix is deployed, creates/signs the unpause txn, submits txn
- unpause txn is accepted into a block, executed
- activity resumes

If it looks like the pause window won't be enough, a larger security committee might have the authority to extend it. We'll need the pause events to have IDs so the txn that extends it can be easily matched to what is being extended.
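
One way to picture pause IDs and quorum-gated extensions is the sketch below. All names (`makePauseRegistry`, `pauseQuorum`, `extendQuorum`, the `pause-N` ID format) are made up for illustration, and signature verification is omitted entirely: `signers` is just a list of already-authenticated committee member names.

```javascript
// Hypothetical pause registry: every pause gets an ID so that a later
// extension txn can name exactly which pause it extends.
function makePauseRegistry({ pauseQuorum, extendQuorum }) {
  const pauses = new Map(); // id -> { target, expiryHeight }
  let nextId = 1;

  // Count distinct signers so one member cannot be counted twice.
  const countSigners = signers => new Set(signers).size;

  return {
    pause(target, expiryHeight, signers) {
      if (countSigners(signers) < pauseQuorum) {
        throw new Error('not enough signatures to pause');
      }
      const id = `pause-${nextId++}`;
      pauses.set(id, { target, expiryHeight });
      return id;
    },
    // Extending requires the larger quorum and must name an existing pause.
    extend(id, newExpiryHeight, signers) {
      const record = pauses.get(id);
      if (!record) {
        throw new Error(`unknown pause ${id}`);
      }
      if (countSigners(signers) < extendQuorum) {
        throw new Error('not enough signatures to extend');
      }
      if (newExpiryHeight <= record.expiryHeight) {
        throw new Error('extension must move the expiry later');
      }
      record.expiryHeight = newExpiryHeight;
    },
    get: id => pauses.get(id),
  };
}
```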

The pause event should probably include a CVE or URL to a place where details can be found. The details should be withheld until the fix is deployed.

Subcomponents

swingset controller.run("but only the high-priority queue")
swingset controller.pauseVats(vatIDs), unpause
a pattern for contracts to register for pause events, like they do with governance
a Cosmos-SDK module to receive the security committee txns and execute pause/unpause
a pattern for Cosmos-SDK modules to check the pause table and reject txns when disabled

As part of our triage, we will need to ensure that we do a tertiary analysis before creating a work item that makes details of an issue public, and that analysis should comprehensively examine whether or not said bug exists in other places in the stack. By including this in triage, before any details or work take place in public, we can reduce the risk of exploitation in other areas of the code.

Tartuffo commented 2 years ago

Need a few more sub issues

Tartuffo commented 2 years ago

@warner @jessysaurusrex I made this an epic. Can the two of you please coordinate on creating the appropriate sub-issues?

Tartuffo commented 2 years ago

Bump @jessysaurusrex do you need anything to help get this moving?

warner commented 2 years ago

@Chris-Hibbert added a feature to allow each contract instance to ask Zoe to block the delivery of specific messages (identified as a list of strings). The economic committee can use a multisig message to instruct the main Inter-protocol contracts (through their governance facets) to use this facility.

So our current response plan for the "bug discovered in Inter contract XYZ that allows funds to be stolen" scenario is:

- quietly analyze the problem enough to identify a subset of contract messages that are vulnerable
- quietly explain to economic committee, ask them to sign the pause directive
- committee members apply their individual signatures
- pause directive is published to the chain
  - this makes the attack surface of the bug visible
  - from the time an attacker observes the submitted directive, to the time it gets executed in a block, the contract is vulnerable to someone who can quickly reverse-engineer the bug from the list of methods that would be needed to exploit it
- once executed, we can discuss the problem more publicly
- a fix is developed, which probably requires a contract vat upgrade
- the same economic committee exercises their zoe/contract-governance authority to perform the vat upgrade
- once upgraded, the same committee uses a multisig message to instruct the contract to tell Zoe to remove the message block

That provides the most narrow tool we expect to have available: it can block a single message to a single contract instance, or (with a broad enough multisig message) block all messages on all core contract instances.

The most coarse tool is that we talk to a lot of validators and ask them all to pause the chain, by just turning off 1/3rd of the validators at the same time. This would be the same path we'd take to get an emergency software upgrade through, where we couldn't afford to wait for an on-chain vote (perhaps because the chain is already halted, and therefore cannot assist with the voting process).

We think that, between these two tools, we don't need an intermediate tool right away. That intermediate tool would e.g. pause a specific vat entirely (queue all messages, rather than block/reject specific ones), or maybe pause the kernel while allowing the rest of cosmos to keep running. Such a tool would be nice to have, but many of the scenarios in which we'd want to use it are drastic enough that simply stopping the entire chain might be just as good.

So we're pushing this out of MN-1.

Chris-Hibbert commented 2 years ago

The Zoe feature allows the contract to block exercise of a subset of invitations, identified by their description strings. It doesn't block delivery of arbitrary messages to the contract.
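
To illustrate the distinction, here is a toy model of filtering by invitation description. It is not Zoe's actual interface: `makeInstanceGate`, `setBlocked`, and `exercise` are hypothetical names, and an invitation is reduced to an object with a `description` string.

```javascript
// Toy model of a per-instance offer filter; the names are hypothetical
// and this is not Zoe's real API.
function makeInstanceGate() {
  let blocked = new Set(); // invitation description strings

  return {
    // Replace the current block list (an empty list removes the block).
    setBlocked(descriptions) {
      blocked = new Set(descriptions);
    },
    // Only exercising an invitation is gated; other messages to the
    // contract never pass through this check.
    exercise(invitation, handler) {
      if (blocked.has(invitation.description)) {
        throw new Error(`not accepting offer: ${invitation.description}`);
      }
      return handler(invitation);
    },
  };
}
```

In this model, blocking everything on an instance means listing every invitation description it can issue, which is why the pause directive reveals the attack surface.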
