Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

tools to pause chain/kernel/vats before security upgrades #4536

Open warner opened 2 years ago

warner commented 2 years ago

What is the Problem Being Solved?

Imagine we (Agoric) have just received disclosure of a significant security bug in some component of the running chain. How can we safely deploy a fix, without giving attackers time to exploit the problem?

The vulnerability might already be known to the attackers, and they've just been waiting for it to become worth exploiting (e.g. waiting for a liquidity pool to grow to a juicy size), so they may execute their attack as soon as they see/suspect a fix coming. Or they don't already know the problem, but can reverse-engineer it from the fix, and then perform the attack before the fix is fully deployed.

The core issue is the non-zero times between the defender's sequence (learning about a problem, fixing it, deploying the fix) and the attacker's sequence (learning about a problem, developing an exploit, executing the exploit). This problem exists in distributed systems of all shapes and sizes, but it's particularly exciting for decentralized systems, where there is no one party with the authority to make a change. The fix may involve changing some parameter within a contract, or upgrading the contract, or upgrading the entire chain. To deploy the fix, some changes will require transactions sent into the chain (which must make their way through various public queues before execution, giving attackers an opportunity to front-run them, and exposing them to MEV threats from observant validators). Deeper fixes require coordinating the validator community to upgrade their software. And both kinds of fixes might be telegraphed by commits to an open-source code repository before they are ready to be deployed. Both of these reveal significant information to the attackers, who may then be able to act before the fix is fully implemented.

A powerful tool to address this is the "snooze button". A small group can have the power to pause some or all of the chain's activity, giving a larger group time to develop and deploy a fix. Then, after the fix is deployed, the chain is resumed. The pause event can reveal the existence of a problem, but not the details, reducing the attacker's advantage. Only the attacker who already knew about a problem and was ready to execute their attack (and can race ahead) can react to the pause event.

Once paused, the defenders can work on the fix in public, or at least they can safely involve a larger group to test the fix and coordinate deployment. This reveals the details to the attackers, but by that point it is too late for them to exploit.

Users of our system care about liveness: knowing that their transactions can't be blocked forever (at least not without the approval of some larger governance committee). They care that this "snooze button" has a limited duration, perhaps a few days or a few weeks. But we can imagine various "sizes" of snooze buttons, with larger governance requirements over the longer-duration delays.

Categories of Attack, Categories of Fixes

We're imagining problems that affect components at various scales:

We also imagine fine-grained contract pauses, in which the contract consults a table of what activity should and should not be allowed at any given moment. The contract might reject method invocations when paused, or it might queue them internally. We can imagine contracts registering to hear about updates to the "emergency pause table", via high-priority update messages. In this approach:

A similar "pause table" could be used at the Cosmos-SDK level, between Go modules, without using the bridge device.
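The contract-level pause table described above could be sketched roughly as follows. This is a toy model under stated assumptions: `makePauseTable`, the `'reject'`/`'queue'`/`'allow'` modes, and the activity names are all hypothetical and do not correspond to anything in agoric-sdk.

```javascript
// Hypothetical sketch of a contract-side "emergency pause table".
// None of these names come from agoric-sdk.
function makePauseTable() {
  // Maps an activity name (e.g. a method name) to 'reject' or 'queue'.
  const modes = new Map();
  // Invocations deferred while their activity was in 'queue' mode.
  const queued = [];

  return {
    // Called when a high-priority pause-table update arrives.
    update(activity, mode) {
      if (mode === 'allow') {
        modes.delete(activity);
      } else {
        modes.set(activity, mode);
      }
    },
    // Wraps one method invocation; `thunk` performs the real work.
    invoke(activity, thunk) {
      const mode = modes.get(activity);
      if (mode === 'reject') {
        throw new Error(`${activity} is paused`);
      }
      if (mode === 'queue') {
        queued.push({ activity, thunk });
        return undefined;
      }
      return thunk();
    },
    // Runs, in arrival order, queued invocations whose activity is allowed again.
    drain() {
      const results = [];
      const stillPaused = [];
      for (const entry of queued) {
        if (modes.has(entry.activity)) {
          stillPaused.push(entry);
        } else {
          results.push(entry.thunk());
        }
      }
      queued.length = 0;
      queued.push(...stillPaused);
      return results;
    },
  };
}
```

Whether a paused activity should reject or queue is a per-contract policy choice; queuing preserves user intent but means a backlog executes all at once when the pause lifts.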

Most of these pauses would be initiated by a Cosmos-SDK module, which reacts to a quorum of signed transactions from a small "security committee". This module would then change parameters, send bridge-device updates, and tell the Swingset module how/whether to interact with the kernel. For example, the Swingset module currently calls the swingset controller.run(runPolicy) method during END_BLOCK to perform a bounded amount of work (pulling from all queues in priority order). If the pause type was "stop servicing low-priority queues", this module would be instructed to instead call controller.run(runPolicy, { onlyServiceQueue: 'high' }) or similar. Timer and mailbox events would still be pushed onto the run-queue, but the low-priority consequences would not happen until the setting was changed.
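To make the "stop servicing low-priority queues" idea concrete, here is a toy model of a bounded per-block run over two queues. It does not call the real swingset controller: `runBlock`, the `budget` count, and the queue names are assumptions for illustration, and `onlyServiceQueue` simply mirrors the hypothetical option floated above rather than an existing one.

```javascript
// Toy model of a bounded per-block run over prioritized queues.
// `onlyServiceQueue` mirrors the hypothetical option discussed above;
// it is not an existing swingset controller option.
function runBlock(queues, budget, options = {}) {
  const { onlyServiceQueue } = options;
  const order = ['high', 'low']; // priority order
  const done = [];
  for (const name of order) {
    if (onlyServiceQueue && name !== onlyServiceQueue) {
      continue; // paused queue: its items stay queued for a later block
    }
    const queue = queues[name];
    // Stand-in for the run policy: stop after `budget` items.
    while (queue.length > 0 && done.length < budget) {
      done.push(queue.shift());
    }
  }
  return done;
}

// Usage: while paused, only the high-priority queue drains.
const queues = { high: ['h1'], low: ['l1', 'l2'] };
const duringPause = runBlock(queues, 10, { onlyServiceQueue: 'high' });
const afterPause = runBlock(queues, 1);
```

Note that new work can still be pushed onto `queues.low` during the pause; it just is not consumed until a run without the restriction.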

To maintain liveness, each of these pauses needs to be clearly time-bounded. The Cosmos-SDK module that receives the security committee txn needs to watch the block height and unpause everything when the pause expires. Additional votes (with a larger quorum requirement) might extend the pause if more time is necessary to develop/test/deploy the fix.
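The time bound could be enforced with a small tracker that is consulted at the end of every block. The sketch below is written in JavaScript for consistency with the rest of this thread, although the real logic would live in a Cosmos-SDK (Go) module; `makePauseTracker`, the `pause-N` id format, and measuring duration in blocks are all assumptions.

```javascript
// Hypothetical pause tracker keyed by block height.
function makePauseTracker() {
  const active = new Map(); // pauseId -> { kind, expiresAtHeight }
  let nextId = 1;
  return {
    // Records a new pause and returns its id.
    startPause(kind, currentHeight, durationBlocks) {
      const id = `pause-${nextId}`;
      nextId += 1;
      active.set(id, { kind, expiresAtHeight: currentHeight + durationBlocks });
      return id;
    },
    // Called once per block; removes and returns the ids that have expired.
    endBlock(currentHeight) {
      const expired = [];
      for (const [id, pause] of active) {
        if (currentHeight >= pause.expiresAtHeight) {
          expired.push(id);
        }
      }
      for (const id of expired) {
        active.delete(id);
      }
      return expired;
    },
    // True while any unexpired pause of this kind exists.
    isPaused(kind) {
      for (const pause of active.values()) {
        if (pause.kind === kind) {
          return true;
        }
      }
      return false;
    },
  };
}
```

Because expiry is checked unconditionally on every block, the pause lifts even if the committee that started it never sends another transaction.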

Disclosure Timeline

We imagine a sequence like the following:

If it looks like the pause window won't be enough, a larger security committee might have the authority to extend it. We'll need the pause events to have IDs so the txn that extends it can be easily matched to what is being extended.
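One way the pause IDs might be used to match extension votes to an existing pause is sketched below. Everything here is hypothetical: `makePauseExtender`, the ballot key format, the signer names, and the quorum size are placeholders, and signature verification is omitted entirely.

```javascript
// Hypothetical: extension votes are matched to a pause by its id and only
// take effect once `extendQuorum` distinct signers agree on the same expiry.
function makePauseExtender(pauses, extendQuorum) {
  const ballots = new Map(); // "pauseId@newExpiryHeight" -> Set of signers
  return function voteToExtend(pauseId, signer, newExpiryHeight) {
    const pause = pauses.get(pauseId);
    if (pause === undefined) {
      throw new Error(`unknown pause id: ${pauseId}`);
    }
    if (newExpiryHeight <= pause.expiresAtHeight) {
      throw new Error('extension must move the expiry later');
    }
    const key = `${pauseId}@${newExpiryHeight}`;
    if (!ballots.has(key)) {
      ballots.set(key, new Set());
    }
    const signers = ballots.get(key);
    signers.add(signer); // a Set ignores repeat votes from the same signer
    if (signers.size < extendQuorum) {
      return false; // not enough votes yet
    }
    pause.expiresAtHeight = newExpiryHeight;
    ballots.delete(key);
    return true;
  };
}

// Usage: two distinct signers are needed to extend pause-1.
const pauses = new Map([['pause-1', { expiresAtHeight: 100 }]]);
const voteToExtend = makePauseExtender(pauses, 2);
const first = voteToExtend('pause-1', 'alice', 200);
const second = voteToExtend('pause-1', 'bob', 200);
```

Keying ballots on both the pause id and the proposed expiry keeps votes for different extension lengths from being counted together.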

The pause event should probably include a CVE or URL to a place where details can be found. The details should be withheld until the fix is deployed.

Subcomponents

Related

jessysaurusrex commented 2 years ago

There's one process step to capture that will be important for this, especially if the security team chooses to work on an issue in the open while a "snooze" is in effect.

As part of our triage, we will need to ensure that we do a tertiary analysis before creating a work item that makes details of an issue public, and that analysis should comprehensively examine whether or not said bug exists in other places in the stack. By including this in triage, before any details or work take place in public, we can reduce the risk of exploitation in other areas of the code.

Tartuffo commented 2 years ago

Need a few more sub issues

Tartuffo commented 2 years ago

@warner @jessysaurusrex I made this an epic. Can the two of you please coordinate on creating the appropriate sub-issues?

Tartuffo commented 2 years ago

Bump @jessysaurusrex do you need anything to help get this moving?

warner commented 2 years ago

@Chris-Hibbert added a feature to allow each contract instance to ask Zoe to block the delivery of specific messages (identified as a list of strings). The economic committee can use a multisig message to instruct the main Inter-protocol contracts (through their governance facets) to use this facility.

So our current response plan for the "bug discovered in Inter contract XYZ that allows funds to be stolen" scenario is:

That provides the most narrow tool we expect to have available: it can block a single message to a single contract instance, or (with a broad enough multisig message) block all messages on all core contract instances.

The most coarse tool is that we talk to a lot of validators and ask them all to pause the chain, by just turning off 1/3rd of the validators at the same time. This would be the same path we'd take to get an emergency software upgrade through, where we couldn't afford to wait for an on-chain vote (perhaps because the chain is already halted, and therefore cannot assist with the voting process).

We think that, between these two tools, we don't need an intermediate tool right away. That intermediate tool would e.g. pause a specific vat entirely (queue all messages, rather than block/reject specific ones), or maybe pause the kernel while allowing the rest of cosmos to keep running. Such a tool would be nice to have, but many of the scenarios in which we'd want to use it are drastic enough that simply stopping the entire chain might be just as good.

So we're pushing this out of MN-1.

Chris-Hibbert commented 2 years ago

The Zoe feature allows the contract to block exercise of a subset of invitations, identified by their description strings. It doesn't block delivery of arbitrary messages to the contract.
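As a rough illustration of that distinction, here is a toy model of filtering by invitation description. It is not Zoe's actual API: `makeInvitationFilter`, `setBlocked`, `exerciseInvitation`, the exact-match rule, and the example descriptions are all assumptions.

```javascript
// Toy model, not Zoe's actual API: invitations are blocked by exact match
// on their description string; nothing else about the contract is affected.
function makeInvitationFilter() {
  let blocked = new Set();
  return {
    // Replaces the whole blocked list, e.g. in response to a governance action.
    setBlocked(descriptions) {
      blocked = new Set(descriptions);
    },
    // Runs `handler` unless the invitation's description is blocked.
    exerciseInvitation(description, handler) {
      if (blocked.has(description)) {
        throw new Error(`invitation "${description}" is blocked`);
      }
      return handler();
    },
  };
}

// Usage: block one kind of invitation while leaving another available.
const filter = makeInvitationFilter();
filter.setBlocked(['open position']);
const closed = filter.exerciseInvitation('close position', () => 'closed');
```

The filter only gates the exercise of invitations; other messages sent to the contract are outside its reach, which is the limitation noted above.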