Kill switch: Guaranteeing a full recoverability process

When we talk about kill-switch recoverability we refer to allow DAOs to restore from real bad situations, and probably one of the best examples to think of is switching off the voting base contract, while having a voting app instance as the root authority of a DAO.

That said, we've been exploring the following edge cases to understand a robust and decentralized solution to this problem:

1. Switching off the `IssuesRegistry`

This should be allowed and we actually can handle this scenario from the kill switch itself. We will need to decide what to do if a DAO is trusting an issues registry that has been kill-switched off.

2. Switching off the Voting app

Even though we could provide a backdoor method to allow creating a vote to perform an upgrade, if the voting contract is broken, we cannot guarantee that particular vote wont be flawless. If the voting app is broken, we will need to bypass it completely. Thus, we need to allow DAOs to perform an upgrade without having to go through the voting app. That said, if we still want to have a decentralized process to make that decision, it will be a smart contract which we will have to trust. Then, it should be the thinnest, simplest, and most minimal implementation we can provide (we can think it as an analogy of a proxy to the upgradeability solution).

3. Switching off the ACL

As we mentioned before for the voting app, if the whole ACL contract is broken, we cannot even trust the permissions query functions to check if a caller is allowed to perform an action like an upgrade. Then, we will need to either trust the ACL to use its authorization flow upon a given upgrade call, or to bypass it at all, building again a new custom and simple logic to do it.

4. Switching off the Kernel

If nothing could be trusted, there's nothing we can do. We need at least to trust the upgrade process, otherwise there’s no way out to solve a critical bug on a contract. But since the Voting app or the ACL could be broken, we should have a separate upgrade method that could be called only on emergency^(*). Note that all the methods used for this call must be part of the subset of methods of the Kernel/ACL we will always trust (e.g. we will always need to trust Kernel#getApp function, since it is the underlaying function for every proxy call).

5. Switching off the KillSwitch

The emergency upgrade alternative should check that the current version of the application is having an issue while the proposed new one is not. Then, we will need to trust the kill-switch implementation as well. Therefor, we will need to prevent users from kill-switching the kill-switch itself.

⚠️To discuss

Based on the different cases mentioned above we will need to discuss and decide which way we think is best to move forward. I suggest starting the discussion with the following ideas:

When saying "TO TRUST" an implementation, it means to consider that they won't have bugs at all. In other words, that we cannot kill-switch them.

If we want a final decentralized solution, i.e. implemented in a smart contract, we will need TO TRUST in something. Like upgradeability proxies, we cannot upgrade them, we simply trust they delegate calls properly.
Therefore, we will need TO TRUST a final decentralized decision-making process, something like a minimal version of the voting app, it should be as simple and thin as possible.
This final decentralized decision-making process could only be used during emergency^(*) situations, and could only be used to upgrade the app being compromised.
The item mentioned above means we will also need TO TRUST the KillSwitch implementation, otherwise we won't be able to tell the status of an app.
Additionally, given that upgrades are performed through the Kernel, we will also need TO TRUST all the Kernel methods being involved in such process. For example, Kernel#getApp.
Finally, even though the ACL is involved in many functions of the Kernel, we could implement a minimal authorization check for the final decentralized decision-making process, or TO TRUST the ACL implementation as well.

BTW, I've already been doing some proof of concepts about all these ideas.

💡Proposal

In order to start testing the whole kill switch idea, I suggest to have an MVP version that trusts the root of authority (voting) as well, besides the Kernel's and the ACL's methods that are required to perform upgrades. The idea is to delay the implementation of a new minimal decentralized process that acts as the final decision-making one, simply because it is a whole new problem in itself that will probably require a separate development lane.

(*) We refer to an emergency situation only when the base implementation of an app:

is kill-switched, due to a severity issue or because the DAO decided to switch it off, AND
the installed application for such base implementation is the one that has the APP_MANAGER_ROLE role to set a new base app for the broken app, AND
the installed application for such base implementation is the permissions manager of the APP_MANAGER_ROLE role of the DAO.

we will need TO TRUST in something (...) I suggest to have an MVP version that trusts the root of authority (voting) as well, besides the Kernel's and the ACL

I completely agree. The chain of trust can not continue indefinitely. Building "a final decentralized decision-making process with a minimal version of the voting app" doesn't seem like an easy task and we will never have 100% guarantee that it's bug-free, so i would say let's go with this MVP and at least we will have the rest protected.

A couple of comments more:

If I recall correctly from the test code on one of the last commits, the gas overhead is about ~32k, right? This is not negligible, it would be nice to explore reducing it (maybe we could open a separate issue to discuss this)
Just noticing, although it's out of the scope of aragonOS, that trusting the voting app we are implicitly trusting the underlying token.

I think that for an initial rollout of the Kill Switch we can assume that we can trust both the Kernel and the ACL (they wouldn't be protected by the Kill Switch).

For recoverability in case of a genuine bug, I like the idea of designating another address that would be able to perform an upgrade for a critical app if it has been frozen. Organizations would be able to decide whether they want to trust an address to perform this upgrade in case of emergency (which they could change in the future to a minimal voting app if it is implemented) or just whitelist the root authority application so it can never be frozen. I agree that implementing a minimal voting app would be quite an undertaking and not a high priority.

This emergency address that can be used to perform upgrades should also be able to change the default issues registry or the registry for a particular appId, as we should be able to recover in case that whoever controls the registry adds issues to freeze DAOs.

I am still a bit conflicted about how much complexity this emergency address would add and whether we should just trust the root authority for an initial version.

I am still a bit conflicted about how much complexity this emergency address would add and whether we should just trust the root authority for an initial version.

I like the idea of designating an emergency address as well, and started writing some code about it indeed. But I also think we can start testing the idea of a kill-switch without it, I mean, currently there is no way to freeze anything, neither the voting nor any other app. Therefore, between nothing and at least being able to freeze a subset of all the apps, I prefer the subset.

Summing up, I suggest moving forward with an MVP assuming that the Kernel, ACL, KillSwitch and Voting apps will be whitelisted.

Switching off the IssuesRegistry

With the current implementation this actually wouldn't be possible (as none of its methods are killSwitchProtected), but I agree we should be set up to handle this case (or if the user sets a wrong contract as the issue registry).

In this case I think the natural thing to do would be to "propagate" the killswitch up the stack, such that any KillSwitchs relying on the failing IssuesRegistry would be turned on for all non-whitelisted apps.

Switching off the Voting app

The problem with the "root authority" application is that an organization may be set up to only allow other apps to interact with it (by forwarding). In 0.7 organizations, this is now the default (you can only create a new vote by passing through the Token Manager, which checks that you hold organization tokens).

This presents a few problems:

To ensure safety, we should also check in EVMScriptRunner.runScript() that the app has not been killswitched, and
It's useless to keep the Voting app alive if nobody can create new votes.

The second part is quite a trap; if only the Voting app can change permissions, then in the event the Token Manager is killswitched, the organization becomes bricked (nobody can create votes to update apps, and nobody can create votes to update the permissions to let other addresses create votes).

Unfortunately I don't see a generic way to solve this, as the forwarding chain could be infinitely long. We could not implement 1, but this feels absolutely wrong; killswitched apps should not be allowed to run EVM scripts.

The only way out that I see is to have an "emergency address" scheme. On specific conditions that trigger the emergency mode (e.g. a designated app being killswitched, a network-wide switch being flipped, or directly invoked by an entity having the permission):

Permit the address to directly use Kernel.setApp() (or KillSwitch.setAllowedInstance()), or
Assume this address has all roles for the designated "root authority" app and can directly interact with it (in the case of Voting, it would be allowed to directly create a vote, but it would also be allowed to change voting parameters as it wished)
- Perhaps an additional field could be stored stating which role to grant, so the user is only given access to one particular function (e.g. createVote()) and any synchronous forwarding capabilities (e.g. Token Manager)

With the current implementation this actually wouldn't be possible (as none of its methods are killSwitchProtected), but I agree we should be set up to handle this case (or if the user sets a wrong contract as the issue registry).

Now that we moved it within the canPerform function, it does :)

The only way out that I see is to have an "emergency address" scheme. On specific conditions that trigger the emergency mode (e.g. a designated app being killswitched, a network-wide switch being flipped, or directly invoked by an entity having the permission): [...]

I think we all agreed on that, we all like the idea of having an emergency address. However, we still need to discuss whether it must be decentralized or not, IMO it must. But as I said above, "if we still want to have a decentralized process to make that decision, it will be a smart contract which we will have to trust", which means it cannot be kill-switched. Therefore my proposal of moving forward with an MVP implementation assuming that the Kernel, ACL, KillSwitch and the "root authority" apps will be whitelisted.

Therefore my proposal of moving forward with an MVP implementation assuming that the Kernel, ACL, KillSwitch and the "root authority" apps will be whitelisted.

The problem with just whitelisting the "root authority" apps is that it's hard to detect on-chain what should be considered part of the "root authority" due to forwarders. In a default 0.7 org, the Token Manager would have to be considered part of the "root authority" since it is the only one that can create votes. If we only whitelist the Voting app, then we have a liveness problem: nobody can kick start the voting process, so nothing can change.

But, this liveness problem can be solved if we are able to permit the emergency address (which could be ANY_ENTITY) a role (e.g. CREATE_VOTE_ROLE) only in emergency situations. The emergency address would only be allowed to do some things unilaterally, like create a vote to upgrade an app.

A potential model:

In your organization, you can choose an app and one of its roles to designate as the "emergency recovery" role
- You can optionally choose a single address, but by default we assume ANY_ADDRESS. We should discourage setting it to one of the org's apps, as it would create the same liveness problem.
- Access to this "emergency recovery" could be done through a specialized method, to avoid overhead in day-to-day calls to the ACL / etc.
Design an "emergency beacon" that allows organizations to acknowledge and roll past an emergency (likely easiest just going by time)
- If an issue were reported on the Network's IssuesRegistry that affected a forwarding app used as part of an org's "root authority" (we could do an off-chain check for this), emit emergency signal

An alternative, that would be less "global" (it may be a problem if anybody can create votes in any organization until that organization acknowledges the emergency) albeit at the cost of more user interaction, is to set up some rules per organization to detect the emergency mode when an app is killswitched (e.g. in 0.7 orgs, this would just include the Token Manager).

I see, and I think it's a particular case of what I described initially in the issue's description actually. We will still need to whitelist the Voting app, but also grant CREATE_VOTE_ROLE permissions to the emergency address, while allowing it only to create votes to upgrade compromised apps.

Note that the part of granting CREATE_VOTE_ROLE permissions is out of the scope of aragonOS, which makes me think if we should just provide an "emergency upgrade process" and encourage users to grant the create votes role to it...

aragon / aragonOS