Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
303 stars 191 forks source link

terminate old mainnet price-feed vats #9483

Open warner opened 3 weeks ago

warner commented 3 weeks ago

What is the Problem Being Solved?

8400 identified misbehavior in the mainnet price-feed vats which caused their storage and c-list use to grow constantly over time. #8401 identified misbehavior in zoe/contract interactions that caused similar symptoms. The cause of these problems has been fixed, and these fixes will be deployed in mainnet upgrade-16, in #9370. This will upgrade Zoe to stop producing most #8401 cycles, and will introduce new price-feed vats that do not suffer from the #8400 growth.

Unfortunately that deployment will (necessarily) leave the old bloated price-feed vats in place, which will retain the troublesome storage and c-list entries (both in the price-feed vats themselves, and in the upgraded vat-zoe because of the cycles). Terminating the price-feed vats would free up that storage, but until the changes of #8928 are deployed, deleting that much data in a single step would crash the kernel, or at least cause a multi-hour -long stall.

So the task, after #8928 is deployed, is to deploy an action to the chain that will terminate the old price-feed vats.

Description of the Design

I think @Chris-Hibbert told me that Zoe retains the adminNode necessary to call E(adminNode).terminateWithFailure(), but does not expose it to the bootstrap/core-eval environment, which means a single core-eval would not be sufficient to get these vats terminated.

One option is to wait for #8687 to be deployed to mainnet (also in a chain-halting kernel upgrade), and then have the outside-the-kernel upgrade handler call controller.terminateVat() on each of these vatIDs. These two steps could happen in the same chain-halting upgrade.

Another option is to change Zoe somehow to allow governance to trigger terminateWithFailure(), and then perform governance actions to get the vats terminated. This would be better from an authority point of view, but would require more upgrade steps to get there, and might or might not fit our long-term model of how authority should be distributed.

Security Considerations

The ability to terminate a vat must be pretty closely held, as continued operation of a contract is a pretty fundamental part of what chain platform promises. #8687 is not an authority violation because it is not reachable by any vat (it must be invoked by the host application, outside the kernel entirely), but that doesn't mean it's a very comfortable superpower to exercise. It'd be nice to find a good "in-band" means to get these old vats shut down.

Scaling Considerations

We must be careful to not overload the chain. The #8928 slow-termination feature enables this, and introduces a rate limit to control how much cleanup work is allowed during each idle block. We need to carefully test and closely monitor the cleanup rate to make sure the "background work" does not cause problems with validators. We know that doing significant swingset work causes the slower validators to miss the voting window (the number of votes per block drops significantly for a block or two after non-trivial swingset runs). The cleanup parameters are chosen to keep the cleanup work down to about 100ms on my local machine, but we don't know exactly how long that will take on the slower validators. There will always be a slowest validator, and giving them work to do will always have the potential to knock them out of the pool, but we should try to avoid too much churn, especially because cleanup will be ongoing for weeks or months, and there won't be any idle blocks during that time, so they will not have the same chance to catch up as they would normally have.

If multiple vats have been terminated-but-not-deleted, #8928 only does cleanup on one vat at a time. So (in theory) it should be safe to delete all the vats at once, and the rate-limiting code will slowly crawl through all of the first vat's garbage before it eventually moves on to the second, etc. But this needs to be tested carefully.

Test Plan

We definitely need an a3p-like test to exercise whatever API we end up using.

In addition, we need to find some way to simulate and measure the actual rate-limiting. I intend to experiment with mainfork-based clones of the mainnet state, and deploy some hacked up version of this ticket, to watch how cleanup unfolds on real mainnet-sized data.

Upgrade Considerations

this is all about performing a chain upgrade to trigger this deletion

### Tasks
- [ ] https://github.com/Agoric/agoric-sdk/pull/8682
- [ ] https://github.com/Agoric/agoric-sdk/pull/9227
- [ ] https://github.com/Agoric/agoric-sdk/pull/9253
Chris-Hibbert commented 3 weeks ago

I think @Chris-Hibbert told me that Zoe retains the adminNode necessary to call E(adminNode).terminateWithFailure(), but does not expose it to the bootstrap/core-eval environment,

Zoe has it, and uses it to terminate in various circumstances, to return the done() promise, and a few other things, but it doesn't provide access to it externally. Zcf has it, so the contract can terminate itself. Zoe doesn't have any way to recognize a governor, so it can't provide access, but the contractGovernor inside the contract has access to ZCF.

We ought to be able to (in the future) give governance the ability to terminate the contract, but existing contracts would require both their governor and the contract to be upgraded to take advantage of this.