add controller.reapAllVats()

warner commented 11 months ago

What is the Problem Being Solved?

We might be able to survive remediating #8400 and #8401 in "one fell swoop", where we do a chain-halting upgrade, which upgrades the vats involved, to code whose buildRootObject deletes or clears the durable Sets and Maps that are holding the 100k+ objects alive. That will free a lot of objects, all of which will get GCed at the next BOYD event.

To ensure that this BOYD happens during the upgrade interval, and not some random number of blocks afterwards, we should have a way for the upgade code to request that the kernel forces a dispatch.bringOutYourDead to all vats. We can then follow this up with a controller.run() and flush out all the garbage. This may take a considerable amount of time (I'm roughly estimating ten minutes), but as long as this completes within the upgrade downtime for all validators, the process will be survivable (merely annoying). If for some reason, the large BOYD didn't happen during the downtime, then it would happen some unknown number of blocks later (depending upon how much activity those vats experienced), causing a surprising multi-minute chain stall. Worse, becomes some validator computers are faster than others, this stall would probably knock the slower ones out of the validator set.

Description of the Design

Add a controller.reapAllVats() method. This will enumerate all vats (static and dynamic) and perform a dispatch.bringOutYourDead() on each of them, or inject something into the run-queue to perform the same.

The BOYDs do not necessarily need to complete by the time reapAllVats() returns, but all the work should complete during the next full controller.run() call.

Security Considerations

This represents a performance hit, but it is only reachable by the host application (via controller), not anything inside a vat.

Scaling Considerations

This might take a long time, but better to stall at a known time than at a surprising one.

Test Plan

The unit test should create two or three vats, send them different numbers of messages (to get their deliveries-until-BOYD counters to different states), then perform the controller.reapAllVats() and a controller.run(). The test should somehow assert that the BOYDs happened (either snooping on the slogSender for type=='deliver' && kd[0]=='bringOutYourDead', or examining the swing-store transcript entries).

Upgrade Considerations

none

warner commented 11 months ago

Note for ourselves later: the kernel services the reapQueue before it looks at the run-queue. So our upgrade-time reaping process needs to do:

await controller.run();
controller.reapAllVats();
await controller.run();

to ensure that no GC-work-adding deliveries occur after the BOYDs are executed. The first controller.run() must be complete: it should not be limited by a runPolicy, otherwise there might still be runQueue items left over that won't happen until after the BOYDs.

mhofman commented 11 months ago

Run after reap will likely free more objects. Should we execute reap a few times until nothing is found anymore?

Edit: in that case, we only need to run reap in the vats that have seen deliveries. In some of my early experiment, I plumbed an option for the last know position for each vat, so that reap "all" could reap only vats that had new deliveries.

Agoric / agoric-sdk