Closed warner closed 11 months ago
Note for ourselves later: the kernel services the reapQueue before it looks at the run-queue. So our upgrade-time reaping process needs to do:
await controller.run();
controller.reapAllVats();
await controller.run();
to ensure that no GC-work-adding deliveries occur after the BOYDs are executed. The first controller.run() must be complete: it should not be limited by a runPolicy, otherwise there might still be runQueue items left over that won't happen until after the BOYDs.
Run after reap will likely free more objects. Should we execute reap a few times until nothing is found anymore?
Edit: in that case, we only need to run reap in the vats that have seen deliveries. In some of my early experiments, I plumbed an option for the last known position for each vat, so that reap "all" could reap only vats that had new deliveries.
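One way to picture the "reap a few times until nothing is found" idea is the sketch below. It is only an illustration: makeMockController is a hypothetical stand-in, and having run() resolve to a count of freed objects is an assumption made for this example, not something the real controller is known to report. The downstream field models a vat whose freed objects release references held by another vat.

```javascript
// Hypothetical mock controller, for illustration only (not the SwingSet API).
// `vats` maps vatID -> { garbage: number, downstream?: vatID }.
function makeMockController(vats) {
  const reapQueue = [];
  return {
    // Schedule a BOYD for every vat; nothing is collected until run().
    reapAllVats() {
      reapQueue.push(...Object.keys(vats));
    },
    // ASSUMPTION: resolves to the number of objects freed during this run.
    async run() {
      let freed = 0;
      while (reapQueue.length > 0) {
        const vat = vats[reapQueue.shift()];
        freed += vat.garbage;
        // Freed objects release references on the downstream vat, which can
        // collect them at its own next BOYD.
        if (vat.downstream) {
          vats[vat.downstream].garbage += vat.garbage;
        }
        vat.garbage = 0;
      }
      return freed;
    },
  };
}

// Drain pending work, then alternate reap/run until a round frees nothing
// (or maxRounds is reached, so a misbehaving vat cannot loop forever).
async function reapUntilQuiescent(controller, maxRounds = 10) {
  await controller.run(); // must be a full run, not limited by a runPolicy
  let rounds = 0;
  let totalFreed = 0;
  while (rounds < maxRounds) {
    controller.reapAllVats();
    const freed = await controller.run();
    rounds += 1;
    totalFreed += freed;
    if (freed === 0) break;
  }
  return { rounds, totalFreed };
}
```

With a chain such as v3 -> v2 -> v1, each round only moves the garbage one vat along, so several rounds are needed before a round comes back empty. Restricting each round to vats that saw new deliveries, as suggested above, would reduce the cost of the later rounds.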
What is the Problem Being Solved?
We might be able to survive remediating #8400 and #8401 in "one fell swoop", where we do a chain-halting upgrade, which upgrades the vats involved, to code whose buildRootObject deletes or clears the durable Sets and Maps that are holding the 100k+ objects alive. That will free a lot of objects, all of which will get GCed at the next BOYD event.
To ensure that this BOYD happens during the upgrade interval, and not some random number of blocks afterwards, we should have a way for the upgrade code to request that the kernel force a dispatch.bringOutYourDead to all vats. We can then follow this up with a controller.run() and flush out all the garbage. This may take a considerable amount of time (I'm roughly estimating ten minutes), but as long as this completes within the upgrade downtime for all validators, the process will be survivable (merely annoying). If for some reason the large BOYD didn't happen during the downtime, then it would happen some unknown number of blocks later (depending upon how much activity those vats experienced), causing a surprising multi-minute chain stall. Worse, because some validator computers are faster than others, this stall would probably knock the slower ones out of the validator set.
Description of the Design
Add a controller.reapAllVats() method. This will enumerate all vats (static and dynamic) and perform a dispatch.bringOutYourDead() on each of them, or inject something into the run-queue to perform the same.
The BOYDs do not necessarily need to complete by the time reapAllVats() returns, but all the work should complete during the next full controller.run() call.
Security Considerations
This represents a performance hit, but it is only reachable by the host application (via controller), not anything inside a vat.
Scaling Considerations
This might take a long time, but better to stall at a known time than at a surprising one.
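For reference, the design described above (enumerate static and dynamic vats, schedule a BOYD for each, and let the next full run do the work) can be modelled with a toy kernel. Everything in this sketch is hypothetical: makeToyKernel, queueToVat, and the log array are illustrative names, not the real kernel's internals. It also shows why leftover run-queue items end up after the BOYDs when the reap queue is serviced first.

```javascript
// Toy kernel model; all names are illustrative, not the SwingSet API.
function makeToyKernel({ staticVats, dynamicVats }) {
  const reapQueue = []; // vatIDs waiting for bringOutYourDead
  const runQueue = []; // ordinary deliveries: { vatID, method }
  const log = []; // [vatID, method] pairs in the order they were delivered

  // Only schedules the BOYDs: nothing is delivered until run().
  function reapAllVats() {
    for (const vatID of [...staticVats, ...dynamicVats]) {
      if (!reapQueue.includes(vatID)) {
        reapQueue.push(vatID);
      }
    }
  }

  function queueToVat(vatID, method) {
    runQueue.push({ vatID, method });
  }

  // Services the reap queue before the run-queue, until both are empty.
  // Resolves to the number of deliveries performed.
  async function run() {
    let cranks = 0;
    for (;;) {
      if (reapQueue.length > 0) {
        log.push([reapQueue.shift(), 'bringOutYourDead']);
      } else if (runQueue.length > 0) {
        const { vatID, method } = runQueue.shift();
        log.push([vatID, method]);
      } else {
        break;
      }
      cranks += 1;
    }
    return cranks;
  }

  return { reapAllVats, queueToVat, run, log };
}
```

In this model, a delivery still sitting in the run-queue when reapAllVats() is called is performed after every BOYD, so any garbage it creates is not collected by that reap. Draining the run-queue with a full run() before calling reapAllVats() avoids that.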
Test Plan
The unit test should create two or three vats, send them different numbers of messages (to get their deliveries-until-BOYD counters to different states), then perform the controller.reapAllVats() and a controller.run(). The test should somehow assert that the BOYDs happened (either snooping on the slogSender for type=='deliver' && kd[0]=='bringOutYourDead', or examining the swing-store transcript entries).
Upgrade Considerations
none
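Returning to the Test Plan: the slog-snooping option could be sketched as a small recorder that the test passes in as its slogSender. This is a rough sketch, not the real test. The type and kd checks come from the condition quoted in the Test Plan, while the vatID field on slog entries and the helper names makeBoydRecorder and assertAllVatsReaped are assumptions made for illustration.

```javascript
// Records which vats received a bringOutYourDead delivery, based on slog
// entries. ASSUMPTION: each 'deliver' entry carries a `vatID` field.
function makeBoydRecorder() {
  const boydsByVat = new Map(); // vatID -> number of BOYD deliveries seen
  const slogSender = entry => {
    if (
      entry.type === 'deliver' &&
      Array.isArray(entry.kd) &&
      entry.kd[0] === 'bringOutYourDead'
    ) {
      boydsByVat.set(entry.vatID, (boydsByVat.get(entry.vatID) || 0) + 1);
    }
  };
  return { slogSender, boydsByVat };
}

// Throws if any of the expected vats never saw a BOYD.
function assertAllVatsReaped(boydsByVat, expectedVatIDs) {
  const missing = expectedVatIDs.filter(vatID => !boydsByVat.has(vatID));
  if (missing.length > 0) {
    throw new Error(`no bringOutYourDead seen for: ${missing.join(', ')}`);
  }
}
```

A test would install slogSender before sending its messages, call controller.reapAllVats() followed by a controller.run(), and then call assertAllVatsReaped with the IDs of the vats it created.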