Open warner opened 10 months ago
It might also be useful to change reapAllVats
to look at this countdown, and skip reaping the vats which haven't received any deliveries since the last one. We might then change the name to reapAllDirtyVats
.
In my ghost-replay testing, I'm doing a cleanup pass before submitting the new operation (so I can tell what DB changes are a result of my injected action, versus GC consequences that were just waiting around for a BOYD), and I've found that I need to make two reapAllVats()
calls to clean everything up. This is the usual "one hop per BOYD" issue, where vatA gets reaped before vatB, but vatB releases some things that were coming from vatA, and you need a BOYD on vatA to let the cleanup process terminate. (The worst case would be a linked list that travels through multiple vats: you could only free one link per reapAllVats()
). Each reapAllVats()
must reload all vats to delivery the BOYD, which requires a transcript replay, and the vat-warehouse LRU cache gets thrashed, now that we have more vats than its maxVats size.
Making reapAllVats()
free for untouched vats would make it a lot cheaper to do several of them, both avoiding the replay and not knocking out the still-active vats.
We might even want this to also trigger a heap snapshot for all vats (which, really, would mean removing the enqueue-BOYD step and just follow the "time to make a heap snapshot" path, since that path also does a BOYD). While testing large GC operations, I'm noticing a significant amount of time being spent replaying a vat which just completed a large BOYD. The unfortunate LRU-thrashing property of reapAllVats
, combined with needing to do multiple ones (to handle multiple hops through the object graph), means:
reapAllVats()
reapAllVats()
This results in duplicate work, some of which is super-expensive (millions of syscalls, 10 minutes of wallclock).
Of course, this would also be handled elegantly by @mhofman 's suggestion to schedule heap snapshots based on computrons, rather than simply counting deliveries. The large BOYD would trigger a snapshot. #6786
We should also change the snapshot trigger to look at (or remember) the last delivery, and refrain from a new BOYD if that was the last thing we did in that vat.
It might also be useful to change
reapAllVats
to look at this countdown, and skip reaping the vats which haven't received any deliveries since the last one. We might then change the name toreapAllDirtyVats
.
This makes a lot of sense to me. I don't think we need to rename the operation, we just note that reaping vats that haven't had any activity in them since the last time goes really fast.
The large BOYD would trigger a snapshot
We'd have this weird situation where BOYD triggers a snapshot and snapshot triggers a BOYD, so it seems like it might mainly be a question of which goes first. Perhaps the alternate route is to reframe reapAllVats
to directly initiate a snapshot rather than a BOYD, though it seems weird to initiate snapshots on vats that are currently swapped out because they got swapped out in a post-BOYD state to begin with. The observation is that it's the BOYD of other vats that causes stuff to be freed, in which world swapping a vat in just to immediate swap it out again wouldn't necessarily be so crazy, since if significant GC happens the second snapshot could be smaller.
The other thing that might be worth considering is having the BOYD-them-all logic to somehow be in cahoots with the cache, perhaps by using knowledge of what's currently swapped out vs. what's swapped in to tweak the order of sweeps so they don't interact with the cache so pathologically. I have memories of Alan Karp talking about big numerical calculations at IBM where MRU instead of LRU gave optimal cache benefit.
It might also be useful to change
reapAllVats
to look at this countdown, and skip reaping the vats which haven't received any deliveries since the last one. We might then change the name toreapAllDirtyVats
.
Yup I suggested something similar in https://github.com/Agoric/agoric-sdk/issues/8626#issuecomment-1847819729, and on slack but using the vatPos
instead. reapCountdown
would be a lot cleaner.
this would also be handled elegantly by @mhofman 's suggestion to schedule heap snapshots based on computrons, rather than simply counting deliveries. The large BOYD would trigger a snapshot. #6786
We should also change the snapshot trigger to look at (or remember) the last delivery, and refrain from a new BOYD if that was the last thing we did in that vat.
As I mention in that issue, I think we should only run BOYD and snapshotting together, triggered by the computrons of previous deliveries. As such the schedule would not count the cost of BOYD and snapshotting towards the interval, but I think that's fine. If anything, a costly BOYD should increase the interval at which you perform it (which could be captured as adding the computron cost of the last BYOD to the "next reap meter" instead of decreasing from it)
it seems weird to initiate snapshots on vats that are currently swapped out because they got swapped out in a post-BOYD state to begin with
I don't understand. If BOYD and snapshotting is an atomic operation, how can that happen?
Perhaps the alternate route is to reframe
reapAllVats
to directly initiate a snapshot rather than a BOYD
That doesn't seem right. Snapshotting is conceptually an internal detail of some types of workers, where BOYD is a concept applying to all vats. If anything snapshots should be an implicit effect of BOYD for those workers, and never explicitly performed by the controller.
What is the Problem Being Solved?
Vats are normally GCed (
dispatch.bringOutYourDead()
) by a per-vat timer namedreapCountdown
. This is initialized toreapInterval
(set to 1000 on mainnet), and decremented for each delivery to the vat. When it hits zero, a BOYD is performed, and the counter is reset.We recently landed
controller.reapAllVats()
, which schedules an immediate (well, at the start of the nextcontroller.run()
BOYD to all vats. However this doesn't change the "organic" BOYD timing:reapCountdown
is not affected, so the vat will still do the usual BOYD when it hits zero.For purposes of our benchmarks, we'd like to reduce the variation caused by organically-scheduled BOYDs. We can do a
reapAllVats()
just before starting the benchmark, and again just afterwards (so the object counts are accurate), but it would be less noisy if we could reduce the frequency of BOYDs in the middle. We should consider configuring the benchmark environment to usedefaultReapInterval: 'never'
, however it would also make a certain amount of sense ifreapAllVats()
were to reset the counters back toreapInterval
.(Now I'm on the fence about whether this is a good idea.. using
defaultReapInterval: 'never'
seems best, and if we use that, then it doesn't matter whether we reset the counters. But I'll capture the idea for further discussion).Description of the Design
If implemented, we would:
vatKeeper.resetReapCountdown()
kernel.reapAllVats()
to call resetReapCountdown just after enqueuing the BOYDSecurity Considerations
none
Scaling Considerations
none
Test Plan
Update the unit test to inspect the kvStore after calling
reapAllVats()
, to assert that the counters have all been reset.Upgrade Considerations
none
cc @FUDCo