warner opened this issue 9 months ago
Ok, so `runPolicy()` will provide a way for the host app to limit how much termination work gets done, and when. For cosmic-swingset, I think we'll add one run to each block, at the end, which only gets performed if no other run spent any computrons. This run will allow a single round of termination-cleanup (by having `allowCleanup()` return `false` on all calls other than the first one), and will limit that round to only doing 5 items (by having it return 5 on the first call). For the price-feed vats that we're looking at, that will either:
The garbage in run-21 (08-feb-2024) contains:
Our remediation will just delete (slowly) the two large price-feed vats: v29 (ATOM-USD) and v68 (stATOM-USD). Most of the v7-board weak imports are coming from those two vats, as are most of the zoe cycles, and most of the quote payments are virtual objects in those vats.
Then, by implementing #8980 and setting the threshold to trigger a BOYD every 20 krefs:

- while deleting c-list imports which feed zoe cycles: (zoe's `syscall.retireImports` will probably delete the entangled exports, so they won't be around by the time we reach the exports phase)
- while deleting c-list imports which are recognizable by v7-board: each round will `retireExports` into vat-board, and will trigger a snapshot
- while deleting vatstore data: (no GC activity is triggered in other vats)
That points to about 23 days to finish the remediation, assuming that the chain is mostly idle (so every block gets to do a little bit of cleanup).
We could speed that up fairly safely by making a fancier scheduler which knows the difference between termination work that triggers GC actions, and work (vatstore deletions) which does not. For example, if we said that each `budget=1` allows one c-list entry or two vatstore entries to be processed, then the vatstore-deletion phase would run twice as fast (8 days total) and delete 10 entries per block.
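For example (a hypothetical budget shape, not necessarily what the kernel's `runPolicy` ends up accepting), the returned budget could distinguish the two kinds of work:

```js
// Hypothetical: weight GC-triggering c-list work separately from cheap
// vatstore deletions, so the vatstore phase can run twice as fast.
const allowCleanup = () => ({
  budget: {
    clistEntries: 5, // each of these provokes GC actions in other vats
    vatstoreEntries: 10, // plain kvStore deletes, no GC side effects
  },
});
```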
PR #9227 implements the swingset side of this. We'll still need changes to cosmic-swingset to provide a suitable `runPolicy`. I'm thinking that we only do cleanup work in empty blocks (so have `allowCleanup()` return `false` in all runs, then look at the total used computrons, and iff that was zero, do an additional run which allows `budget=5` cleanup and up to 65M computrons). Eventually we might want to change that to allow some cleanup work in non-empty-but-non-full blocks, but I think we should get some experience with the slower form before we enable a faster rate.
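A minimal sketch of what that cosmic-swingset integration could look like. The factory names are invented for illustration; the method names follow the general shape of SwingSet's `runPolicy` interface (`crankComplete` / `crankFailed` / `emptyCrank` / `allowCleanup`), but the exact fields are an assumption here, not the code cosmic-swingset actually ships:

```js
// Hypothetical sketch: normal runs never allow cleanup; one extra
// end-of-block run (performed only if the block used zero computrons)
// allows a single budget=5 round, capped at 65M computrons.
const makeComputronPolicy = (limit, allowCleanup) => {
  let total = 0n;
  return {
    allowCleanup,
    crankComplete: ({ computrons = 0n } = {}) => {
      total += computrons;
      return total < limit; // stop the run once the block budget is spent
    },
    crankFailed: () => true,
    emptyCrank: () => true,
  };
};

// Normal runs: no termination cleanup at all.
const normalRunPolicy = () => makeComputronPolicy(65_000_000n, () => false);

// The extra end-of-block run: allow one round of cleanup, limited to 5 items.
const cleanupRunPolicy = () =>
  makeComputronPolicy(65_000_000n, () => ({ budget: 5 }));
```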
What is the Problem Being Solved?
We've been brainstorming ways to address the large price-feed vats which are holding onto a lot of virtual objects (#8400, plus their contribution to #8401). While the fixes to stop their growth are coming along nicely, we will still eventually need a way to delete all the old data. The simplest (and most desirable) approach is to simply terminate those vats; however, with the current kernel, that would perform an unacceptably large amount of work and trigger some O(N^2) performance bugs in the kernel GC handling code, which would probably crash the chain.
#8877 is a workaround to have the vat release objects slowly, to avoid these problems. But a better, longer-term solution is to allow the kernel to safely terminate large vats, yet still perform the cleanup work slowly, over time, so it doesn't interfere with normal operations.
This ticket is about an approach which would fix all of the problems with the price-feed vats, as well as any vat which performed a "malicious large termination" (aka the "I'm too big to fail, hahaha" attack). It would still be vulnerable to vats which `collection.clear()` or delete a large collection (for which #8417 would be the fix).

The rough idea is to change `terminateVat` to merely mark the vat as terminated, and not immediately delete all the late vat's data (delete vatstore entries, delete c-list entries and propagate decrefs, abandon exported objects). We would defer that work until later "cleanup" steps, performed when the `runPolicy` says it's a good time (eg if/when there is remaining budget at the end of a block), and only do a small amount of work at a time. This work could be spaced out over months.

The kernel would still reject all the vat's outstanding Promises at the time of termination, so that their downstream effects are felt promptly. This isn't a problem with our current vats (`v29-ATOM-USD_price_feed` only decides three promises), but could be a too-big-to-fail attack vector in the future. The kernel would also immediately kill the worker process.

Description of the Design
the old design
When `terminateVat()` (kernel.js) is called, it does the following steps immediately:

- A: `kernelKeeper.enumeratePromisesByDecider()` to find all the promises decided by this vat, and then call `resolveToError()` on all of them
- B: `kernelKeeper.cleanupAfterTerminatedVat()` to delete the vat's state:
  - B1: `vatKeeper.deleteSnapshotsAndTranscript()` to delete historical/replay-enabling state
  - B2: walk the c-list for `c.o+` exports, and `orphanKernelObject()` each of them
  - B3: walk the c-list for `c.o-` imports, and `deleteCListEntry` each, which decrements refcounts
  - B4: delete the remaining kvStore data for the vat (mostly vatstore entries)
- C: remove the vat from `vat.dynamicIDs` and/or `vat.names`
- D: `notifyTermination()` to let vat-admin fire the `done()` promise, so userspace (eg the parent vat) can react
- E: `vatWarehouse.stopWorker()` to shut down the xsnap worker

When the vat has a large number of non-exported virtual objects (eg most of the QuotePayments in #8400), step B4 takes a long time. When those objects are exported, B2 takes a long time (eg the QuotePayments that are weakly referenced by vat-board).
When the vat is importing a lot of objects, such as the #8401 cycles, B3 takes a long time, and also adds a lot of krefs to `maybeFreeKrefs`, and at end-of-crank, the `processRefCounts()` call will add a lot of actions to `gcActions`. At the start of the next crank, `processGCActionSet` will deliver a large GC delivery to the upstream vat (like `dispatch.dropExports()` with 100k krefs into v9-zoe), which will trigger a very long delivery (with some multiple of 100k syscalls, all of which must get added into the transcript entry, causing bad memory behavior and JSON.stringify serialization time). Then, the next time v9-zoe does a BOYD, it will do some other multiple of 100k syscalls (to check on the status of all those entries, and probably free up additional objects as the cycles unwind), causing the same many-syscalls transcript problem, after which it will do its own very large `syscall.dropImports` or `syscall.retireExports` call with 100k vrefs, triggering more work.

the new design
We'll introduce a new `kvStore` key named `vats.terminated`, whose value is a simple JSON-serialized array of VatID strings. `KernelKeeper` will keep an in-RAM cache of it as a Set named `terminatedVats`, write a sorted version to `kvStore` every time we change it, and populate the cache at kernel boot (and write an empty list if the key did not already exist, which is our backwards-compatibility/upgrade point).

We'll change `kernelKeeper.ownerOfKernelObject` to consult both `${kref}.owner` and `terminatedVats`, so an object whose `.owner` is still present, but whose vatID is terminated, will be reported as orphaned. We also change `kernelKeeper.vatIsAlive` to report `false` for terminated vats (either by having it consult `terminatedVats` too, or by preemptively deleting the `${vatID}.o.nextID` key that it looks for).

Next, we change `terminateVat()` to add the vatID to `terminatedVats`, and skip steps B2, B3, and B4. This will leave the bulk of the vat's data (and the cleanup work) for later, allowing `terminateVat()` to finish quickly (and without provoking immediate large work).
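A rough sketch of the keeper-side bookkeeping described above (the function and method names here are illustrative stand-ins, not the exact ones in `kernelKeeper.js`):

```js
// Hypothetical kernelKeeper fragment: cache `vats.terminated` in RAM and
// consult it when deciding whether an object is orphaned or a vat is alive.
function makeTerminationTracking(kvStore) {
  // populate the cache at boot; write an empty list if the key is missing
  // (this is the vN-1 -> vN upgrade point)
  if (!kvStore.has('vats.terminated')) {
    kvStore.set('vats.terminated', JSON.stringify([]));
  }
  const terminatedVats = new Set(JSON.parse(kvStore.get('vats.terminated')));

  const save = () =>
    kvStore.set('vats.terminated', JSON.stringify([...terminatedVats].sort()));

  return {
    markTerminated: vatID => {
      terminatedVats.add(vatID);
      save();
    },
    forgetTerminated: vatID => {
      terminatedVats.delete(vatID);
      save();
    },
    ownerOfKernelObject: kref => {
      const owner = kvStore.get(`${kref}.owner`);
      // an owner that is itself terminated means the object is orphaned
      return owner && !terminatedVats.has(owner) ? owner : undefined;
    },
    vatIsAlive: vatID =>
      kvStore.has(`${vatID}.o.nextID`) && !terminatedVats.has(vatID),
  };
}
```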
Now, we introduce a new internal `kernel.doSomeCleanup(vatID, budget)` method (sketched below) which performs up to `budget` total steps of:

- orphaning the vat's exports (with `orphanKernelObject()`)
- deleting its c-list imports (with `deleteCListEntry()`)
- deleting its vatstore entries (once there are no `c.o+` export entries left to work on)
- finally, removing the vatID from `terminatedVats`
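A minimal sketch of how such a budgeted cleanup loop could be structured (the phase helpers like `orphanOneExport` are hypothetical stand-ins for the real c-list/vatstore iteration):

```js
// Hypothetical sketch of kernel.doSomeCleanup(vatID, budget): do at most
// `budget` units of work, earliest phase first, and report whether anything
// is left so the scheduler knows to come back later.
function doSomeCleanup(kernelKeeper, vatID, budget) {
  let remaining = budget;
  const phases = [
    () => kernelKeeper.orphanOneExport(vatID), // c.o+ entries -> orphanKernelObject()
    () => kernelKeeper.deleteOneImport(vatID), // c.o- entries -> deleteCListEntry()
    () => kernelKeeper.deleteOneVatstoreEntry(vatID), // plain kvStore deletes
  ];
  for (const doOneStep of phases) {
    // each helper returns false when its phase has nothing left to do
    while (remaining > 0 && doOneStep()) {
      remaining -= 1;
    }
    if (remaining === 0) {
      return { done: false, cleanups: budget }; // budget exhausted, more later
    }
  }
  kernelKeeper.forgetTerminatedVat(vatID); // nothing left: drop from vats.terminated
  return { done: true, cleanups: budget - remaining };
}
```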
To get this work scheduled, we change the `runPolicy` to add an `allowCleanup()` method. If present, this can return `{ budget }`, or can return `{}` to mean "do all the cleanup now". If it returns something falsy, that tells the kernel that no cleanup work should be done at this time. For backwards compatibility, if the method is missing, the kernel acts as if it returned `{}` (unlimited cleanup).

Then we add a new pseudo-run-queue message `{ type: 'cleanup', vatID, budget }`, a sibling to GC actions and reap-actions and actual run-queue messages like `deliver` and `notify`. We change the `getNextMessageAndProcessor()` function to take `runPolicy` as an argument, and change its priority list to consult `terminatedVats` before looking at anything else (but only if the policy allows cleanup right now). So its sequence will begin with a `budget = policyInput.allowCleanup()` check before it consults the other queues.
This provides a default behavior that nearly matches the old one: vat termination work runs to completion on the step just after the termination happened, but still before any other work gets done.
But, the final step is to change cosmic-swingset to provide a `runPolicy.allowCleanup = () => false` for all the existing runs, then to perform one final run (with `() => ({ budget })`) iff there is still room left in the block (eg if the previous runs did not exhaust the computron budget). I think `budget = 10` or `100` is likely.

So assuming the block that terminates the vat has free space, the final run will orphan eg 100 objects of the dead vat. If we assume that all these objects were also recognized by eg vat-board (and #7212 gets fixed), then this will push `dispatch.retireImports()` GC actions onto the queue, and these will be delivered as the second crank of this cleanup run. That may cause the recognizing vat to delete some collection entries, which incur syscalls and might create more garbage in those vats, but no more than 100 at a time.

The next non-full block will orphan 100 more. This will repeat, 100 krefs per block, until all the exports have been orphaned. The following block will begin on the c-list imports. Each block will decref 100 krefs, which will immediately push a 100-kref `dispatch.dropExports` call onto the GC action queue, as well as a similar `retireExports` call. Then the `dropExports` will be delivered, causing eg v9-zoe to do syscalls (change export-status to `recognizable`) on all 100, and accumulating 100 vrefs in its internal liveslots `possiblyDeadSet`. Next, the `retireExports` will be delivered, which will trigger syscalls to delete the export status. Each block will process 100 krefs in this way.

Eventually, v9-zoe will get enough deliveries to trigger a BOYD, which will perform more vatstore syscalls to check for remaining reference counts, and when those counts come up empty, liveslots will delete those virtual objects, which will trigger more deletions (unwinding the cycles). The amount of work will be bounded by the number of accumulated deleted objects, which only happen 100 at a time.
Once all the terminated vat's imports are deleted, the next non-full block will start on B4, and will delete 100 vatstore entries. This doesn't cause any activity in other vats, so the only cost is `kvStore` activity (which also causes IAVL churn, but only 100 entries at a time). This continues until all remaining vat state is deleted.

Then the final non-full block can remove the vatID from `terminatedVats`, and the kernel has finally completely forgotten about the vat. The next call to `getNextMessageAndProcessor()` will not find any cleanup work to do.

The final requirement is that we run BOYDs frequently enough to avoid aggregating our carefully-spaced-out cleanup work back into a single large delivery. Our current scheduler has two ways to trigger a BOYD:

- `${vatID}.reapInterval` triggers a BOYD every 1000 deliveries
- `kernel.snapshotInterval` triggers a heap snapshot (and associated BOYD) every 200 deliveries (counted from the last snapshot or `startVat`)
If we didn't change that, and we assume that eg v9-zoe is otherwise idle, we'd observe 100 pairs of `dropExports(100 krefs)` and `retireExports(100 krefs)` before a BOYD was triggered, so v9-zoe would be doing cleanup on 10k krefs at a time. This might be larger than we want (certainly better than doing all 200k at once, but we'd really prefer to keep things limited to the `budget`, not `budget * snapshotInterval`).

So the last step is to change the way we schedule BOYDs. As @mhofman has recommended, the scheduler should pay attention to more than just the number of deliveries. We can count computrons, deliveries, and also the number of krefs that have been dropped or retired. When any of these values grows above some threshold, we can trigger a BOYD and heap snapshot.
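A sketch of that kind of multi-threshold reap scheduler (the threshold values and field names are illustrative, not the ones #8980 settled on):

```js
// Hypothetical per-vat reap/BOYD scheduler: trigger when any tracked
// quantity (deliveries, computrons, dropped/retired krefs) crosses its
// threshold, then reset the counters after the BOYD.
const makeBOYDScheduler = ({ deliveries = 1000, computrons = 100_000_000n, krefs = 20 }) => {
  const counters = { deliveries: 0, computrons: 0n, krefs: 0 };
  return {
    recordDelivery: usedComputrons => {
      counters.deliveries += 1;
      counters.computrons += usedComputrons;
    },
    recordGC: krefCount => {
      counters.krefs += krefCount; // krefs dropped or retired into this vat
    },
    shouldBOYD: () =>
      counters.deliveries >= deliveries ||
      counters.computrons >= computrons ||
      counters.krefs >= krefs,
    reset: () => {
      counters.deliveries = 0;
      counters.computrons = 0n;
      counters.krefs = 0;
    },
  };
};
```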
(the trick is to avoid doing multiple heap snapshots in a single block, because that's just wasted effort)
Security Considerations
We must continue to maintain some invariants.
Dead vats tell no tales
We stop the vat worker, and delete the metadata keys that would allow vat-warehouse to start a new one. So the dead vat will never again get agency, and it never gets a chance to emit new messages.
You can't speak to the dead (vat)
All messages and deliveries to the now-dead vat must go splat. We ensure this by having all message deliveries check for the vatID in `terminatedVats` first.

other security-ish considerations
We don't want other vat code to be able to tell that termination work is being spread out over time. By rejecting all the promises immediately (and firing `done()` right away), we provoke the only userspace-visible consequences of termination. The remainder are GC effects (such as WeakMap entries being deleted), which we've been careful to hide from userspace.

This change should close off a significant portion of attacks in which a malicious vat attempts to overload the kernel with cleanup work. It does not prevent high-cardinality `syscall.dropImports()`: those are better managed with a kernel-side policy to terminate vats which try to drop too much (just like a too-many-computrons-per-delivery policy), combined with an #8417-style liveslots change to avoid getting terminated.

It also does not prevent the threat of lots of outstanding Promises. We could choose to address that by only rejecting a few of them each time through (moving step A into the slow path), trading off kernel robustness against letting other vats react promptly to the termination event.
Scaling Considerations
This should enable the kernel to survive the termination of large-scale vats.
It does, of course, cost more time/CPU overall. The data structures it accesses are all constant-time, so the overhead comes from the extra per-block checks (like the `kvStore.getNextKey()` to see if there is more work to be done). But I think the number of extra calls will be minimal.

Test Plan
Unit tests on the kernel to terminate a vat with lots of state and observe that the work is spread out over many calls. Unit tests on the new `allowCleanup()` run-policy method and its effects.
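The kernel-side test could be shaped roughly like this (an AVA-style sketch; `buildKernelWithLargeVat`, `countVatKvEntries`, and the exact controller calls are hypothetical stand-ins for the real test harness):

```js
// Hypothetical AVA test sketch: terminate a large vat under a budget=5
// policy and check that cleanup spans many runs instead of one giant one.
import test from 'ava';

test('terminated vat is deleted slowly under a budgeted policy', async t => {
  const { controller, countVatKvEntries, vatID } = await buildKernelWithLargeVat();
  const policy = {
    allowCleanup: () => ({ budget: 5 }),
    crankComplete: () => true,
    crankFailed: () => true,
    emptyCrank: () => true,
  };

  await controller.terminateVat(vatID); // marks it terminated, no bulk deletion
  const before = countVatKvEntries(vatID);
  t.true(before > 1000, 'vat should still own lots of state right after termination');

  let runs = 0;
  while (countVatKvEntries(vatID) > 0) {
    await controller.run(policy); // each run may do at most 5 cleanup steps
    runs += 1;
  }
  t.true(runs > before / 10, 'cleanup was spread over many runs');
});
```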
Before deployment (specifically before we trigger a vat termination), we'll need a main-fork simulation of deleting both large price-feed vats. I'm particularly interested in (and concerned about) the shadow-into-IAVL overhead of the DB churn. I think this approach gives us the best chance of reducing that churn to a manageable size, but I still want to know exactly how long the real chain will take to perform this cleanup, and especially how the eventual cosmos-sdk state pruning event plays out.
Upgrade Considerations
The only swing-store (kvStore) change is the new `vats.terminated` key. This will be missing in vN-1, and when vN starts, the kernel will notice the missing key and initialize it to an empty list. No vats will be in the partially-terminated state upon entry to vN, so we don't need to accommodate any existing state other than the "missing key" one.