Agoric / agoric-sdk

monorepo for the Agoric JavaScript smart contract platform

test kernel GC code for large-cardinality operations #8402

Closed warner closed 3 weeks ago

warner commented 11 months ago

What is the Problem Being Solved?

Some of the remediations for #8400 and #8401 might cause a vat to drop/retire 100k or more vrefs in a single crank. We'll try to avoid that, but the kernel should be prepared to cope with GC operations on that scale.

When I implemented the kernel GC system, I expected most vats would release a few dozen objects at a time, so I'm worried about things like:

In addition to fixing the algorithmic complexity, we should implement some rate-limiting, and interleave GC work with "real"/productive vat deliveries. The current implementation seeks to drain the GC queue before touching the run-queue. GC actions are not limited by a meter, but I think their deliveries will still count against the runPolicy, so we might experience a series of full blocks that perform no productive deliveries because all the computrons are spent on GC. (This might not be a bad thing if it only took a minute, but I'm sure folks would be upset if the chain took an hour to get through the backlog).

Description of the Design

The main task is to implement a unit test that uses fakeGC to build a vat with e.g. 10k imports, and then drops them all in a single syscall. The test should finish in a reasonable amount of time, and it should include some instrumentation (perhaps counting the number of kvStore accesses) and assert some kind of limit. We should manually change the 10k to other values, measure usage, graph it, and convince ourselves that it's linear, at least over the range of sizes we care about.
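
As a sketch of the instrumentation idea only (not the actual harness), the kvStore handed to the kernel under test could be wrapped so every access is counted; the kernel wiring and the fake vat are omitted here, and `makeCountingKVStore` is an invented name.

```js
// Sketch only: count kvStore accesses so the test can assert a per-vref budget.
// The real test would pass `kvStore` into the kernel under test, have the fake
// vat drop N vrefs in one syscall, then check the counters.
const makeCountingKVStore = () => {
  const map = new Map();
  const counts = { has: 0, get: 0, set: 0, delete: 0 };
  const kvStore = {
    has: key => {
      counts.has += 1;
      return map.has(key);
    },
    get: key => {
      counts.get += 1;
      return map.get(key);
    },
    set: (key, value) => {
      counts.set += 1;
      map.set(key, value);
    },
    delete: key => {
      counts.delete += 1;
      map.delete(key);
    },
  };
  return { kvStore, counts };
};

// In an AVA test, after kernel.run() finishes processing the big drop:
//   const total = counts.has + counts.get + counts.set + counts.delete;
//   t.true(total <= PER_VREF_BUDGET * N);
```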

Then we should identify and fix any problems that the test reveals.

If we find that it's too hard to achieve that, we should document a practical upper limit that vats are subject to, and consider adding defenses at the kernel-side syscall handlers to reject/ignore attempts to exceed them.

Security Considerations

Vats must not be able to degrade or disrupt the kernel by dropping a large number of objects at the same time.

Scaling Considerations

Kernel operations should scale well with the number of objects being dropped or retired.

Test Plan

The final PR should include a test that runs in a reasonable amount of time, which exercises a sizable GC event, and asserts that some measurable consequence (e.g. number of kvStore reads) is adequately small.

Upgrade Considerations

One outcome might be that we change the way we store the GC action queue, which amounts to a kvStore schema change. The kernel needs to tolerate multiple versions of the kvStore, both the old and new approaches. We must have unit tests to verify this, which will probably involve manually editing a blank kernelStorage to create the old-style keys.

It might also create deeper changes in the GC relationship between vats and the kernel, such as deferring the consequences of syscall.dropImport until later. That would be a significant protocol change (e.g. pipelining drops, which needs protection against "ships crossing in the night" desync problems), and the new kernel would need to handle both versions (e.g. remain backwards-compatible with the old liveslots still running in existing vats).

mhofman commented 11 months ago

I'm sure folks would be upset if the chain took an hour to get through the backlog

It really depends. If this GC processing occurs during a chain upgrade, and we can arrange for it to run to completion during the upgrade block (with an unlimited run policy), I think it may actually be preferable to do it then instead of spreading it over multiple later blocks.

mhofman commented 11 months ago

amounts to a kvStore schema change. The kernel needs to tolerate multiple versions of the kvStore, both the old and new approaches

I agree it's a schema change, but we can perform schema migrations on upgrades such that the new kernel code only needs to deal with the new schema (only the migration code needs to be aware of the old schema)
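
As an illustration of that migrate-once approach (key names, version numbers, and the example migration are all invented for the sketch, not the kernel's actual schema), the upgrade path could bump a stored schema version and rewrite old-style keys before the new kernel code runs:

```js
// Illustrative sketch: run pending schema migrations once, at kernel upgrade,
// so steady-state code only ever sees the current schema.
const CURRENT_SCHEMA_VERSION = 2;

const migrateKVStoreSchema = kvStore => {
  let version = Number(kvStore.get('kernel.schemaVersion') ?? '1');
  if (version === 1) {
    // Example migration: split a single JSON-encoded GC action set into
    // individually keyed queue entries.
    const old = kvStore.get('gcActions');
    if (old !== undefined) {
      const actions = JSON.parse(old);
      actions.forEach((action, i) =>
        kvStore.set(`gcActions.${i}`, JSON.stringify(action)),
      );
      kvStore.set('gcActions.length', `${actions.length}`);
      kvStore.delete('gcActions');
    }
    version = CURRENT_SCHEMA_VERSION;
  }
  kvStore.set('kernel.schemaVersion', `${version}`);
};
```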

mhofman commented 11 months ago

the new kernel would need to handle both versions (e.g. be backwards-compatible with old liveslots, as running in any existing vats

I don't quite follow what requires special backwards compatibility. The way I see it, there are two levels of changes:

I actually do not understand why we have this higher prioritization. The way I see it, dropImport should be like any other "distributed object message" flowing in our system. As such it should just ride the normal message queue. While that does not immediately solve the problem of how a vat's GC will cause some bunched-up GC-related activity to occur, we already have a design for that: per-vat inbound and outbound queues (#5025). It also guarantees there are no desync problems, because all items from a vat's outbound queue must be processed before any more inbound queue items are, making sure the vat will be explicitly re-introduced to an object it previously dropped.

warner commented 11 months ago

discovering new dropped imports: by the end of the BOYD delivery the kernel will have seen all dropImports of the vat. Whether that comes in a single or multiple "syscalls" is the only protocol change the kernel needs to handle. I suspect it could be compatible with both already

Yeah, just doing more+smaller syscalls is easy.

staggering of the propagation of dropImport consequences, which from what you describe is currently a sort of "high priority gc queue"

I actually do not understand why we have this higher prioritization. The way I see it, dropImport should be like any other "distributed object message" flowing in our system. As such it should just ride the normal message queue.

I think you're right, and it's not necessary. I remember wanting to get GC out of the way promptly, and thus draining the GC-action queue before looking at the run-queue. But I also remember being defensive against the block ending (due to the runPolicy) and device inputs getting a chance to inject messages (changing refcounts) while there were still items on the GC-action queue. So each GC-action is examined to see whether it is still relevant: if the queue says to tell vat v1 to retireImport(ko123), we check that ko123 is still in the vat's c-list before delivering that message.
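
Roughly, that staleness check looks like the following (names are invented for the sketch; the real kernel's GC-action processing is more involved):

```js
// Sketch of the relevance check described above: a queued GC action is only
// delivered if the target kref is still in that vat's c-list. The shapes here
// (gcAction, vatKeeper.hasCListEntry) are illustrative, not the kernel's API.
const isGCActionStillRelevant = (vatKeeper, gcAction) => {
  const { type, kref } = gcAction; // e.g. { type: 'retireImport', kref: 'ko123' }
  switch (type) {
    case 'dropExport':
    case 'retireExport':
    case 'retireImport':
      // If the vat has already forgotten this kref, skip the delivery.
      return vatKeeper.hasCListEntry(kref);
    default:
      return true;
  }
};
```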

While that does not immediately solve the problem of how a vat's GC will cause some bunched-up GC-related activity to occur, we already have a design for that: per-vat inbound and outbound queues (#5025). It also guarantees there are no desync problems, because all items from a vat's outbound queue must be processed before any more inbound queue items are, making sure the vat will be explicitly re-introduced to an object it previously dropped.

I'm not sure I want our hypothetical scheduler to be obligated to drain a vat's outbound queue before allowing the deliveries of more inbound queue items.

But, the deeper concern I had is the "ships in the night" problem, and I think there are changes we might make which could introduce this problem. But I'd have to think about it more carefully. I know the comms protocol has this problem, and I remember discussing solutions with Mark and Dean that included incref/decref messages and counters on both sides. In the swingset context, the concern would be a vat which announces syscall.dropImport, which gets outbound-queued and not noticed by the kernel for a while, meanwhile a dispatch.deliver is sitting on the inbound-queue which will re-introduce the object. The c-list needs to basically sit "outside" the two queues, so the c-list will contain an entry for everything known by the vat or living in a queued item.

Basically I'm just reminding myself that it could be tricky, and to think about it further if we consider deeper changes.

mhofman commented 11 months ago

I'm not sure I want our hypothetical scheduler to be obligated to drain a vat's outbound queue before allowing the deliveries of more inbound queue items.

But, the deeper concern I had is the "ships in the night" problem, and I think there are changes we might make which could introduce this problem. But I'd have to think about it more carefully. I know the comms protocol has this, and I remember discussing solutions with Mark and Dean that included incref/decref messages and counters on both sides. In the swingset context, the concern would be a vat which announces syscall.dropImport, which gets outbound-queued and not noticed by the kernel for a while, meanwhile a dispatch.deliver is sitting on the inbound-queue which will re-introduce the object.

The "process outbound queue items before processing the next inbound queue item" is a current design consequence of the 2 queues per vat approach (#5025), motivated in part by this problem of "ships in the night". dropImport and re-introduction is not the only case, but also promise forwarding and pipelining (which are not currently a concern for liveslots vats, but there is no reason to keep this limitation long term). I think we need to solve this problem for all these cases, not just for GC related messages. A simple way to prevent this problem right now is by disallowing any concurrent crossing at the vat/kernel boundary. It's definitely something we need to figure out by the time we tackle parallel execution (#6447). Luckily the problem is simpler than comms because the boundary between kernel / vat can be made half duplex without much penalty, even when vats live in a separate process.

The c-list needs to basically sit "outside" the two queues, so the c-list will contain an entry for everything known by the vat or living in a queued item.

Yes, that is likely the correct approach to removing the vat-queue processing-order restriction. The way I see it, the per-vat queues are part of the "vat", but not part of liveslots. That means we need some vat logic outside the xsnap worker that can remove dropImports from the "pending outbound queue" if the kernel re-introduces the import in an inbound message, or that can rewrite any inbound send to a promise if a promise forward is sitting in the "pending outbound queue". All of that needs to happen without touching liveslots.
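
A rough sketch of that outbound-queue reconciliation (all names and shapes invented): an inbound delivery that re-introduces some vrefs cancels any not-yet-delivered dropImports for them.

```js
// Sketch only: reconcile a vat's pending outbound GC items against an inbound
// delivery that re-introduces some vrefs.
const reconcilePendingOutbound = (pendingOutbound, inboundSlots) => {
  const reintroduced = new Set(inboundSlots);
  return pendingOutbound.flatMap(item => {
    if (item.type !== 'dropImports') {
      return [item];
    }
    const vrefs = item.vrefs.filter(vref => !reintroduced.has(vref));
    // Drop the item entirely if every vref in it was re-introduced.
    return vrefs.length ? [{ ...item, vrefs }] : [];
  });
};
```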

warner commented 9 months ago

I wrote a quick test of kernel behavior in the face of large GC syscalls, and it appears to take linear time. A few things are implemented better than I feared:

My test of just the kernel side uses a fake vat (no liveslots) that calls syscall.dropImports/etc directly. The test case has a setup phase in which the first vat exports vrefs (in moderately sized batches) to a second vat. Then the "drop" phase tells the second vat to syscall.dropImports all of those vrefs at the same time, and we measure how long the overall kernel.run() takes. The "retire" phase then has the second vat retireImports everything at once.

| N | drop time | retire time |
| ---: | ---: | ---: |
| 40 | 4.6ms | 6.4ms |
| 400 | 16.6ms | 25.5ms |
| 4,000 | 125ms | 183ms |
| 40,000 | 1.11s | 1.72s |
| 400,000 | 11.45s | 18.47s |
warner commented 7 months ago

... however, when doing a ghost-replay mainnet simulation of deleting v29-ATOM-USD_price_feed (which, in run-5, participates in 50k #8401 cycles), the process took over 30 minutes (the drop itself took 7min, and the subsequent reapAllVats took 23min). Sending all those GC messages into v9-zoe took a long time:

dropExports needs to do an export-status decrement for each vref. I'm guessing this does a has, a get and a set. We might be able to squeeze out the has and reduce the multiplier from 3x to 2x.

retireExports probably does a get and a delete.

bringOutYourDead has to check everything about each vref: both export-status and refcount. Then when it decides it can delete the target, it needs to read the contents (for any outgoing vrefs), then decref them, then check their export-status and refcounts, and finally emit syscall.dropImports/etc for the outgoing sides of the cycle. Given the complete graph of objects and vatstore entries involved, that yields 50 syscalls per cycle, and about 100 bytes of transcript entry per syscall. We might be able to reduce that 50x somewhat (collapse has-plus-get into just get-and-is-not-undefined), and we might be able to spend some more RAM on caching, but I'm not hopeful we can shave off too much.
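
The has-plus-get collapse mentioned above is just the usual one-read pattern, shown here schematically (not the actual liveslots code):

```js
// Sketch of the "squeeze out the has" idea.
const nextStatus = status => status; // stand-in for the real export-status transition

// Two vatstore reads per vref:
const decrementWithHas = (vatstore, key) => {
  if (vatstore.has(key)) {
    vatstore.set(key, nextStatus(vatstore.get(key)));
  }
};

// One read per vref: a get whose undefined result means "absent".
const decrementWithoutHas = (vatstore, key) => {
  const status = vatstore.get(key);
  if (status !== undefined) {
    vatstore.set(key, nextStatus(status));
  }
};
```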

When I performed the same test with run-9 (in which there are 79k cycles), the kernel process crashed (OOM) after four hours. I don't know what exactly went wrong, but the BOYD syscalls should have scaled linearly (so maybe 4M syscalls), and joining the 358MB transcript entry might have blown past some memory limits, or triggered some superlinear allocation/string-copying behavior in V8.
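
For scale, the arithmetic from those figures works out roughly like this (a back-of-envelope check using only the numbers quoted above):

```js
// Back-of-envelope: ~50 syscalls per cycle and ~100 bytes of transcript per syscall.
const cycles = 79_000; // run-9
const syscallsPerCycle = 50;
const bytesPerSyscall = 100;
const totalSyscalls = cycles * syscallsPerCycle; // ~3.95M, matching "maybe 4M syscalls"
const transcriptBytes = totalSyscalls * bytesPerSyscall; // ~395MB, same order as the 358MB entry
```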

Even if we find+fix the time/OOM problem, we're still talking about a massive transcript entry. So I think that puts a practical limit on the number of syscalls that any single crank can make, just to manage the resulting transcript entry size (unless/until we move to a transcriptStore model that records syscalls separately, instead of using one-string-per-delivery). Which imposes a limit on how many vrefs we can afford to process in each BOYD.

That means we either need to limit the number of vrefs we tell a vat about before asking it to BOYD (rate-limiting in the kernel), or ask liveslots to limit the number of vrefs it processes during each BOYD (rate-limiting in the vat, maybe expanding the scope of #8417 to include dispatch.dropExports in the set of things that get queued for slower processing).
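
One possible shape of the vat-side option is sketched below, under the assumption that liveslots tracks candidates in a possiblyDeadSet-like collection; this is only an illustration, not the #8417 design.

```js
// Sketch: process at most `budget` candidate vrefs per BOYD, deferring the
// rest to a later delivery. `possiblyDeadSet` is a Set of vrefs and
// `processOneVref` performs the refcount/export-status checks and deletions.
const makeRateLimitedBOYD = (possiblyDeadSet, processOneVref, budget = 1000) => {
  return function bringOutYourDead() {
    let processed = 0;
    for (const vref of [...possiblyDeadSet]) {
      if (processed >= budget) {
        break; // leave the remainder for the next BOYD
      }
      possiblyDeadSet.delete(vref);
      processOneVref(vref);
      processed += 1;
    }
    return { processed, remaining: possiblyDeadSet.size };
  };
};
```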

warner commented 7 months ago

I built a test that uses a non-liveslots local-worker vat, which just does a large number of trivial syscalls (vatstoreSet('key', 'value')). It does a random-sized batch, then an engineGC(), then repeats.
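
The flood loop is roughly this shape (illustrative only; the real fake vat issues these through the local-worker syscall object, and `floodSyscalls` is an invented name):

```js
// Sketch of the syscall-flood pattern described above: random-sized batches of
// trivial vatstoreSet syscalls, with a forced GC between batches.
const floodSyscalls = (syscall, totalSyscalls, maxBatch = 50_000) => {
  let issued = 0;
  while (issued < totalSyscalls) {
    const batch = Math.min(
      1 + Math.floor(Math.random() * maxBatch),
      totalSyscalls - issued,
    );
    for (let i = 0; i < batch; i += 1) {
      syscall.vatstoreSet('key', 'value'); // the trivial syscall from the test
      issued += 1;
    }
    globalThis.gc?.(); // stands in for engineGC(); requires --expose-gc under Node
  }
};
```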

The runtime is loosely proportional to the number of syscalls, but:

In the output below, we emit syscall NN lines every 100k syscalls. The t+.. lines indicate the elapsed time between this line and the previous one. The elapsed line shows the time for the overall delivery. transcript-time is how long it took to serialize and write the transcript to the DB, and transcript-size is the JSON.stringify(transcript).length size. All times are in seconds.

```
-- starting 1120959 syscalls
syscall 0
syscall 100000 (t+10.362569629907611)
syscall 200000 (t+11.075982588052739)
syscall 300000 (t+11.897490797042849)
syscall 400000 (t+4495.102133416891)
syscall 500000 (t+13.950275174141098)
syscall 600000 (t+14.652066380977885)
syscall 700000 (t+15.142665174007561)
syscall 800000 (t+31.86975130486462)
syscall 900000 (t+16.704861508131216)
syscall 1000000 (t+17.286801424026635)
syscall 1100000 (t+43.28629618692412)
deliver done
elapsed:   4686.476825003862
transcript-time:   0.9826616251466476
transcript-size:   58289985
```

So 1.12M syscalls took one second to serialize, and wrote a 58MB item to transcriptStore.

mhofman commented 7 months ago

That means we either need to limit the number of vrefs we tell a vat about before asking it to BOYD (rate-limiting in the kernel), or ask liveslots to limit the number of vrefs it processes during each BOYD (rate-limiting in the vat, maybe expanding the scope of #8417 to include dispatch.dropExports in the set of things that get queued for slower processing).

I'd rather the kernel kept that list. In the vat it would be stored in the heap (what about upgrades?), and it wouldn't be actionable until another BOYD delivery from the kernel anyway.

warner commented 7 months ago

I built another test to emulate the #8401 cycles: it makes two vats, each with a makeScalarBigWeakMapStore, which maps handles to holders (both durable objects). The holder on each side has a state property that (strongly) imports the handle from the other side.
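
For concreteness, one side of that setup might look roughly like this; an untested sketch, not the actual test source. The real test's vat code, bootstrap wiring, and batching differ, and the interface guards here are minimal placeholders.

```js
// Rough sketch of one side's vat for the cycle-emulation test. Each accepted
// remote handle gets a local holder whose durable state strongly references
// that handle, and the holder is stored under the remote handle in a big weak
// map store.
import { Far } from '@endo/far';
import { M } from '@endo/patterns';
import {
  makeScalarBigWeakMapStore,
  prepareExoClass,
} from '@agoric/vat-data';

export const buildRootObject = (_vatPowers, _vatParameters, baggage) => {
  // handle (from the other vat) -> holder (local durable object)
  const handleToHolder = makeScalarBigWeakMapStore('handleToHolder', {
    durable: true,
  });

  const makeHolder = prepareExoClass(
    baggage,
    'Holder',
    M.interface('Holder', { describe: M.call().returns(M.string()) }),
    otherHandle => ({ otherHandle }), // strong reference closes the cross-vat cycle
    {
      describe() {
        return 'holder';
      },
    },
  );

  const makeHandle = prepareExoClass(
    baggage,
    'Handle',
    M.interface('Handle', { describe: M.call().returns(M.string()) }),
    () => ({}),
    {
      describe() {
        return 'handle';
      },
    },
  );

  return Far('root', {
    // Accept a batch of the other side's handles and return our own handles,
    // which the other side stores the same way.
    acceptHandles: otherHandles =>
      otherHandles.map(otherHandle => {
        handleToHolder.init(otherHandle, makeHolder(otherHandle));
        return makeHandle();
      }),
  });
};
```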

The setup involves temporaries (RAM pillars) that need to be collected before we move to the termination phase, so I do a BOYD at the end of setup.

Setup runs a lot faster when I create the cycles in batches (sending an array of handles/holders instead of a single one). Batches of 1k seemed to work well; I noticed a slight slowdown at 2k. The setup phase is quite linear, as long as I force a BOYD after each batch (otherwise the temporaries accumulate, the final BOYD becomes very large, and it runs into the same accumulate-many-syscalls problem that the termination phase triggers).

In the teardown phase, I terminate the left vat and force a BOYD. The kernel drops all of the terminated vat's imports, which propagate to the right vat as a dispatch.dropExports() followed by a dispatch.retireExports(), both of which make export-status (v3.vom.es.${vref}) changes and push vrefs onto possiblyDeadSet for later processing. When the BOYD happens, the right vat does a boatload of refcount/export-status checks (lots of vatstoreGets) and object deletions (vatstoreDelete). I think it is this large number of syscalls that ultimately provokes the kernel-side problems.

I observed the beginnings of superlinearity in the terminate phase between 32k and 64k cycles (2.2x instead of 2.0x), and clear problems at 128k (30x instead of 2.0x). In this table, all times are in seconds, and kvstore is the total number of kvstore entries (excluding the ones created by the batch setup phase; there are some lingering queueToKref promises that inflated the counts):

| count | setup | drop | total | kvstore |
| ---: | ---: | ---: | ---: | ---: |
| 1000 | 9.55 | 6.50 | 29.2 | 28246 |
| 2000 | 18.88 | 12.47 | 44.7 | 56246 |
| 4000 | 37.68 | 25.27 | 76.5 | 112246 |
| 8000 | 75.83 | 50.67 | 140.0 | 224246 |
| 16000 | 149.76 | 105.31 | 269.0 | 448246 |
| 32000 | 276.20 | 205.08 | 495.5 | 896246 |
| 64000 | 538.51 | 449.73 | 1002.2 | 1792246 |
| 128000 | 1046.10 | 13294.13 | 14336 | 3584246 |

To make this a more useful test, I'd like it to fail earlier. The actual Zoe vat has a number of other objects referenced by the cycle (but not supporting it), which will increase the number of syscalls it does during the BOYD. My next step will be to add this "ballast" and see if I can get a test+count that can be set up in a reasonable time but which both completes in a tolerable time and shows clear signs of superlinearity, so we can confidently use it to validate our code fix (#8417).

warner commented 3 weeks ago

I think I've done enough testing here. The kernel-side transcript limitations mean that we need to avoid large cranks (lots of syscalls), and the earliest way to accomplish that is to delete vats slowly (#8928), for which most of the code has just landed (just the cosmic-swingset integration part is left). That's not a complete fix; we still need defenses against vats which do too much, but we can't survive deleting the large price-feed vats (that resulted from #8400 / #8401) without slow deletion.

So I'm going to close this ticket now; I think it's done its job.