Agoric / agoric-sdk

monorepo for the Agoric JavaScript smart contract platform

test kernel GC code for large-cardinality operations #8402

Closed warner closed 3 weeks ago

warner commented 11 months ago

What is the Problem Being Solved?

Some of the remediations for #8400 and #8401 might cause a vat to drop/retire 100k or more vrefs in a single crank. We'll try to avoid that, but the kernel should be prepared to cope with GC operations on that scale.

When I implemented the kernel GC system, I expected most vats would release a few dozen objects at a time, so I'm worried about things like:

In addition to fixing the algorithmic complexity, we should implement some rate-limiting, and interleave GC work with "real"/productive vat deliveries. The current implementation seeks to drain the GC queue before touching the run-queue. GC actions are not limited by a meter, but I think their deliveries will still count against the runPolicy, so we might experience a series of full blocks that perform no productive deliveries because all the computrons are spent on GC. (This might not be a bad thing if it only took a minute, but I'm sure folks would be upset if the chain took an hour to get through the backlog).

Description of the Design

The main task is to implement a unit test that uses fakeGC to build a vat with e.g. 10k imports, and then drops them all in a single syscall. The test should finish in a reasonable amount of time, and it should include some instrumentation (perhaps counting the number of kvStore accesses) and assert some kind of limit. We should manually change the 10k to other values, measure usage, graph it, and convince ourselves that it's linear, at least over the range of sizes we care about.
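
As a sketch of the instrumentation idea only (not the actual harness), the kvStore handed to the kernel under test could be wrapped so every access is counted; the kernel wiring and the fake vat are omitted here, and `makeCountingKVStore` is an invented name.

```js
// Sketch only: count kvStore accesses so the test can assert a per-vref budget.
// The real test would pass `kvStore` into the kernel under test, have the fake
// vat drop N vrefs in one syscall, then check the counters.
const makeCountingKVStore = () => {
  const map = new Map();
  const counts = { has: 0, get: 0, set: 0, delete: 0 };
  const kvStore = {
    has: key => {
      counts.has += 1;
      return map.has(key);
    },
    get: key => {
      counts.get += 1;
      return map.get(key);
    },
    set: (key, value) => {
      counts.set += 1;
      map.set(key, value);
    },
    delete: key => {
      counts.delete += 1;
      map.delete(key);
    },
  };
  return { kvStore, counts };
};

// In an AVA test, after kernel.run() finishes processing the big drop:
//   const total = counts.has + counts.get + counts.set + counts.delete;
//   t.true(total <= PER_VREF_BUDGET * N);
```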

Then we should identify and fix any problems that the test reveals.

If we find that it's too hard to achieve that, we should document a practical upper limit that vats are subject to, and consider adding defenses at the kernel-side syscall handlers to reject/ignore attempts to exceed them.

Security Considerations

Vats must not be able to degrade or disrupt the kernel by dropping a large number of objects at the same time.

Scaling Considerations

Kernel operations should scale well with the number of objects being dropped or retired.

Test Plan

The final PR should include a test that runs in a reasonable amount of time, which exercises a sizable GC event, and asserts that some measurable consequence (e.g. number of kvStore reads) is adequately small.

Upgrade Considerations

One outcome might be that we change the way we store the GC action queue, which amounts to a kvStore schema change. The kernel needs to tolerate multiple versions of the kvStore, both the old and new approaches. We must have unit tests to verify this, which will probably involve manually editing a blank kernelStorage to create the old-style keys.

It might also create deeper changes in the GC relationship between vats and the kernel, such as deferring the consequences of syscall.dropImport until later. That would be a significant protocol change (e.g. pipelining drops, which needs protection against "ships crossing in the night" desync problems), and the new kernel would need to handle both versions (e.g. remain backwards-compatible with the old liveslots still running in existing vats).

mhofman commented 11 months ago

I'm sure folks would be upset if the chain took an hour to get through the backlog

It really depends. If this GC processing occurs during a chain upgrade, and we can arrange for it to run to completion during the upgrade block (with an unlimited run policy), I think it may actually be preferable to do it then instead of spreading it over multiple later blocks.

mhofman commented 11 months ago

amounts to a kvStore schema change. The kernel needs to tolerate multiple versions of the kvStore, both the old and new approaches

I agree it's a schema change, but we can perform schema migrations on upgrades such that the new kernel code only needs to deal with the new schema (only the migration code needs to be aware of the old schema)
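
As an illustration of that migrate-once approach (key names, version numbers, and the example migration are all invented for the sketch, not the kernel's actual schema), the upgrade path could bump a stored schema version and rewrite old-style keys before the new kernel code runs:

```js
// Illustrative sketch: run pending schema migrations once, at kernel upgrade,
// so steady-state code only ever sees the current schema.
const CURRENT_SCHEMA_VERSION = 2;

const migrateKVStoreSchema = kvStore => {
  let version = Number(kvStore.get('kernel.schemaVersion') ?? '1');
  if (version === 1) {
    // Example migration: split a single JSON-encoded GC action set into
    // individually keyed queue entries.
    const old = kvStore.get('gcActions');
    if (old !== undefined) {
      const actions = JSON.parse(old);
      actions.forEach((action, i) =>
        kvStore.set(`gcActions.${i}`, JSON.stringify(action)),
      );
      kvStore.set('gcActions.length', `${actions.length}`);
      kvStore.delete('gcActions');
    }
    version = CURRENT_SCHEMA_VERSION;
  }
  kvStore.set('kernel.schemaVersion', `${version}`);
};
```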

mhofman commented 11 months ago

the new kernel would need to handle both versions (e.g. be backwards-compatible with old liveslots, as running in any existing vats

I don't quite follow what requires special backwards compatibility. The way I see it, there are two levels of changes:

I actually do not understand why we have this higher prioritization. The way I see it, dropImport should be like any other "distributed object message" flowing in our system. As such it should just ride the normal message queue. While that does not immediately solve the problem of how a vat's GC will cause some bunched-up GC-related activity to occur, we already have a design for that: per-vat inbound and outbound queues (#5025). It also guarantees there are no desync problems, because all items from a vat's outbound queue must be processed before any more inbound queue items are, making sure the vat will be explicitly re-introduced to an object it previously dropped.

warner commented 11 months ago

discovering new dropped imports: by the end of the BOYD delivery the kernel will have seen all dropImports of the vat. Whether that comes in a single or multiple "syscalls" is the only protocol change the kernel needs to handle. I suspect it could be compatible with both already

Yeah, just doing more+smaller syscalls is easy.

staggering of the propagation of dropImport consequences, which from what you describe is currently a sort of "high priority gc queue"

I actually do not understand why we have this higher prioritization. The way I see it, dropImport should be like any other "distributed object message" flowing in our system. As such it should just ride the normal message queue.

I think you're right, and it's not necessary. I remember wanting to get GC out of the way promptly, and thus draining the GC-action queue before looking at the run-queue. But I also remember being defensive against the block ending (due to the runPolicy) and device inputs getting a chance to inject messages (changing refcounts) while there were still items on the GC-action queue. So each GC-action is examined to see whether it is still relevant: if the queue says to tell vat v1 to retireImport(ko123), we check that ko123 is still in the vat's c-list before delivering that message.
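
Roughly, that staleness check looks like the following (names are invented for the sketch; the real kernel's GC-action processing is more involved):

```js
// Sketch of the relevance check described above: a queued GC action is only
// delivered if the target kref is still in that vat's c-list. The shapes here
// (gcAction, vatKeeper.hasCListEntry) are illustrative, not the kernel's API.
const isGCActionStillRelevant = (vatKeeper, gcAction) => {
  const { type, kref } = gcAction; // e.g. { type: 'retireImport', kref: 'ko123' }
  switch (type) {
    case 'dropExport':
    case 'retireExport':
    case 'retireImport':
      // If the vat has already forgotten this kref, skip the delivery.
      return vatKeeper.hasCListEntry(kref);
    default:
      return true;
  }
};
```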

While that does not immediately solve the problem of how a vat's GC will cause some bunched-up GC-related activity to occur, we already have a design for that: per-vat inbound and outbound queues (#5025). It also guarantees there are no desync problems, because all items from a vat's outbound queue must be processed before any more inbound queue items are, making sure the vat will be explicitly re-introduced to an object it previously dropped.

I'm not sure I want our hypothetical scheduler to be obligated to drain a vat's outbound queue before allowing the deliveries of more inbound queue items.

But, the deeper concern I had is the "ships in the night" problem, and I think there are changes we might make which could introduce this problem. But I'd have to think about it more carefully. I know the comms protocol has this problem, and I remember discussing solutions with Mark and Dean that included incref/decref messages and counters on both sides. In the swingset context, the concern would be a vat which announces syscall.dropImport, which gets outbound-queued and not noticed by the kernel for a while, meanwhile a dispatch.deliver is sitting on the inbound-queue which will re-introduce the object. The c-list needs to basically sit "outside" the two queues, so the c-list will contain an entry for everything known by the vat or living in a queued item.

Basically I'm just reminding myself that it could be tricky, and to think about it further if we consider deeper changes.

mhofman commented 11 months ago

I'm not sure I want our hypothetical scheduler to be obligated to drain a vat's outbound queue before allowing the deliveries of more inbound queue items.

But, the deeper concern I had is the "ships in the night" problem, and I think there are changes we might make which could introduce this problem. But I'd have to think about it more carefully. I know the comms protocol has this, and I remember discussing solutions with Mark and Dean that included incref/decref messages and counters on both sides. In the swingset context, the concern would be a vat which announces syscall.dropImport, which gets outbound-queued and not noticed by the kernel for a while, meanwhile a dispatch.deliver is sitting on the inbound-queue which will re-introduce the object.

The "process outbound queue items before processing the next inbound queue item" is a current design consequence of the 2 queues per vat approach (#5025), motivated in part by this problem of "ships in the night". dropImport and re-introduction is not the only case, but also promise forwarding and pipelining (which are not currently a concern for liveslots vats, but there is no reason to keep this limitation long term). I think we need to solve this problem for all these cases, not just for GC related messages. A simple way to prevent this problem right now is by disallowing any concurrent crossing at the vat/kernel boundary. It's definitely something we need to figure out by the time we tackle parallel execution (#6447). Luckily the problem is simpler than comms because the boundary between kernel / vat can be made half duplex without much penalty, even when vats live in a separate process.

The c-list needs to basically sit "outside" the two queues, so the c-list will contain an entry for everything known by the vat or living in a queued item.

Yes, that is likely the correct approach to removing the vat-queue processing-order restriction. The way I see it, the per-vat queues are part of the "vat", but not part of liveslots. That means we need some vat logic outside the xsnap worker that can remove dropImports from the "pending outbound queue" if the kernel re-introduces the import in an inbound message, or that can rewrite any inbound send to a promise if a promise forward is sitting in the "pending outbound queue". All of that needs to happen without touching liveslots.
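
A rough sketch of that outbound-queue reconciliation (all names and shapes invented): an inbound delivery that re-introduces some vrefs cancels any not-yet-delivered dropImports for them.

```js
// Sketch only: reconcile a vat's pending outbound GC items against an inbound
// delivery that re-introduces some vrefs.
const reconcilePendingOutbound = (pendingOutbound, inboundSlots) => {
  const reintroduced = new Set(inboundSlots);
  return pendingOutbound.flatMap(item => {
    if (item.type !== 'dropImports') {
      return [item];
    }
    const vrefs = item.vrefs.filter(vref => !reintroduced.has(vref));
    // Drop the item entirely if every vref in it was re-introduced.
    return vrefs.length ? [{ ...item, vrefs }] : [];
  });
};
```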

warner commented 9 months ago

I wrote a quick test of kernel behavior in the face of large GC syscalls, and it appears to take linear time. A few things are implemented better than I feared:

My test of just the kernel side uses a fake vat (no liveslots) that calls syscall.dropImports/etc directly. The test case has a setup phase in which the first vat exports vrefs (in moderately sized batches) to a second vat. Then the "drop" phase tells the second vat to syscall.dropImports all of those vrefs at the same time, and we measure how long the overall kernel.run() takes. The "retire" phase then has the second vat retireImports everything at once.

| N | drop time | retire time |
| ---: | ---: | ---: |
| 40 | 4.6ms | 6.4ms |
| 400 | 16.6ms | 25.5ms |
| 4,000 | 125ms | 183ms |
| 40,000 | 1.11s | 1.72s |
| 400,000 | 11.45s | 18.47s |
warner commented 7 months ago

... however, when doing a ghost-replay mainnet simulation of deleting v29-ATOM-USD_price_feed (which, in run-5, participates in 50k #8401 cycles), the process took over 30 minutes (the drop itself took 7min, and the subsequent reapAllVats took 23min). Sending all those GC messages into v9-zoe took a long time:

dropExports needs to do an export-status decrement for each vref. I'm guessing this does a has, a get and a set. We might be able to squeeze out the has and reduce the multiplier from 3x to 2x.

retireExports probably does a get and a delete.

bringOutYourDead has to check everything about each vref: both export-status and refcount. Then when it decides it can delete the target, it needs to read the contents (for any outgoing vrefs), then decref them, then check their export-status and refcounts, and finally emit syscall.dropImports/etc for the outgoing sides of the cycle. Given the complete graph of objects and vatstore entries involved, that yields 50 syscalls per cycle, and about 100 bytes of transcript entry per syscall. We might be able to reduce that 50x somewhat (collapse has-plus-get into just get-and-is-not-undefined), and we might be able to spend some more RAM on caching, but I'm not hopeful we can shave off too much.
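
The has-plus-get collapse mentioned above is just the usual one-read pattern, shown here schematically (not the actual liveslots code):

```js
// Sketch of the "squeeze out the has" idea.
const nextStatus = status => status; // stand-in for the real export-status transition

// Two vatstore reads per vref:
const decrementWithHas = (vatstore, key) => {
  if (vatstore.has(key)) {
    vatstore.set(key, nextStatus(vatstore.get(key)));
  }
};

// One read per vref: a get whose undefined result means "absent".
const decrementWithoutHas = (vatstore, key) => {
  const status = vatstore.get(key);
  if (status !== undefined) {
    vatstore.set(key, nextStatus(status));
  }
};
```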

When I performed the same test with run-9 (in which there are 79k cycles), the kernel process crashed (OOM) after four hours. I don't know what exactly went wrong, but the BOYD syscalls should have scaled linearly (so maybe 4M syscalls), and joining the 358MB transcript entry might have blown past some memory limits, or triggered some superlinear allocation/string-copying behavior in V8.
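
For scale, the arithmetic from those figures works out roughly like this (a back-of-envelope check using only the numbers quoted above):

```js
// Back-of-envelope: ~50 syscalls per cycle and ~100 bytes of transcript per syscall.
const cycles = 79_000; // run-9
const syscallsPerCycle = 50;
const bytesPerSyscall = 100;
const totalSyscalls = cycles * syscallsPerCycle; // ~3.95M, matching "maybe 4M syscalls"
const transcriptBytes = totalSyscalls * bytesPerSyscall; // ~395MB, same order as the 358MB entry
```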

Even if we find+fix the time/OOM problem, we're still talking about a massive transcript entry. So I think that puts a practical limit on the number of syscalls that any single crank can make, just to manage the resulting transcript entry size (unless/until we move to a transcriptStore model that records syscalls separately, instead of using one-string-per-delivery). Which imposes a limit on how many vrefs we can afford to process in each BOYD.

That means we either need to limit the number of vrefs we tell a vat about before asking it to BOYD (rate-limiting in the kernel), or ask liveslots to limit the number of vrefs it processes during each BOYD (rate-limiting in the vat, maybe expanding the scope of #8417 to include dispatch.dropExports in the set of things that get queued for slower processing).
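
One possible shape of the vat-side option is sketched below, under the assumption that liveslots tracks candidates in a possiblyDeadSet-like collection; this is only an illustration, not the #8417 design.

```js
// Sketch: process at most `budget` candidate vrefs per BOYD, deferring the
// rest to a later delivery. `possiblyDeadSet` is a Set of vrefs and
// `processOneVref` performs the refcount/export-status checks and deletions.
const makeRateLimitedBOYD = (possiblyDeadSet, processOneVref, budget = 1000) => {
  return function bringOutYourDead() {
    let processed = 0;
    for (const vref of [...possiblyDeadSet]) {
      if (processed >= budget) {
        break; // leave the remainder for the next BOYD
      }
      possiblyDeadSet.delete(vref);
      processOneVref(vref);
      processed += 1;
    }
    return { processed, remaining: possiblyDeadSet.size };
  };
};
```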

warner commented 7 months ago

I built a test that uses a non-liveslots local-worker vat, which just does a large number of trivial syscalls (vatstoreSet('key', 'value')). It does a random-sized batch, then an engineGC(), then repeats.
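
The flood loop is roughly this shape (illustrative only; the real fake vat issues these through the local-worker syscall object, and `floodSyscalls` is an invented name):

```js
// Sketch of the syscall-flood pattern described above: random-sized batches of
// trivial vatstoreSet syscalls, with a forced GC between batches.
const floodSyscalls = (syscall, totalSyscalls, maxBatch = 50_000) => {
  let issued = 0;
  while (issued < totalSyscalls) {
    const batch = Math.min(
      1 + Math.floor(Math.random() * maxBatch),
      totalSyscalls - issued,
    );
    for (let i = 0; i < batch; i += 1) {
      syscall.vatstoreSet('key', 'value'); // the trivial syscall from the test
      issued += 1;
    }
    globalThis.gc?.(); // stands in for engineGC(); requires --expose-gc under Node
  }
};
```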

The runtime is loosely proportional to the number of syscalls, but:

In the output below, we emit syscall NN lines every 100k syscalls. The t+.. lines indicate the elapsed time between this line and the previous one. The elapsed line shows the time for the overall delivery. transcript-time is how long it took to serialize and write the transcript to the DB, and transcript-size is the JSON.stringify(transcript).length size. All times are in seconds.

```
-- starting 1120959 syscalls
syscall 0
syscall 100000 (t+10.362569629907611)
syscall 200000 (t+11.075982588052739)
syscall 300000 (t+11.897490797042849)
syscall 400000 (t+4495.102133416891)
syscall 500000 (t+13.950275174141098)
syscall 600000 (t+14.652066380977885)
syscall 700000 (t+15.142665174007561)
syscall 800000 (t+31.86975130486462)
syscall 900000 (t+16.704861508131216)
syscall 1000000 (t+17.286801424026635)
syscall 1100000 (t+43.28629618692412)
deliver done
elapsed:   4686.476825003862
transcript-time:   0.9826616251466476
transcript-size:   58289985
```

So 1.12M syscalls took one second to serialize, and wrote a 58MB item to transcriptStore.

mhofman commented 7 months ago

That means we either need to limit the number of vrefs we tell a vat about before asking it to BOYD (rate-limiting in the kernel), or ask liveslots to limit the number of vrefs it processes during each BOYD (rate-limiting in the vat, maybe expanding the scope of #8417 to include dispatch.dropExports in the set of things that get queued for slower processing).

I'd rather the kernel kept that list. In the vat it would be stored in the heap (what about upgrades?), and it wouldn't be actionable until another BOYD delivery from the kernel anyway.

warner commented 7 months ago

I built another test to emulate the #8401 cycles: it makes two vats, each with a makeScalarBigWeakMapStore, which maps handles to holders (both durable objects). The holder on each side has a state property that (strongly) imports the handle from the other side.
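
For concreteness, one side of that setup might look roughly like this; an untested sketch, not the actual test source. The real test's vat code, bootstrap wiring, and batching differ, and the interface guards here are minimal placeholders.

```js
// Rough sketch of one side's vat for the cycle-emulation test. Each accepted
// remote handle gets a local holder whose durable state strongly references
// that handle, and the holder is stored under the remote handle in a big weak
// map store.
import { Far } from '@endo/far';
import { M } from '@endo/patterns';
import {
  makeScalarBigWeakMapStore,
  prepareExoClass,
} from '@agoric/vat-data';

export const buildRootObject = (_vatPowers, _vatParameters, baggage) => {
  // handle (from the other vat) -> holder (local durable object)
  const handleToHolder = makeScalarBigWeakMapStore('handleToHolder', {
    durable: true,
  });

  const makeHolder = prepareExoClass(
    baggage,
    'Holder',
    M.interface('Holder', { describe: M.call().returns(M.string()) }),
    otherHandle => ({ otherHandle }), // strong reference closes the cross-vat cycle
    {
      describe() {
        return 'holder';
      },
    },
  );

  const makeHandle = prepareExoClass(
    baggage,
    'Handle',
    M.interface('Handle', { describe: M.call().returns(M.string()) }),
    () => ({}),
    {
      describe() {
        return 'handle';
      },
    },
  );

  return Far('root', {
    // Accept a batch of the other side's handles and return our own handles,
    // which the other side stores the same way.
    acceptHandles: otherHandles =>
      otherHandles.map(otherHandle => {
        handleToHolder.init(otherHandle, makeHolder(otherHandle));
        return makeHandle();
      }),
  });
};
```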

The setup involves temporaries (RAM pillars) that need to be collected before we move to the termination phase, so I do a BOYD at the end of setup.

Setup runs a lot faster when I create the cycles in batches (sending an array of handles/holders instead of a single one). Batches of 1k seemed to work well; I noticed a slight slowdown at 2k. The setup phase is quite linear, as long as I force a BOYD after each batch (otherwise the temporaries accumulate, the final BOYD becomes very large, and it runs into the same accumulate-many-syscalls problem that the termination phase triggers).

In the teardown phase, I terminate the left vat and force a BOYD. The kernel drops all of the terminated vat's imports, which propagate to the right vat as a dispatch.dropExports() followed by a dispatch.retireExports(), both of which make export-status (v3.vom.es.${vref}) changes and push vrefs onto possiblyDeadSet for later processing. When the BOYD happens, the right vat does a boatload of refcount/export-status checks (lots of vatstoreGets) and object deletions (vatstoreDelete). I think it is this large number of syscalls that ultimately provokes the kernel-side problems.

I observed the beginnings of superlinearity in the terminate phase between 32k and 64k cycles (2.2x instead of 2.0x), and clear problems at 128k (30x instead of 2.0x). In this table, all times are in seconds, and kvstore is the total number of kvstore entries (excluding the ones created by the batch setup phase; there are some lingering queueToKref promises that inflated the counts):

| count | setup | drop | total | kvstore |
| ---: | ---: | ---: | ---: | ---: |
| 1000 | 9.55 | 6.50 | 29.2 | 28246 |
| 2000 | 18.88 | 12.47 | 44.7 | 56246 |
| 4000 | 37.68 | 25.27 | 76.5 | 112246 |
| 8000 | 75.83 | 50.67 | 140.0 | 224246 |
| 16000 | 149.76 | 105.31 | 269.0 | 448246 |
| 32000 | 276.20 | 205.08 | 495.5 | 896246 |
| 64000 | 538.51 | 449.73 | 1002.2 | 1792246 |
| 128000 | 1046.10 | 13294.13 | 14336 | 3584246 |

To make this a more useful test, I'd like it to fail earlier. The actual Zoe vat has a number of other objects referenced by the cycle (but not supporting it), which will increase the number of syscalls it does during the BOYD. My next step will be to add this "ballast" and see if I can get a test+count that can be set up in a reasonable time but which both completes in a tolerable time and shows clear signs of superlinearity, so we can confidently use it to validate our code fix (#8417).

warner commented 3 weeks ago

I think I've done enough testing here. The kernel-side transcript limitations mean that we need to avoid large cranks (lots of syscalls), and the earliest way to accomplish that is to delete vats slowly (#8928), for which most of the code has just landed (just the cosmic-swingset integration part is left). That's not a complete fix; we still need defenses against vats which do too much, but we can't survive deleting the large price-feed vats (that resulted from #8400 / #8401) without slow deletion.

So I'm going to close this ticket now; I think it's done its job.