Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

do basic perf analysis of a recent loadgen run #4525

Open warner opened 2 years ago

warner commented 2 years ago

What is the Problem Being Solved?

From https://github.com/Agoric/agoric-sdk/issues/2630#issuecomment-1022723000 we have a list of 7 performance questions that I want to have answers for. @mhofman has a service running every night, which collects enough data to answer them. The service does some amount of aggregation / number-crunching already, plus there's the raw slogfiles and kernel databases. He's going to provide me with docs (ssh hostname, directories, how to interpret the data files I'll find there) to get access to that data.

Once I have that access, my task is to apply the ad-hoc slogfile parsing tools I've assembled to produce answers for those seven questions, and post the results here. The answers will take the form "we added 3 kernel objects per loadgen cycle, but we want N=0", or "each loadgen cycle adds 12ms to the time swingset spends per cycle, but we want that to be 0".

warner commented 2 years ago

I'm studying the performance of a loadgen run that @mhofman's scripts ran around 26-Jan-2022. We've got 12 hours of data from a single-node chain that was bombarded with vault and AMM tasks as fast as it could keep up. I'm looking at the slogfiles collected from the overall run, searching for evidence that things are running more slowly over time rather than staying flat.

I paired the deliver and deliver-result slog events to compute an elapsed time for each delivery, then filtered by vatID, delivery type, and method name, and ran a simple linear regression.
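
Roughly, the pairing step looks like this (a sketch, not the actual bydelivery.py; the slog field names here are my best recollection and may not match the real format exactly):

# pairdeliveries.py (sketch)
# Pairs 'deliver' / 'deliver-result' slog entries per vat and accumulates
# elapsed wall-clock time by delivery type.
import json
import sys
from collections import defaultdict

pending = {}                  # vatID -> (start time, delivery type)
totals = defaultdict(float)   # delivery type -> cumulative seconds

for line in open(sys.argv[1]):
    event = json.loads(line)
    etype = event.get('type')
    if etype == 'deliver':
        kd = event.get('kd') or ['unknown']   # e.g. ['message', ...]
        pending[event['vatID']] = (event['time'], kd[0])
    elif etype == 'deliver-result':
        start = pending.pop(event.get('vatID'), None)
        if start is not None:
            t0, dtype = start
            totals[dtype] += event['time'] - t0

for dtype, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f'{seconds:12.3f}s {dtype}')
print(f'{sum(totals.values()):12.3f}s total')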

# bydelivery.py
 18170.613s bringOutYourDead
  2781.097s    message
  1611.921s     notify
   217.177s dropExports
   119.596s retireImports
    96.822s retireExports
    --
 22997.226s total

In a 12-hour test (43600s), we spent 23000 seconds making vat deliveries: the swingset utilization was about 53%. Of that time, 18170s (79% of swingset time) was spent doing bringOutYourDead deliveries (GC flush and finalizer execution), 4393s (19%) processing dispatch.deliver messages and dispatch.notify promise resolutions, and 433s (2%) doing GC deliveries.

If we focus on v11 (zoe):

# alltrend.py
vat v11
 message(exit)
   85.975s in 7882 calls (blocks 65 - 7443)
   cumulative elapsed: 85.975302s
   first  0.009790s (avg100) / 0.010379s (linregr)
   last   0.011712s (avg100) / 0.011454s (linregr)
 message(getUpdateSince)
   67.128s in 12099 calls (blocks 64 - 7443)
   cumulative elapsed: 67.128041s
   first  0.004768s (avg100) / 0.005199s (linregr)
   last   0.006053s (avg100) / 0.005934s (linregr)
 message(replaceAllocations)
   64.129s in 6782 calls (blocks 65 - 7442)
   cumulative elapsed: 64.129382s
   first  0.008181s (avg100) / 0.008877s (linregr)
   last   0.009246s (avg100) / 0.010058s (linregr)
 message(getPayout)
   50.339s in 9164 calls (blocks 65 - 7441)
   cumulative elapsed: 50.338783s
   first  0.004082s (avg100) / 0.005233s (linregr)
   last   0.004759s (avg100) / 0.005764s (linregr)
 message(makeInvitation)
   41.269s in 4582 calls (blocks 64 - 7439)
   cumulative elapsed: 41.269035s
   first  0.007984s (avg100) / 0.008173s (linregr)
   last   0.011454s (avg100) / 0.009879s (linregr)
 message(makeNoEscrowSeat)
   36.863s in 3300 calls (blocks 64 - 7442)
   cumulative elapsed: 36.862807s
   first  0.010085s (avg100) / 0.010244s (linregr)
   last   0.012016s (avg100) / 0.012132s (linregr)
 message(deposit)
   36.706s in 4582 calls (blocks 66 - 7445)
   cumulative elapsed: 36.705964s
   first  0.006303s (avg100) / 0.007590s (linregr)
   last   0.014650s (avg100) / 0.008448s (linregr)
 message(offer)
   27.905s in 4582 calls (blocks 64 - 7439)
   cumulative elapsed: 27.905091s
   first  0.006350s (avg100) / 0.005998s (linregr)
   last   0.005857s (avg100) / 0.006187s (linregr)
 message(getOfferResult)
   26.167s in 4582 calls (blocks 67 - 7445)
   cumulative elapsed: 26.166575s
   first  0.004794s (avg100) / 0.005551s (linregr)
   last   0.006253s (avg100) / 0.005877s (linregr)
 message(withdraw)
   25.585s in 2291 calls (blocks 71 - 7440)
   cumulative elapsed: 25.585170s
   first  0.009648s (avg100) / 0.010096s (linregr)
   last   0.011157s (avg100) / 0.012279s (linregr)
 message(mintAndEscrow)
    8.148s in 1100 calls (blocks 65 - 7425)
   cumulative elapsed: 8.147702s
   first  0.007695s (avg100) / 0.007416s (linregr)
   last   0.007852s (avg100) / 0.007398s (linregr)
 message(withdrawAndBurn)
    8.028s in 1100 calls (blocks 72 - 7442)
   cumulative elapsed: 8.027862s
   first  0.006425s (avg100) / 0.007520s (linregr)
   last   0.006930s (avg100) / 0.007070s (linregr)
 --
 deliver(message)
   478.242s in 62046 calls (blocks 64 - 7445)
   cumulative elapsed: 478.241713s
   first  0.005649s (avg100) / 0.007233s (linregr)
   last   0.007348s (avg100) / 0.008208s (linregr)
 deliver(bringOutYourDead)
   7524.493s in 107593 calls (blocks 64 - 7447)
   cumulative elapsed: 7524.493023s
   first  0.009227s (avg100) / 0.002583s (linregr)
   last   0.146454s (avg100) / 0.140493s (linregr)
 deliver(notify)
   228.577s in 22819 calls (blocks 64 - 7443)
   cumulative elapsed: 228.576519s
   first  0.009058s (avg100) / 0.008958s (linregr)
   last   0.009755s (avg100) / 0.011121s (linregr)
 deliver(dropExports)
   106.208s in 18146 calls (blocks 64 - 7447)
   cumulative elapsed: 106.208077s
   first  0.003609s (avg100) / 0.005010s (linregr)
   last   0.004867s (avg100) / 0.006725s (linregr)
 deliver(retireExports)
   45.548s in 11182 calls (blocks 64 - 7447)
   cumulative elapsed: 45.548214s
   first  0.003002s (avg100) / 0.002787s (linregr)
   last   0.005268s (avg100) / 0.005398s (linregr)
 deliver(retireImports)
   18.558s in 5682 calls (blocks 64 - 7442)
   cumulative elapsed: 18.558226s
   first  0.002463s (avg100) / 0.002903s (linregr)
   last   0.003589s (avg100) / 0.003644s (linregr)
 --
 8401.626s all deliveries

We spent 7524s doing bringOutYourDead, 478s doing messages, and 229s doing notify (promise resolutions). Focusing further on bringOutYourDead:

 deliver(bringOutYourDead)
   7524.493s in 107593 calls (blocks 64 - 7447)
   cumulative elapsed: 7524.493023s
   first  0.009227s (avg100) / 0.002583s (linregr)
   last   0.146454s (avg100) / 0.140493s (linregr)

The time consumed by each dispatch.bringOutYourDead grows. The first 100 calls averaged 9.2ms each, but the last 100 calls averaged 146ms, roughly a 15x growth. A simple linear regression against block height puts the fitted line at 2.6ms for the first block and 140ms for the last, which works out to a growth of roughly 19us per block.
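
For reference, the avg100 and linregr figures can be reproduced with something like the following (a sketch, not the actual alltrend.py, assuming the per-delivery samples for one vat and delivery type have already been extracted as (blockHeight, elapsedSeconds) pairs):

# trend.py (sketch)
import numpy as np

def trend(samples):
    samples = sorted(samples)                  # order by block height
    blocks = np.array([b for b, _ in samples], dtype=float)
    elapsed = np.array([t for _, t in samples])
    slope, intercept = np.polyfit(blocks, elapsed, 1)
    return {
        'first_avg100': elapsed[:100].mean(),
        'last_avg100': elapsed[-100:].mean(),
        'first_linregr': slope * blocks[0] + intercept,
        'last_linregr': slope * blocks[-1] + intercept,
        'usec_per_block': slope * 1e6,         # growth rate in microseconds per block
    }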

We see a similar pattern on other vats. v22 (which receives messages named makeLoanInvitation, makeCloseInvitation, getDebtAmount, and handleOffer, so it's probably the Vault vat):

 deliver(bringOutYourDead)
   8034.227s in 54640 calls (blocks 64 - 7447)
   cumulative elapsed: 8034.227020s
   first  0.010283s (avg100) / 0.004954s (linregr)
   last   0.294659s (avg100) / 0.294064s (linregr)

v20 (probably autoswap, with makeSwapInInvitation and quoteGiven, among others):

 deliver(bringOutYourDead)
   446.984s in 30196 calls (blocks 64 - 7442)
   cumulative elapsed: 446.983660s
   first  0.009726s (avg100) / 0.010551s (linregr)
   last   0.030755s (avg100) / 0.019402s (linregr)

v5 (withdraw, burn, mintPayment, and deposit) had slower growth:

 deliver(bringOutYourDead)
   185.513s in 21719 calls (blocks 64 - 7447)
   cumulative elapsed: 185.513204s
   first  0.006317s (avg100) / 0.006768s (linregr)
   last   0.009822s (avg100) / 0.010389s (linregr)

However, v15 (vattp) had far more bringOutYourDead calls and didn't show growth:

 deliver(bringOutYourDead)
   730.347s in 150693 calls (blocks 63 - 7447)
   cumulative elapsed: 730.346920s
   first  0.004513s (avg100) / 0.004704s (linregr)
   last   0.004447s (avg100) / 0.004996s (linregr)

v12 (with toBridge, inbound, and getMyAddress, so maybe vat-bridge?) showed similarly minor growth:

 deliver(bringOutYourDead)
   116.437s in 16940 calls (blocks 63 - 7446)
   cumulative elapsed: 116.436749s
   first  0.006677s (avg100) / 0.006695s (linregr)
   last   0.007284s (avg100) / 0.007059s (linregr)

We know that a GC flush takes time O(n) in the number of live objects. We also believe there is no good reason for Zoe to be growing: the loadgen tasks should be cleaning up after themselves and not leaving outstanding object references.

So the next analysis task is to pick the simplest vat which showed significant growth in bringOutYourDead time, and look at its object c-lists to see if it was being asked to retain a significant number of objects. If so, the fault is in the importing vat, for not dropping those object references. If not, the fault lies in the growing vat itself, for example by keeping internal references to the externally-dropped objects, or through some exciting engine-level GC failure (the Moddable folks identified and fixed one a few months ago, but maybe there's more).

warner commented 2 years ago

We have a copy of the full node DB at both the halfway point (stage1) and the finish point (stage2). I can grep that for the c-lists at those two points to estimate their growth.
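
A sketch of that comparison (assuming each DB snapshot has been dumped to a text file with one "key value" pair per line; the dump format is an assumption, but the key names match what the next two paragraphs grep for):

# clistgrowth.py (sketch)
import sys

def load(path):
    kv = {}
    for line in open(path):
        key, _, value = line.rstrip('\n').partition(' ')
        kv[key] = value
    return kv

stage1, stage2, vat = sys.argv[1:4]        # e.g. stage1.txt stage2.txt v20
for label, path in (('stage1', stage1), ('stage2', stage2)):
    kv = load(path)
    imports = sum(1 for k in kv if k.startswith(f'{vat}.c.o-'))
    exports = sum(1 for k in kv if k.startswith(f'{vat}.c.o+'))
    next_id = kv.get(f'{vat}.o.nextID')
    print(f'{label}: o.nextID={next_id} imports={imports} exports={exports}')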

v20 (probably autoswap), which showed a 2-3x slowdown in bringOutYourDead time, shows v20.o.nextID growing from 5547 to 9659, which means about 4k objects were imported during that time. At both samples the size of the imported set (grep 'v20.c.o-') is exactly 51, which means the vat was emitting syscall.dropImports as expected.

For exports, grep 'v20.c.o+' reports exactly 21 objects at both samples, so the export set was not growing. Over the whole run, the vat received 7848 dispatch.dropExports deliveries (each of which can drop multiple krefs), dropping 9564 objects in total, 7798 of them unique. This means other vats were correctly dropping imports that came from v20. The combination suggests that the 2-3x slowdown in bringOutYourDead is caused either by code inside the vat accumulating objects (making the heap grow, which then slows down GC), or by something at the engine level failing to empty the heap as expected.

v11 (zoe) showed a 15x-50x slowdown. v11.o.nextID went from 15463 to 27221 (12k imports), and the v11.c.o- count went from 2562 to 4482 (2k non-freed imports). So during the last half, zoe received 12k objects, dropped 10k, and held on to 2k.

For exports, the v11.c.o+ count went from 1949 to 3389, so zoe's set of live exports grew by 1440 over that last half. At stage1, the highest o+NN was o+22597, and at stage2 it was o+39763, a delta of 17166. Over the whole run, the vat received 18146 dispatch.dropExports deliveries, dropping 36383 objects (all unique). So during the second half, zoe exported 17166 objects; the other vats that received them (and/or the kernel) dropped 15726 of them, leaving 1440 still alive for zoe to maintain. About 90% were dropped, but that should have been 100%.
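
The dispatch.dropExports totals come from scanning the slogfile; a sketch of that tally (same caveat as before: the kd layout for GC deliveries, ['dropExports', [krefs...]], is an assumption):

# dropexports.py (sketch)
import json
import sys
from collections import defaultdict

deliveries = defaultdict(int)
dropped = defaultdict(list)
for line in open(sys.argv[1]):
    event = json.loads(line)
    kd = event.get('kd')
    if event.get('type') == 'deliver' and kd and kd[0] == 'dropExports':
        deliveries[event['vatID']] += 1
        dropped[event['vatID']].extend(kd[1])

for vat in sorted(dropped, key=lambda v: -len(dropped[v])):
    krefs = dropped[vat]
    print(f'{vat:>5}: {deliveries[vat]} dropExports deliveries, '
          f'{len(krefs)} krefs dropped ({len(set(krefs))} unique)')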

Conclusions so far: v11 (zoe)'s 15-50x slowdown might be explained by other vats retaining objects that they shouldn't, causing the heap to grow and legitimately slowing down GC. v20's lesser (2-3x) slowdown is not explained by additional exports.

warner commented 2 years ago

Using the stage2 DB snapshot to figure out which vats are importing from which others, it looks like the potential import leaks are:

That might suggest a cross-vat import cycle between v11 (zoe) and v22 (vault), possibly making zoe retain imports from v5 (which I think is vat-bank, it only has an Issuer of some sort).

A random sampling of those 1115 things v11 imports from v5 shows a BLD Payment, sent from v5 to v11 via a promise resolution. The causal history is:

@erights I don't know Zoe or the BLD issuer well enough: I assume it makes sense to have an allocation with 0 tokens, but does it make sense to get a Payment for 0 tokens? And what should the caller do with that?

This kind of smells like something receives a 0-token Payment, decides it isn't worth the trouble to deposit() it, and as a result the Payment gets kept around forever.

warner commented 2 years ago

One theory to examine:

To test this, look at the kernel object table refcounts for those objects and see whether the reachable count is 0 while the recognizable count is non-zero.
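
A sketch of that check against a kvstore text dump (it assumes the kernel stores refcounts under keys like 'koNN.refCount' with a 'reachable,recognizable' value; that layout is an assumption on my part):

# refcounts.py (sketch)
import sys

dump = sys.argv[1]            # e.g. stage2.txt
krefs = set(sys.argv[2:])     # e.g. ko1234 ko5678 ...
for line in open(dump):
    key, _, value = line.rstrip('\n').partition(' ')
    if key.endswith('.refCount') and key[:-len('.refCount')] in krefs:
        reachable, recognizable = value.split(',')
        status = 'recognizable-only' if reachable == '0' else 'reachable'
        print(key, value, status)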

warner commented 2 years ago

Distinguishing between reachability and recognizability, I find:

A random sample (n=3) of the 2206 recognizable-but-not-reachable "weak imports" by v11 of objects exported by v22 shows that all are SeatHandleHandles, created by v22-vault and sent to v11-zoe in a makeNoEscrowSeat() call. A manual causality reconstruction suggests the sequence within v22-vault is something like:

async handleOffer(..) {
  await E(timer).getCurrentTimestamp();
  ..
  E(zoething).makeNoEscrowSeat(
    {},
    { exit: { onDemand: null }, give: {}, want: {} },
    exitObj,
    SeatHandleHandle
  );
  ..
}

v11-zoe performs a syscall.dropImports of the object a few cranks later, but does not do syscall.retireImports. This indicates that something in zoe has used the SeatHandleHandle as the key of a WeakMap. When v22-vault receives the corresponding dispatch.dropExports on the next crank, it does not perform a syscall.retireExports, which indicates that something in v22-vault is maintaining a strong reference to the SeatHandleHandle (i.e. the export was not the only thing keeping it alive).

v22-vault sends the SeatHandleHandle back to zoe in a series of replaceAllocations() calls (in my three samples, two did 3 calls, and one did 0 calls). Each time, zoe drops the object shortly thereafter.

At no point does v22-vault emit a syscall.retireExports.

My conclusion is that something within v22-vault is holding onto the SeatHandleHandle even though the loadgen cycle which prompted it has completed. And the continued liveness of the SeatHandleHandle is keeping other things alive within v11-zoe (possibly including some of its imports).

@erights from that description, can you figure out what SeatHandleHandle v22-vault is creating that would be used in this way, but held onto indefinitely by the vault code?

Thinking about tooling, it would be nice to have a debugger-traced local replay of v22-vault, with a breakpoint that fires the moment a given vref is allocated. However, we already know which crank caused the export of this object (it was triggered by the resolution of the getCurrentTimestamp result promise), and the breakpoint would not fire in the same turn where the E().makeNoEscrowSeat() call appeared (because the liveslots handler always fires in a later turn). Perhaps instrumenting the HandledPromise Proxy and breaking if methodname === 'makeNoEscrowSeat' would get closer.
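
A much cruder interim tool is to just print every slog entry that mentions a given kref or vref (a sketch; it substring-matches the raw JSON line, so unrelated matches are possible):

# krefhistory.py (sketch)
import json
import sys

slogfile, needle = sys.argv[1:3]   # e.g. chain.slog ko1234
for line in open(slogfile):
    if needle in line:
        event = json.loads(line)
        print(event.get('time'), event.get('crankNum'), event.get('type'),
              event.get('vatID', ''), line.strip()[:160])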

warner commented 2 years ago

Scanning zoe's zcfZygote.js, the only makeNoEscrowSeat call appears in makeEmptySeatKit, which stores seatHandle exclusively in seatData, which is passed exclusively to makeZCFSeat. That one stores seatHandle as the value of a WeakStore named zcfSeatToSeatHandle (keyed by the zcfSeat), so it will stick around until the zcfSeat goes away. The zcfSeat could be kept alive by exitObj, or by whoever called makeEmptySeatKit.

Looking at packages/run-protocol/src/vaultFactory/vault.js (and assuming that's where these come from), I see three calls to makeEmptySeatKit, which store their results in liquidationSeat, vaultSeat, and burnSeat (which is scoped to closeHook()). I'm guessing vaultSeat is the one causing trouble. I see a comment that says "vaultSeat will hold the collateral until the loan is retired".

Maybe the loadgen vault code I wrote is not retiring a Vault properly? https://github.com/Agoric/testnet-load-generator/blob/main/loadgen/contract/agent-create-vault.js is the client, and it uses makeCloseInvitation. Could someone take a look and see if it's missing anything obvious?

warner commented 2 years ago

Oh, but it's not v14-comms that's holding on to anything. In this report, 12 / 10 means that v14-comms has 12 total imports from v5-bootstrap (both weak and strong, all recognizable), but only 10 "strong imports" (both reachable and recognizable):

v14 imports X from Y:
     12 /    10 from    v5
     10 /    10 from   v11
      6 /     6 from    v1
      4 /     4 from   v22
      4 /     4 from   v12
      3 /     3 from    v6
      2 /     2 from   v15
      1 /     1 from    v7
      1 /     1 from    v2
      1 /     1 from   v16
      1 /     1 from    v9
      1 /     1 from   v28
      1 /     1 from   v20

So the external ag-solo is only holding on to a few dozen objects. Any leaks are happening entirely within the chain.
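
A sketch of how such a per-exporter import report could be computed from the kvstore dump (the c-list value format assumed here, a reachability flag followed by the vref, e.g. "v14.c.ko123 R o-45", may not match the real layout):

# importmatrix.py (sketch)
import sys
from collections import defaultdict

dump, importer = sys.argv[1:3]            # e.g. stage2.txt v14
exporter_of = {}                          # kref -> exporting vatID
imports = {}                              # kref -> True if reachable
for line in open(dump):
    key, _, value = line.rstrip('\n').partition(' ')
    if '.c.ko' not in key:
        continue
    vat, _, kref = key.partition('.c.')
    flag, _, vref = value.partition(' ')
    if vref.startswith('o+'):
        exporter_of[kref] = vat            # this vat exports the object
    elif vat == importer and vref.startswith('o-'):
        imports[kref] = (flag == 'R')      # 'R' = reachable, '_' = recognizable only

total = defaultdict(int)
strong = defaultdict(int)
for kref, reachable in imports.items():
    source = exporter_of.get(kref, 'kernel')
    total[source] += 1
    strong[source] += 1 if reachable else 0
for source in sorted(total, key=lambda v: -total[v]):
    print(f'{total[source]:8d} / {strong[source]:5d} from {source:>5}')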

warner commented 2 years ago

Current leak understanding:

mhofman commented 2 years ago

That does indeed smell like a reference cycle.

warner commented 2 years ago

This might be good for our new hire to take on.

Tartuffo commented 2 years ago

FYI @arirubinstein

Tartuffo commented 2 years ago

@warner Can we close this? Has the tooling we built with Honeycomb subsumed this?