Agoric / agoric-sdk

monorepo for the Agoric JavaScript smart contract platform

xsnap-zoe segfault during replay of phase4d testnet #3451

Open · warner opened this issue 3 years ago

warner commented 3 years ago

On my testnet-monitor-2 node (follower, non-validator), following the agorictest-16 testnet, the kernel process was killed by the host machine's OOM killer during the processing of block 84107, around 6:44am PDT on Saturday 03-Jul-2021. The host machine has 4GiB of RAM and 0.5GiB of swap, and the swingset kernel process (Node.js) was recorded as using 16.5GiB VmSize and 1.876GiB RSS, so it's not particularly surprising that the machine ran out of memory, or that the Node.js process was the best target to kill.

The OOM killer was provoked by a (failed?) allocation attempt by an xsnap process. The kernel does not record the arguments, but it does record all processes considered for termination, and I can see that the provoking xsnap was using more memory than any of the others (VmSize 60591 pages, vs 15610 for most of the rest), so it's likely that vat-zoe was the trigger. The swingset kernel was killed just after finishing a delivery to vat-zoe:

$ tail -2 chain.out
{"time":1625319855.8086834,"type":"deliver","crankNum":530461,"vatID":"v10","deliveryNum":64801,"kd":["message","ko51158",{"method":"getPayout","args":{"body":"[\"In\"]","slots":[]},"result":"kp210790"}],"vd":["message","o+29505",{"method":"getPayout","args":{"body":"[\"In\"]","slots":[]},"result":"p-59060"}]}
{"time":1625319856.3706696,"type":"deliver-result","crankNum":530461,"vatID":"v10","deliveryNum":64801,"dr":["ok",null,{"meterType":"xs-meter-8","compute":31654,"allocate":201457696,"allocateChunksCalls":1,"allocateSlotsCalls":40,"garbageCollectionCount":65203,"mapSetAddCount":2757733,"mapSetRemoveCount":94908,"maxBucketSize":381}]}

But, when restarting this node about 3.5 hours later, the xsnap process hosting vat-zoe segfaulted during replay, which is surprising:

2021-07-03T18:42:02.035Z SwingSet: kernel: v15 online: xs-worker static name=timer transcript entries replayed: 58
2021-07-03T18:42:02.476Z SwingSet: kernel: v12 online: xs-worker static name=vatAdmin transcript entries replayed: 62
2021-07-03T18:42:19.238Z SwingSet: kernel: v14 online: xs-worker static name=vattp transcript entries replayed: 193
2021-07-03T18:42:23.873Z SwingSet: kernel: ##### KERNEL PANIC: unable to re-create vat v10 #####
Cannot initialize Controller ExitSignal: v10:zoe exited due to signal SIGSEGV

The host kern.log recorded brief details:

Jul  3 18:42:23 testnet-monitor-2 kernel: [326582.134718] show_signal: 5 callbacks suppressed
Jul  3 18:42:23 testnet-monitor-2 kernel: [326582.134722] traps: xsnap[162638] general protection fault ip:7f6a99c235e4 sp:7ffca8dcc6a8 error:0 in libc-2.33.so[7f6a99bb9000+16b000]
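
That trap line is enough to locate the faulting instruction: the ip lies inside the libc-2.33.so mapping, at offset ip minus base. A one-liner (plain Node, nothing SDK-specific) to compute it:

```js
// Offset of the faulting instruction within libc-2.33.so, from the trap line above.
const ip = 0x7f6a99c235e4n;
const base = 0x7f6a99bb9000n; // start of the libc mapping
console.log((ip - base).toString(16)); // -> 6a5e4, inside the 0x16b000-byte mapping
```

Feeding that offset to addr2line or gdb against the distribution's libc-2.33 debug symbols should name the function; given the allocation pressure, something in the malloc family is a plausible suspect, but that's speculation until it's reproduced under gdb.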

The restart-time chain.slog finished with:

{"time":1625337739.2822566,"type":"vat-startup-start","vatID":"v10"}

I have a copy of the swingset state (~/.ag-chain-cosmos/data/ag-cosmos-chain-state/), so I'm hoping to be able to reproduce this under gdb, especially because the crash appears to have happened either during xsnap restore-from-snapshot or during the first replayed-transcript delivery, and should not be dependent upon e.g. the rest of the chain feeding in more transactions.

Note that the deliveryNum in question (64801) means the kernel would have instructed the xsnap process to write a heap snapshot just after the delivery completed. So it's reasonable to believe that the snapshot write process would have experienced a malloc failure (and maybe worse). Indeed the snapshot directory shows a partially-written file:

warner@testnet-monitor-2:~/stuff/agoric/tlg-agorictest-16$ ls -ltr ~/.ag-chain-cosmos/data/ag-cosmos-chain-state/xs-snapshots/|tail -5
-rw-rw-r-- 1 warner warner    333052 Jul  3 13:37 63d7ff546dda12e06063c0b48096e6c6185609a1ac6cd6f41688c21be54413aa.gz
-rw-rw-r-- 1 warner warner   7398527 Jul  3 13:41 f7f61681dfdf6a039cb2a032e1d01859e26af6af9118a846cb316882811eaff4.gz
-rw-rw-r-- 1 warner warner   4348192 Jul  3 13:42 98ed8f602a45ad6cfa7226f96bbe9470b80cdffaad2197757ad70af7b1fe40d6.gz
-rw-rw-r-- 1 warner warner   4387551 Jul  3 13:43 21fe3986f6a9149d2da9b6e2578297b7a29b2bdce72de324db1ff090b4c6c07a.gz
-rw-rw-r-- 1 warner warner 203783515 Jul  3 13:44 save-raw-A5B38x.xss
warner@testnet-monitor-2:~/stuff/agoric/tlg-agorictest-16$
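
(That save-raw-A5B38x.xss temp file is 203783515 bytes, about 194MiB, which lines up with the ~192MiB allocate figure from the fatal delivery: the raw heap image was mid-write when the process died, and was never compressed and renamed to its content hash. A startup sweep along these lines would clear such debris; this is a sketch with assumed naming conventions, not the real snapStore logic:)

```js
// Sketch: remove abandoned in-progress snapshot files at startup.
// Assumes completed snapshots are content-addressed <hash>.gz and that
// anything still named save-raw-*.xss was orphaned mid-write (e.g. by an
// OOM kill); the real snapStore may manage temp files differently.
const fs = require('fs');
const path = require('path');

function cleanStaleSnapshots(snapshotDir) {
  for (const name of fs.readdirSync(snapshotDir)) {
    if (name.startsWith('save-raw-') && name.endsWith('.xss')) {
      fs.unlinkSync(path.join(snapshotDir, name));
    }
  }
}

cleanStaleSnapshots(
  path.join(process.env.HOME, '.ag-chain-cosmos/data/ag-cosmos-chain-state/xs-snapshots'),
);
```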

The entire block was aborted, so I'd expect to see the kernel LMDB entry still pointing at the previous snapshot. The transcript sqlite file is appended as deliveries are finished, but the starting-offset pointer lives in LMDB and that should still have the same value as it did before the block was started (i.e. the transcript has "future echoes" in the tail, which will be ignored at next startup).

This doesn't explain the segfault: in the new launch, the zoe process should have been reloaded from the previous (complete) snapshot, and replayed the transcript from the previous starting point. There would have been fewer than 200 deliveries to replay (because any other zoe deliveries during that block were abandoned too).
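
In outline, the restart path that segfaulted looks like this (a sketch with approximate names; replayOneDelivery and getLastSnapshot appear in the real code later in this thread, the rest is assumed):

```js
// Sketch of the vat-restart sequence described above (approximate names).
// The SIGSEGV happened somewhere in here: either in the restore step or in
// the first replayed delivery.
async function restoreVat(vatKeeper, manager) {
  // LMDB still holds the pre-abort record, pointing at the last *complete* snapshot
  const lastSnapshot = vatKeeper.getLastSnapshot();
  if (lastSnapshot) {
    await manager.restoreSnapshot(lastSnapshot.snapshotID); // assumed method name
  }
  // replay only the deliveries after the snapshot point; the "future echoes"
  // appended during the aborted block lie past the LMDB end pointer and are skipped
  const startPos = lastSnapshot ? lastSnapshot.startPos : 0;
  for (const entry of vatKeeper.getTranscript(startPos)) { // assumed accessor
    await manager.replayOneDelivery(entry.d, entry.syscalls); // entry shape assumed
  }
}
```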

cc @dckc @mhofman @FUDCo

Priority: medium. We need to understand why this happened, and fix it, but it can wait a week or two, as long as we make sure it doesn't happen during the next testnet exercise.

dckc commented 3 years ago

@warner please let me know (offline) how to log in to testnet-monitor-2.

If an ssh key helps, please use the 1st ssh key in https://github.com/dckc.keys

dckc commented 3 years ago

Also, could you point me to the tools for running just 1 vat from a slogfile?

p.s. https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/bin/extract-transcript-from-kerneldb.js and https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/bin/replay-transcript.js

perhaps somewhat dusty.

dckc commented 3 years ago

ok... updated extract-transcript-from-kerneldb.js in a90b3e0b8 and 83c916d61:

warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$  node bin/extract-transcript-from-kerneldb.js ~/.ag-chain-cosmos/data/ag-cosmos-chain-state zoe

extracting transcript for vat v10 into transcript-v10.sst
options: {
  vatParameters: { zcfBundleName: 'zcf' },
  description: 'static name=zoe',
  name: 'zoe',
  managerType: 'xs-worker'
}
vatParameters: { zcfBundleName: 'zcf' }
64792 transcript entries
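
(A quick tail-check of the extracted file, assuming the extractor writes one JSON entry per line; a streaming read handles a file of this size, hypothetical snippet:)

```js
// Count the extracted transcript entries and peek at the final one
// (assumes one JSON entry per line in transcript-v10.sst).
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('transcript-v10.sst'),
});
let count = 0;
let last;
rl.on('line', line => {
  count += 1;
  last = line;
});
rl.on('close', () => {
  console.log(`${count} entries`);
  console.log(last.slice(0, 200)); // the tail entry, truncated for display
});
```
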
dckc commented 3 years ago

warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$ node -r esm ~/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js transcript-v10.sst 
argv [ 'transcript-v10.sst' ]
replay-one-vat.js transcript.sst
using transcript transcript-v10.sst
manager created
delivery 3: ["message","o+0",{"method":"buildZoe","args":{"body":"[{\"@qclass\":\"slot\",\"iface\":\"Alleged: vatAdminService\",\"index\":0}]","slots":["o-50"]},"result":"p-60"}]
RUN ERR (Error#1)
Error#1: delivery replay with no transcript
  at makeError (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2572:17)
  at fail (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2700:20)
  at baseAssert (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2718:13)
  at Object.replayOneDelivery (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/src/kernel/vatManager/manager-helper.js:166:5)
  at replay (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js:151:21)
  at processTicksAndRejections (internal/process/task_queues.js:93:5)
  at async run (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js:170:3)

p.s. we (@michaelfig and I) used these tweaks to get it running:

diff --git a/packages/SwingSet/bin/replay-transcript.js b/packages/SwingSet/bin/replay-transcript.js
index 2b1b76b1e..77aa3f2de 100644
--- a/packages/SwingSet/bin/replay-transcript.js
+++ b/packages/SwingSet/bin/replay-transcript.js
@@ -46,6 +46,7 @@ async function replay(transcriptFile, worker = 'xs-worker') {
   const fakeKernelKeeper = {
     provideVatKeeper: _vatID => ({
       addToTranscript: () => undefined,
+      getLastSnapshot: () => undefined,
     }),
   };
   const kernelSlog = { write() {} };
@@ -127,6 +128,7 @@ async function replay(transcriptFile, worker = 'xs-worker') {
         vatConsole: console,
         vatParameters,
         compareSyscalls,
+        useTranscript: true,
       };
       const vatSyscallHandler = undefined;
       manager = await factory.createFromBundle(
@@ -167,7 +169,7 @@ async function run() {
   }
   const [transcriptFile] = args;
   console.log(`using transcript ${transcriptFile}`);
-  await replay(transcriptFile, 'local');
+  await replay(transcriptFile, 'xs-worker');
 }

 run().catch(err => console.log('RUN ERR', err));
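
(Reading the diff against the failure above: the Error#1 trace shows replayOneDelivery asserting in manager-helper.js, and useTranscript: true is what makes the manager keep a transcript to replay against; the getLastSnapshot stub satisfies a lookup the manager factory now performs; and the last hunk switches the hardcoded worker from 'local' to 'xs-worker' so the replay actually exercises xsnap.)
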
dckc commented 3 years ago

not enough transcript entries to reproduce the bug?

The log above shows "deliveryNum":64801 but the kernel DB seems to be short 9 entries (presumably the deliveries from the aborted block, which were rolled back as described above):

warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$ node bin/extract-transcript-from-kerneldb.js ~/.ag-chain-cosmos/data/ag-cosmos-chain-state/
all vats:
v1 : bank       (64883 deliveries)
...
v10 : zoe       (64792 deliveries)

p.s. replay from snapshot: oh, the problem here was when restarting from a snapshot; replay-transcript.js doesn't know how to do that, yet...
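
A snapshot-aware replay would need roughly this shape (assumed contract, inferred from the getLastSnapshot stub in the diff above; the real manager/snapStore wiring may differ):

```js
// Sketch: let replay-transcript.js start from a heap snapshot instead of
// delivery 0. lastSnapshot is the { snapshotID, startPos } record that the
// kernel DB keeps per vat (shape assumed from the stub above).
function makeSnapshotAwareKeeper(lastSnapshot) {
  return {
    provideVatKeeper: _vatID => ({
      addToTranscript: () => undefined,
      // was: getLastSnapshot: () => undefined
      getLastSnapshot: () => lastSnapshot,
    }),
  };
}
// The manager would then restore the heap image named by snapshotID before
// replaying, and the transcript fed to replayOneDelivery must begin at
// lastSnapshot.startPos rather than at delivery 0.
```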