Agoric / agoric-sdk

monorepo for the Agoric JavaScript smart contract platform

xsnap-zoe segfault during replay of phase4d testnet #3451

Open · warner opened this issue 3 years ago

warner commented 3 years ago

On my testnet-monitor-2 node (follower, non-validator), following the agorictest-16 testnet, the kernel process was killed by the host machine's OOM killer during the processing of block 84107, around 6:44am PDT on Saturday 03-Jul-2021. The host machine has 4GiB of RAM and 0.5GiB of swap, and the swingset kernel process (Node.js) was recorded as using 16.5GiB VmSize and 1.876GiB RSS, so it's not particularly surprising that the machine ran out of memory, or that the Node.js process was the best target to kill.

The OOM killer was provoked by a (failed?) allocation attempt by an xsnap process. The kernel does not record the arguments, but it does record all processes considered for termination, and I can see that the provoking xsnap was using more memory than any of the others (VmSize 60591 pages, vs 15610 for most of the rest), so it's likely that vat-zoe was the trigger. The swingset kernel was killed just after finishing a delivery to vat-zoe:

$ tail -2 chain.out
{"time":1625319855.8086834,"type":"deliver","crankNum":530461,"vatID":"v10","deliveryNum":64801,"kd":["message","ko51158",{"method":"getPayout","args":{"body":"[\"In\"]","slots":[]},"result":"kp210790"}],"vd":["message","o+29505",{"method":"getPayout","args":{"body":"[\"In\"]","slots":[]},"result":"p-59060"}]}
{"time":1625319856.3706696,"type":"deliver-result","crankNum":530461,"vatID":"v10","deliveryNum":64801,"dr":["ok",null,{"meterType":"xs-meter-8","compute":31654,"allocate":201457696,"allocateChunksCalls":1,"allocateSlotsCalls":40,"garbageCollectionCount":65203,"mapSetAddCount":2757733,"mapSetRemoveCount":94908,"maxBucketSize":381}]}

But, when restarting this node about 3.5 hours later, the xsnap process hosting vat-zoe segfaulted during replay, which is surprising:

2021-07-03T18:42:02.035Z SwingSet: kernel: v15 online: xs-worker static name=timer transcript entries replayed: 58
2021-07-03T18:42:02.476Z SwingSet: kernel: v12 online: xs-worker static name=vatAdmin transcript entries replayed: 62
2021-07-03T18:42:19.238Z SwingSet: kernel: v14 online: xs-worker static name=vattp transcript entries replayed: 193
2021-07-03T18:42:23.873Z SwingSet: kernel: ##### KERNEL PANIC: unable to re-create vat v10 #####
Cannot initialize Controller ExitSignal: v10:zoe exited due to signal SIGSEGV

The host kern.log recorded brief details:

Jul  3 18:42:23 testnet-monitor-2 kernel: [326582.134718] show_signal: 5 callbacks suppressed
Jul  3 18:42:23 testnet-monitor-2 kernel: [326582.134722] traps: xsnap[162638] general protection fault ip:7f6a99c235e4 sp:7ffca8dcc6a8 error:0 in libc-2.33.so[7f6a99bb9000+16b000]
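
That trap line is enough to locate the faulting instruction: the ip lies inside the libc-2.33.so mapping, at offset ip minus base. A one-liner (plain Node, nothing SDK-specific) to compute it:

```js
// Offset of the faulting instruction within libc-2.33.so, from the trap line above.
const ip = 0x7f6a99c235e4n;
const base = 0x7f6a99bb9000n; // start of the libc mapping
console.log((ip - base).toString(16)); // -> 6a5e4, inside the 0x16b000-byte mapping
```

Feeding that offset to addr2line or gdb against the distribution's libc-2.33 debug symbols should name the function; given the allocation pressure, something in the malloc family is a plausible suspect, but that's speculation until it's reproduced under gdb.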

The restart-time chain.slog finished with:

{"time":1625337739.2822566,"type":"vat-startup-start","vatID":"v10"}

I have a copy of the swingset state (~/.ag-chain-cosmos/data/ag-cosmos-chain-state/), so I'm hoping to be able to reproduce this under gdb, especially because the crash appears to have happened either during xsnap restore-from-snapshot or during the first replayed-transcript delivery, and should not be dependent upon e.g. the rest of the chain feeding in more transactions.

Note that the deliveryNum in question (64801) means the kernel would have instructed the xsnap process to write a heap snapshot just after the delivery completed. So it's reasonable to believe that the snapshot write process would have experienced a malloc failure (and maybe worse). Indeed the snapshot directory shows a partially-written file:

warner@testnet-monitor-2:~/stuff/agoric/tlg-agorictest-16$ ls -ltr ~/.ag-chain-cosmos/data/ag-cosmos-chain-state/xs-snapshots/|tail -5
-rw-rw-r-- 1 warner warner    333052 Jul  3 13:37 63d7ff546dda12e06063c0b48096e6c6185609a1ac6cd6f41688c21be54413aa.gz
-rw-rw-r-- 1 warner warner   7398527 Jul  3 13:41 f7f61681dfdf6a039cb2a032e1d01859e26af6af9118a846cb316882811eaff4.gz
-rw-rw-r-- 1 warner warner   4348192 Jul  3 13:42 98ed8f602a45ad6cfa7226f96bbe9470b80cdffaad2197757ad70af7b1fe40d6.gz
-rw-rw-r-- 1 warner warner   4387551 Jul  3 13:43 21fe3986f6a9149d2da9b6e2578297b7a29b2bdce72de324db1ff090b4c6c07a.gz
-rw-rw-r-- 1 warner warner 203783515 Jul  3 13:44 save-raw-A5B38x.xss
warner@testnet-monitor-2:~/stuff/agoric/tlg-agorictest-16$
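
(That save-raw-A5B38x.xss temp file is 203783515 bytes, about 194MiB, which lines up with the ~192MiB allocate figure from the fatal delivery: the raw heap image was mid-write when the process died, and was never compressed and renamed to its content hash. A startup sweep along these lines would clear such debris; this is a sketch with assumed naming conventions, not the real snapStore logic:)

```js
// Sketch: remove abandoned in-progress snapshot files at startup.
// Assumes completed snapshots are content-addressed <hash>.gz and that
// anything still named save-raw-*.xss was orphaned mid-write (e.g. by an
// OOM kill); the real snapStore may manage temp files differently.
const fs = require('fs');
const path = require('path');

function cleanStaleSnapshots(snapshotDir) {
  for (const name of fs.readdirSync(snapshotDir)) {
    if (name.startsWith('save-raw-') && name.endsWith('.xss')) {
      fs.unlinkSync(path.join(snapshotDir, name));
    }
  }
}

cleanStaleSnapshots(
  path.join(process.env.HOME, '.ag-chain-cosmos/data/ag-cosmos-chain-state/xs-snapshots'),
);
```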

The entire block was aborted, so I'd expect to see the kernel LMDB entry still pointing at the previous snapshot. The transcript sqlite file is appended as deliveries are finished, but the starting-offset pointer lives in LMDB and that should still have the same value as it did before the block was started (i.e. the transcript has "future echoes" in the tail, which will be ignored at next startup).

This doesn't explain the segfault: in the new launch, the zoe process should have been reloaded from the previous (complete) snapshot, and replayed the transcript from the previous starting point. There would have been fewer than 200 deliveries to replay (because any other zoe deliveries during that block were abandoned too).
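
In outline, the restart path that segfaulted looks like this (a sketch with approximate names; replayOneDelivery and getLastSnapshot appear in the real code later in this thread, the rest is assumed):

```js
// Sketch of the vat-restart sequence described above (approximate names).
// The SIGSEGV happened somewhere in here: either in the restore step or in
// the first replayed delivery.
async function restoreVat(vatKeeper, manager) {
  // LMDB still holds the pre-abort record, pointing at the last *complete* snapshot
  const lastSnapshot = vatKeeper.getLastSnapshot();
  if (lastSnapshot) {
    await manager.restoreSnapshot(lastSnapshot.snapshotID); // assumed method name
  }
  // replay only the deliveries after the snapshot point; the "future echoes"
  // appended during the aborted block lie past the LMDB end pointer and are skipped
  const startPos = lastSnapshot ? lastSnapshot.startPos : 0;
  for (const entry of vatKeeper.getTranscript(startPos)) { // assumed accessor
    await manager.replayOneDelivery(entry.d, entry.syscalls); // entry shape assumed
  }
}
```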

cc @dckc @mhofman @FUDCo

Priority: medium. We need to understand why this happened, and fix it, but it can wait a week or two, as long as we make sure it doesn't happen during the next testnet exercise.

dckc commented 3 years ago

@warner please let me know (offline) how to log in to testnet-monitor-2.

If an ssh key helps, please use the 1st ssh key in https://github.com/dckc.keys

dckc commented 3 years ago

Also, could you point me to the tools for running just 1 vat from a slogfile?

p.s. https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/bin/extract-transcript-from-kerneldb.js and https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/bin/replay-transcript.js

perhaps somewhat dusty.

dckc commented 3 years ago

ok... updated extract-transcript-from-kerneldb.js in a90b3e0b8 and 83c916d61:

warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$  node bin/extract-transcript-from-kerneldb.js ~/.ag-chain-cosmos/data/ag-cosmos-chain-state zoe

extracting transcript for vat v10 into transcript-v10.sst
options: {
  vatParameters: { zcfBundleName: 'zcf' },
  description: 'static name=zoe',
  name: 'zoe',
  managerType: 'xs-worker'
}
vatParameters: { zcfBundleName: 'zcf' }
64792 transcript entries
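
(A quick tail-check of the extracted file, assuming the extractor writes one JSON entry per line; a streaming read handles a file of this size, hypothetical snippet:)

```js
// Count the extracted transcript entries and peek at the final one
// (assumes one JSON entry per line in transcript-v10.sst).
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('transcript-v10.sst'),
});
let count = 0;
let last;
rl.on('line', line => {
  count += 1;
  last = line;
});
rl.on('close', () => {
  console.log(`${count} entries`);
  console.log(last.slice(0, 200)); // the tail entry, truncated for display
});
```
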
dckc commented 3 years ago

warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$ node -r esm ~/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js transcript-v10.sst 
argv [ 'transcript-v10.sst' ]
replay-one-vat.js transcript.sst
using transcript transcript-v10.sst
manager created
delivery 3: ["message","o+0",{"method":"buildZoe","args":{"body":"[{\"@qclass\":\"slot\",\"iface\":\"Alleged: vatAdminService\",\"index\":0}]","slots":["o-50"]},"result":"p-60"}]
RUN ERR (Error#1)
Error#1: delivery replay with no transcript
  at makeError (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2572:17)
  at fail (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2700:20)
  at baseAssert (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2718:13)
  at Object.replayOneDelivery (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/src/kernel/vatManager/manager-helper.js:166:5)
  at replay (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js:151:21)
  at processTicksAndRejections (internal/process/task_queues.js:93:5)
  at async run (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js:170:3)

p.s. we (@michaelfig and I) used these tweaks to get it running:

diff --git a/packages/SwingSet/bin/replay-transcript.js b/packages/SwingSet/bin/replay-transcript.js
index 2b1b76b1e..77aa3f2de 100644
--- a/packages/SwingSet/bin/replay-transcript.js
+++ b/packages/SwingSet/bin/replay-transcript.js
@@ -46,6 +46,7 @@ async function replay(transcriptFile, worker = 'xs-worker') {
   const fakeKernelKeeper = {
     provideVatKeeper: _vatID => ({
       addToTranscript: () => undefined,
+      getLastSnapshot: () => undefined,
     }),
   };
   const kernelSlog = { write() {} };
@@ -127,6 +128,7 @@ async function replay(transcriptFile, worker = 'xs-worker') {
         vatConsole: console,
         vatParameters,
         compareSyscalls,
+        useTranscript: true,
       };
       const vatSyscallHandler = undefined;
       manager = await factory.createFromBundle(
@@ -167,7 +169,7 @@ async function run() {
   }
   const [transcriptFile] = args;
   console.log(`using transcript ${transcriptFile}`);
-  await replay(transcriptFile, 'local');
+  await replay(transcriptFile, 'xs-worker');
 }

 run().catch(err => console.log('RUN ERR', err));
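
(Reading the diff against the failure above: the Error#1 trace shows replayOneDelivery asserting in manager-helper.js, and useTranscript: true is what makes the manager keep a transcript to replay against; the getLastSnapshot stub satisfies a lookup the manager factory now performs; and the last hunk switches the hardcoded worker from 'local' to 'xs-worker' so the replay actually exercises xsnap.)
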
dckc commented 3 years ago

not enough transcript entries to reproduce the bug?

The log above shows "deliveryNum":64801 but the kernel DB seems to be short 9 entries (presumably the deliveries from the aborted block, which were rolled back as described above):

warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$ node bin/extract-transcript-from-kerneldb.js ~/.ag-chain-cosmos/data/ag-cosmos-chain-state/
all vats:
v1 : bank       (64883 deliveries)
...
v10 : zoe       (64792 deliveries)

p.s. replay from snapshot: oh, the problem here was when restarting from a snapshot; replay-transcript.js doesn't know how to do that, yet...
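
A snapshot-aware replay would need roughly this shape (assumed contract, inferred from the getLastSnapshot stub in the diff above; the real manager/snapStore wiring may differ):

```js
// Sketch: let replay-transcript.js start from a heap snapshot instead of
// delivery 0. lastSnapshot is the { snapshotID, startPos } record that the
// kernel DB keeps per vat (shape assumed from the stub above).
function makeSnapshotAwareKeeper(lastSnapshot) {
  return {
    provideVatKeeper: _vatID => ({
      addToTranscript: () => undefined,
      // was: getLastSnapshot: () => undefined
      getLastSnapshot: () => lastSnapshot,
    }),
  };
}
// The manager would then restore the heap image named by snapshotID before
// replaying, and the transcript fed to replayOneDelivery must begin at
// lastSnapshot.startPos rather than at delivery 0.
```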