**warner** opened this issue 3 years ago
@warner please let me know (offline) how to log in to testnet-monitor-2
If an ssh key helps, please use the 1st ssh key in https://github.com/dckc.keys
Also, could you point me to the tools for replaying just one vat from a slogfile?
p.s. there are https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/bin/extract-transcript-from-kerneldb.js and https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/bin/replay-transcript.js, though they are perhaps somewhat dusty.
OK, I updated `extract-transcript-from-kerneldb.js` (commits a90b3e0b8 and 83c916d61):

```
warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$ node bin/extract-transcript-from-kerneldb.js ~/.ag-chain-cosmos/data/ag-cosmos-chain-state zoe
extracting transcript for vat v10 into transcript-v10.sst
options: {
  vatParameters: { zcfBundleName: 'zcf' },
  description: 'static name=zoe',
  name: 'zoe',
  managerType: 'xs-worker'
}
vatParameters: { zcfBundleName: 'zcf' }
64792 transcript entries
```
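For context, the extracted `.sst` file is a flat sequence of transcript entries that `replay-transcript.js` can stream back. A minimal sketch of that one-JSON-object-per-line shape, with invented helper names (`writeTranscriptEntry`, `readTranscriptEntries` are illustrative, not the real SwingSet API):

```javascript
// Hypothetical sketch of a line-oriented transcript file: one JSON
// object per delivery, recording what was delivered and the syscalls
// it made. Helper names here are invented for illustration.
const lines = [];

function writeTranscriptEntry(deliveryNum, d, syscalls) {
  // one line per delivery
  lines.push(JSON.stringify({ deliveryNum, d, syscalls }));
}

function readTranscriptEntries(text) {
  // the replay side streams the file back, line by line
  return text
    .split('\n')
    .filter(l => l.length > 0)
    .map(l => JSON.parse(l));
}

writeTranscriptEntry(3, ['message', 'o+0', { method: 'buildZoe' }], []);
const entries = readTranscriptEntries(lines.join('\n'));
```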
```
warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$ node -r esm ~/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js transcript-v10.sst
argv [ 'transcript-v10.sst' ]
replay-one-vat.js transcript.sst
using transcript transcript-v10.sst
manager created
delivery 3: ["message","o+0",{"method":"buildZoe","args":{"body":"[{\"@qclass\":\"slot\",\"iface\":\"Alleged: vatAdminService\",\"index\":0}]","slots":["o-50"]},"result":"p-60"}]
RUN ERR (Error#1)
Error#1: delivery replay with no transcript
  at makeError (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2572:17)
  at fail (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2700:20)
  at baseAssert (/home/warner/stuff/agoric/agoric-sdk/node_modules/ses/dist/ses.cjs:2718:13)
  at Object.replayOneDelivery (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/src/kernel/vatManager/manager-helper.js:166:5)
  at replay (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js:151:21)
  at processTicksAndRejections (internal/process/task_queues.js:93:5)
  at async run (/home/warner/stuff/agoric/agoric-sdk/packages/SwingSet/bin/replay-transcript.js:170:3)
```
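The failure mode can be sketched in miniature: the manager helper only wires up replay support when the vat is created with `useTranscript`, so a harness that omits the option hits the assertion on the first replayed delivery. This is a simplified, assumed model, not the real `manager-helper.js` code:

```javascript
// Simplified sketch (assumed, not the real manager-helper.js) of the
// guard behind "delivery replay with no transcript": replay machinery
// exists only when useTranscript is requested at manager creation.
function makeManager({ useTranscript = false } = {}) {
  const transcript = useTranscript ? [] : undefined;
  return {
    replayOneDelivery(delivery) {
      if (!transcript) {
        // stands in for the baseAssert() seen in the stack trace
        throw Error('delivery replay with no transcript');
      }
      transcript.push(delivery);
      return transcript.length; // deliveries replayed so far
    },
  };
}
```

This is why the diff below adds `useTranscript: true` to the options the harness passes in.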
p.s. we (@michaelfig and I) used these tweaks to get it running:
```diff
diff --git a/packages/SwingSet/bin/replay-transcript.js b/packages/SwingSet/bin/replay-transcript.js
index 2b1b76b1e..77aa3f2de 100644
--- a/packages/SwingSet/bin/replay-transcript.js
+++ b/packages/SwingSet/bin/replay-transcript.js
@@ -46,6 +46,7 @@ async function replay(transcriptFile, worker = 'xs-worker') {
   const fakeKernelKeeper = {
     provideVatKeeper: _vatID => ({
       addToTranscript: () => undefined,
+      getLastSnapshot: () => undefined,
     }),
   };
   const kernelSlog = { write() {} };
@@ -127,6 +128,7 @@ async function replay(transcriptFile, worker = 'xs-worker') {
     vatConsole: console,
     vatParameters,
     compareSyscalls,
+    useTranscript: true,
   };
   const vatSyscallHandler = undefined;
   manager = await factory.createFromBundle(
@@ -167,7 +169,7 @@ async function run() {
   }
   const [transcriptFile] = args;
   console.log(`using transcript ${transcriptFile}`);
-  await replay(transcriptFile, 'local');
+  await replay(transcriptFile, 'xs-worker');
 }
 run().catch(err => console.log('RUN ERR', err));
```
The log above shows `"deliveryNum":64801`, but the kernel DB seems to be short 9 entries (64801 - 64792 = 9):
```
warner@testnet-monitor-2:~/ag-3451/packages/SwingSet$ node bin/extract-transcript-from-kerneldb.js ~/.ag-chain-cosmos/data/ag-cosmos-chain-state/
all vats:
v1 : bank (64883 deliveries)
...
v10 : zoe (64792 deliveries)
```
p.s. oh: the problem here arises when restarting from a snapshot; `replay-transcript.js` doesn't know how to do that yet.
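A snapshot-aware replay would need to skip the transcript prefix that is already baked into the restored heap. A rough sketch of that selection step, where `lastSnapshot` and its `startPos` field are assumed names for illustration, not the real vatKeeper record shape:

```javascript
// Hedged sketch: which transcript entries still need replaying when a
// vat is restored from a heap snapshot. `lastSnapshot.startPos` is an
// invented field name, not necessarily the real SwingSet API.
function selectEntriesToReplay(allEntries, lastSnapshot) {
  if (!lastSnapshot) {
    return allEntries; // no snapshot: replay from delivery 0, as today
  }
  // deliveries before startPos are already captured in the heap snapshot
  return allEntries.filter(e => e.deliveryNum >= lastSnapshot.startPos);
}

const all = [64600, 64700, 64792, 64801].map(n => ({ deliveryNum: n }));
const toReplay = selectEntriesToReplay(all, { startPos: 64792 });
```

With no snapshot the whole transcript is replayed, matching the tool's current behavior; with one, only the post-snapshot tail is delivered.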
On my `testnet-monitor-2` node (a follower, non-validator, following the `agorictest-16` testnet), the kernel process was killed by the host machine's OOM killer during the processing of block 84107, around 6:44am PDT on Sat 03-Jul-2021. The host machine has 4GiB of RAM and 0.5GiB of swap, and the swingset kernel process (Node.js) was recorded as using 16.5GiB VmSize and 1.876GiB RSS, so it's not particularly surprising that the machine ran out of memory, and the Node.js process was the best target to kill. The OOM killer was provoked by a (failed?) allocation attempt by an `xsnap` process; the kernel does not record the arguments, but it does record all processes considered for termination, and I can see that the provoking `xsnap` was using more memory than any of the others (VmSize 60591 pages, vs 15610 for most of the rest), so it's likely that vat-zoe was the trigger. The swingset kernel was killed just after finishing a delivery to vat-zoe.

But when restarting this node about 3.5 hours later, the xsnap process hosting vat-zoe segfaulted during replay, which is surprising. The host's `kern.log` recorded a brief detail of the segfault, and the restart-time `chain.slog` captured the final entries before the crash.

I have a copy of the swingset state (`~/.ag-chain-cosmos/data/ag-cosmos-chain-state/`), so I'm hoping to be able to reproduce this under `gdb`, especially because the crash appears to have happened either during xsnap restore-from-snapshot or during the first replayed-transcript delivery, and should not be dependent upon e.g. the rest of the chain feeding in more transactions.

Note that the deliveryNum in question (64801) means the kernel would have instructed the xsnap process to write a heap snapshot just after the delivery completed. So it's reasonable to believe that the snapshot-writing process would have experienced a `malloc` failure (and maybe worse). Indeed the snapshot directory shows a partially-written file.

The entire block was aborted, so I'd expect to see the kernel LMDB entry still pointing at the previous snapshot. The transcript `sqlite` file is appended to as deliveries are finished, but the starting-offset pointer lives in LMDB, and that should still have the same value as it did before the block was started (i.e. the transcript has "future echoes" in its tail, which will be ignored at the next startup).

This doesn't explain the segfault: in the new launch, the zoe process should have been reloaded from the previous (complete) snapshot and should have replayed the transcript from the previous starting point. There would have been fewer than 200 deliveries to replay (because any other zoe deliveries during that block were abandoned too).
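The "future echoes" rule can be illustrated in miniature: the append-only transcript file may hold entries written during the aborted block, but only the prefix up to the LMDB-committed end pointer is trusted at the next startup. Names here are invented for illustration:

```javascript
// Illustrative sketch of the "future echoes" rule: the transcript file
// may contain entries appended during an aborted block, but only the
// prefix below the LMDB-committed end pointer is replayed at the next
// startup. Function and variable names are invented.
function trustedTranscript(appendOnlyEntries, committedEndPos) {
  // anything past the committed pointer is an ignored "future echo"
  return appendOnlyEntries.slice(0, committedEndPos);
}

const transcript = ['d0', 'd1', 'd2', 'd3', 'd4']; // d3, d4: aborted block
const committedEndPos = 3; // what LMDB still says after the abort
const replayed = trustedTranscript(transcript, committedEndPos);
```

The abort never rewinds the transcript file itself; durability comes from the fact that only the LMDB pointer is consulted.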
cc @dckc @mhofman @FUDCo
Priority: medium. We need to understand why this happened and fix it, but it can wait a week or two, as long as we make sure it doesn't happen during the next testnet exercise.