Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
322 stars 204 forks source link

slogfile 'crankNum' is duplicated when BOYD+heap-snapshot happens #8264

Open warner opened 1 year ago

warner commented 1 year ago

Describe the bug

While building tools to create a DB from slog data, I discovered that we aren't incrementing the crankNum counter for the BOYD we do just before a heap-snapshot write. The first example of this on mainnet is in the bootstrap block, where we create the first heap-snapshot on v1-bootstrap:

{"type":"crank-start","crankType":"delivery","crankNum":9,"message":{"type":"startVat","vatID":"v4","vatParameters":{"body":"#{}","slots":>
{"type":"deliver","crankNum":9,"vatID":"v4","deliveryNum":1,"replay":false,"kd":["startVat",{"body":"#{}","slots":[]}],"vd":["startVat",{">
{"type":"deliver-result","crankNum":9,"vatID":"v4","deliveryNum":1,"replay":false,"dr":["ok",null,{"meterType":"xs-meter-20","currentHeapC>
{"type":"crank-finish","crankNum":9,"crankhash":"ba08a6b9a98097d4fcd77042d0885968beadc96cb0b83bf4e63791b3790ac6ac","activityhash":"7d056be>
{"type":"crank-start","crankType":"delivery","crankNum":10,"message":{"type":"startVat","vatID":"v5","vatParameters":{"body":"#{}","slots">
{"type":"deliver","crankNum":10,"vatID":"v5","deliveryNum":1,"replay":false,"kd":["startVat",{"body":"#{}","slots":[]}],"vd":["startVat",{>
{"type":"deliver-result","crankNum":10,"vatID":"v5","deliveryNum":1,"replay":false,"dr":["ok",null,{"meterType":"xs-meter-20","currentHeap>
{"type":"crank-finish","crankNum":10,"crankhash":"c657d3664e9db08dd778ceb4daf9dcfb42af642be62d0dbd241817a14a356f3f","activityhash":"b695b9>
{"type":"crank-start","crankType":"delivery","crankNum":11,"message":{"type":"send","target":"ko21","msg":{"methargs":{"body":"#[\"bootstr>
{"type":"deliver","crankNum":11,"vatID":"v1","deliveryNum":2,"replay":false,"kd":["message","ko21",{"methargs":{"body":"#[\"bootstrap\",[{>
{"type":"deliver-result","crankNum":11,"vatID":"v1","deliveryNum":2,"replay":false,"dr":["ok",null,{"meterType":"xs-meter-20","currentHeap>
{"type":"deliver","crankNum":11,"vatID":"v1","deliveryNum":3,"replay":false,"kd":["bringOutYourDead"],"vd":["bringOutYourDead"],"time":168>
{"type":"deliver-result","crankNum":11,"vatID":"v1","deliveryNum":3,"replay":false,"dr":["ok",{"remotableRefCounts":0,"vrefRecognizers":1,>
{"type":"heap-snapshot-load","vatID":"v1","snapPos":4,"hash":"439f91d783827cbbc7043e85ad93d588bbc4552328cb8f2182be0615073d9688","time":168>
{"type":"heap-snapshot-save","vatID":"v1","snapshotID":"439f91d783827cbbc7043e85ad93d588bbc4552328cb8f2182be0615073d9688","uncompressedSiz>
{"type":"crank-finish","crankNum":11,"crankhash":"dd6762d14a0ebbe67e7a853f3bebbd652cef7b3f6be48792f0a451b31dc2ea44","activityhash":"080878>

The first delivery to v1 was a startVat on crankNum: 6. The second is the delivery of the bootstrap method on crankNum: 11. The 'initial snapshot threshold" is 2, which means the bootstrap delivery triggers a heap snapshot sequence. That sequence starts by doing a BOYD delivery, and then writes out the snapshot.

The heap snapshot sequence code did not increment crankNum first, so the slog entries for the BOYD also get crankNum: 11.

This will happen every time we trigger the heap snapshot sequence, which is both on the initial threshold after an incarnation is launched, and then every interval deliveries afterwards. Both the threshold and the interval are configurable as kernel options, but they default to 2 and 100.

The fix will probably be for code in processUpgradeVat to call kernelKeeper.incrementCrankNumber() just after it calls deliverAndLogToVat.

In the meantime, any tools we build to analyze slogfiles must be prepared to handle duplicate cranknums. I'm not sure how they should do that.

warner commented 2 months ago

Hm, my writeup said processUpgradeVat, but I think I meant something else. It does kinda depend upon our definition of a crank. I generally make a distinction between kernel cranks, and vat deliveries. Each vat delivery happens inside a single kernel crank, but not all kernel cranks involve vat deliveries.

During vat upgrades, we'll perform two deliveries: one BOYD to the old worker, then a startVat to the new worker. I think that's ok: we don't need separate crank numbers for them (and it would look weird to have a crank-finish + crank-start in the middle of the processUpgradeVat crank).

The issue I was concerned about is when we take a heap snapshot, which only happens outside of vat upgrades. We make a normal delivery that increments our snapshot counter beyond snapshotInterval (or, once #8980 is done, that increments reapDirt beyond one of the reapDirtThreshold limits), then we deliver a BOYD, then we write out a snapshot, then we load a new worker from that snapshot. The weirdness is that both the normal delivery and the BOYD appear in the same crank.

I'm not positive I feed the need to change that. It's less justifiable than the processUpgradeVat case, since vat-upgrade is fairly indivisible, whereas doing an extra BOYD could conceivably be treated as an extra crank (eg if crossing reapDirtThreshold pushed the vatID onto a high-priority queue, so the very next crank performs a BOYD and a heap snapshot save+load cycle). That wouldn't be a bad way to implement this, and would split the deliveries up into separate cranks, but it's not how we're currently implementing it. (currently, the kernel calls vatWarehouse.maybeSaveSnapshot() as part of processing the CrankResults, just before processing decrementReapCount).

There are currently about 45 lines between maybeSaveSnapshot and the code which increments the crank number and slogs the crank-finish line. If we really wanted to treat this as two separate cranks, we'd need a second (conditional) incrementCrankNumber, along with an extra crank-finish and then a crank-start to match, and we'd want to cycle the activityhash too.. all of which sounds messy.

So I'm going to put this on the back burner, until/unless we decide to change the way maybeSaveSnapshot works (perhaps as part of the snapshot-based-on-computrons part of #6786).