Open toliaqat opened 2 weeks ago
One starting point here is just to look at all the console.log
messages emitted by the vats during upgrade. These are copied to the validator's stdout (or maybe stderr), but they're kinda buried among cosmos-level messages ("I received a block, yay!" "I voted!" "I committed!" "I'm trying to talk to peer 123" etc), as well as serialized Error objects from our marshalling layer, and non-vat-specific kernel logs like "swingset started".
Given a slogfile, the following jq
command will emit just the vat's console.log
strings:
% cat slogfile_1720030203-copy.json |jq -c 'select(.type=="console") | [.vatID, .args]'
eQueueLengthMax\\\":543,\\\"promiseQueuesLength\\\":0,\\\"promiseQueuesLengthUp\\\":22991,\\\"promiseQueuesLengthDown\\\":22991,\\\"promiseQueuesLengthMax\\\":72}\"],[\"kv.kp5616118.refCount\",\"1\"],[\"kv.runQueue\",\"[23924840,23925341]\"],[\"kv.runQueue.23924839\"],[\"kv.v57.c.kp5616118\"],[\"kv.v57.c.p+9587\"],[\"kv.v57.reapCountdown\",\"335\"],[\"transcript.v57.current\",\"{\\\"vatID\\\":\\\"v57\\\",\\\"startPos\\\":11115,\\\"endPo["v1",["evaluateBundleCap","{ manifestBundleRef: { bundleID: 'b1-5d2a103ffa584596135c8b2eec34893516cdefdeec8d953e012969db692d8fdc54e2a8f151c7ea59478c453f6be6b3831c8cfd6b3419e424d9c2eedaeace6f25' }, manifestGetterName: 'getManifestForZoe', vatAdminSvc: Promise [Promise] {} }"]]
["v1",["execute","{ manifestGetterName: 'getManifestForZoe', bundleExports: [ 'getManifestForZoe', 'upgradeZcf' ] }"]]
["v1",["coreProposal:","upgradeZcf"]]
["v1",["ZOE BUNDLE ID: ","b1-c97f59ab2855e0d1d8f3a86532acc26878af0de2217e40cf00c4d8def4f65bb9f2253e4c53444e2bb9d1124a6904f12325d72ee0f22b5882e77424c831348b8d"]]
["v57",["UnhandledPromiseRejectionWarning:","{ incarnationNumber: 1, name: 'vatUpgraded', upgradeMessage: 'vat upgraded' }"]]
["v57",["UnhandledPromiseRejectionWarning:","{ incarnationNumber: 1, name: 'vatUpgraded', upgradeMessage: 'vat upgraded' }"]]
...
During an upgrade test, when our validator/follower node has finished executing the upgrade block, but is waiting for enough other validators to come back on line (i.e. while no new blocks are being produced), we should copy the slogfile so far off to an analysis area, and run that jq
command on it, and share the console.log
messages with the authors of vats involved in the upgrade. Those folks can then chime in on "hey, that log message is surprising" or "hey, why didn't I see the log message I expected".
As an example, during the emerynet upgrade16 review, we saw these console messages:
["v1",["upgradeProvisioningVat..."]]
["v1",["provisioning","{ adminNode: Object [Alleged: adminNode] {}, root: Object [Alleged: undefined] {} }","b1-16b1b19fa622c1cd571fad516a17a2094245d28b10aed783cfe1c0688c7b8a38417e14041310de495d839f0a5cb0d453d3afe8d4d4d6
a89a64b7adb36b91d9f9"]]
["v1",["failed deposit to stATOM-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["failed deposit to stOSMO-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["failed deposit to stOSMO2-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["failed deposit to stkATOM-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["{ name: 'provisioning', opts: { upgradeMessage: 'fix namesByAddress' }, incarnation: { incarnationNumber: 1 } }"]]
["v1",["namesByAddress: new Promise"]]
We concluded that a previous core-eval proposal introduced these four denominations, but their oracles were never registered (since emerynet is pretty stripped-down and lacks all the oracle infrastructure as mainnet). So the agoricNames vat never resolved some of the promises being awaited by the new-denom proposal, and those error messages have been waiting around all this time. Then, in upgrade16, we upgraded vat-agoricNames, causing those promises to be rejected (disconnected), causing those old promises to finally reject, and triggering the error messages. So, not something to worry about, but surprising to see, and it was nice to confirm that it wasn't a problem being caused by u16.
What is the Problem Being Solved?
Brian manually verifies vat upgrades and their initialization post-upgrade by examining the database and logs. We aim to automate this process by developing tests and verification scripts. These automated scripts will run on the upgraded node, providing a summary of post-upgrade activities. The userland team can then review this summary to confirm the upgrade was successful.
Description of the Design
Security Considerations
Scaling Considerations
Test Plan
Upgrade Considerations