Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
305 stars 194 forks source link

Upgrade verification automation #9652

Open toliaqat opened 2 weeks ago

toliaqat commented 2 weeks ago

What is the Problem Being Solved?

Brian manually verifies vat upgrades and their initialization post-upgrade by examining the database and logs. We aim to automate this process by developing tests and verification scripts. These automated scripts will run on the upgraded node, providing a summary of post-upgrade activities. The userland team can then review this summary to confirm the upgrade was successful.

Description of the Design

Security Considerations

Scaling Considerations

Test Plan

Upgrade Considerations

warner commented 2 weeks ago

One starting point here is just to look at all the console.log messages emitted by the vats during upgrade. These are copied to the validator's stdout (or maybe stderr), but they're kinda buried among cosmos-level messages ("I received a block, yay!" "I voted!" "I committed!" "I'm trying to talk to peer 123" etc), as well as serialized Error objects from our marshalling layer, and non-vat-specific kernel logs like "swingset started".

Given a slogfile, the following jq command will emit just the vat's console.log strings:

% cat slogfile_1720030203-copy.json |jq -c 'select(.type=="console") | [.vatID, .args]'
eQueueLengthMax\\\":543,\\\"promiseQueuesLength\\\":0,\\\"promiseQueuesLengthUp\\\":22991,\\\"promiseQueuesLengthDown\\\":22991,\\\"promiseQueuesLengthMax\\\":72}\"],[\"kv.kp5616118.refCount\",\"1\"],[\"kv.runQueue\",\"[23924840,23925341]\"],[\"kv.runQueue.23924839\"],[\"kv.v57.c.kp5616118\"],[\"kv.v57.c.p+9587\"],[\"kv.v57.reapCountdown\",\"335\"],[\"transcript.v57.current\",\"{\\\"vatID\\\":\\\"v57\\\",\\\"startPos\\\":11115,\\\"endPo["v1",["evaluateBundleCap","{ manifestBundleRef: { bundleID: 'b1-5d2a103ffa584596135c8b2eec34893516cdefdeec8d953e012969db692d8fdc54e2a8f151c7ea59478c453f6be6b3831c8cfd6b3419e424d9c2eedaeace6f25' }, manifestGetterName: 'getManifestForZoe', vatAdminSvc: Promise [Promise] {} }"]]
["v1",["execute","{ manifestGetterName: 'getManifestForZoe', bundleExports: [ 'getManifestForZoe', 'upgradeZcf' ] }"]]
["v1",["coreProposal:","upgradeZcf"]]
["v1",["ZOE BUNDLE ID: ","b1-c97f59ab2855e0d1d8f3a86532acc26878af0de2217e40cf00c4d8def4f65bb9f2253e4c53444e2bb9d1124a6904f12325d72ee0f22b5882e77424c831348b8d"]]
["v57",["UnhandledPromiseRejectionWarning:","{ incarnationNumber: 1, name: 'vatUpgraded', upgradeMessage: 'vat upgraded' }"]]
["v57",["UnhandledPromiseRejectionWarning:","{ incarnationNumber: 1, name: 'vatUpgraded', upgradeMessage: 'vat upgraded' }"]]
...

During an upgrade test, when our validator/follower node has finished executing the upgrade block, but is waiting for enough other validators to come back on line (i.e. while no new blocks are being produced), we should copy the slogfile so far off to an analysis area, and run that jq command on it, and share the console.log messages with the authors of vats involved in the upgrade. Those folks can then chime in on "hey, that log message is surprising" or "hey, why didn't I see the log message I expected".

As an example, during the emerynet upgrade16 review, we saw these console messages:

["v1",["upgradeProvisioningVat..."]]
["v1",["provisioning","{ adminNode: Object [Alleged: adminNode] {}, root: Object [Alleged: undefined] {} }","b1-16b1b19fa622c1cd571fad516a17a2094245d28b10aed783cfe1c0688c7b8a38417e14041310de495d839f0a5cb0d453d3afe8d4d4d6
a89a64b7adb36b91d9f9"]]
["v1",["failed deposit to stATOM-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["failed deposit to stOSMO-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["failed deposit to stOSMO2-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["failed deposit to stkATOM-USD price feed member agoric1zj6vrrrjq4gsyr9lw7dplv4vyejg3p8j2urm82","{ incarnationNumber: 0, name: 'vatUpgraded', upgradeMessage: 'fix namesByAddress' }"]]
["v1",["{ name: 'provisioning', opts: { upgradeMessage: 'fix namesByAddress' }, incarnation: { incarnationNumber: 1 } }"]]
["v1",["namesByAddress: new Promise"]]

We concluded that a previous core-eval proposal introduced these four denominations, but their oracles were never registered (since emerynet is pretty stripped-down and lacks all the oracle infrastructure as mainnet). So the agoricNames vat never resolved some of the promises being awaited by the new-denom proposal, and those error messages have been waiting around all this time. Then, in upgrade16, we upgraded vat-agoricNames, causing those promises to be rejected (disconnected), causing those old promises to finally reject, and triggering the error messages. So, not something to worry about, but surprising to see, and it was nice to confirm that it wasn't a problem being caused by u16.