Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
303 stars 191 forks source link

fix(swingset): retire unreachable orphans #9604

Closed warner closed 3 days ago

warner commented 3 days ago

fix(swingset): retire unreachable orphans

If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called syscall.abandonExports, or was upgraded and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf.

We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing koNN.refCount) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj.

closes #7212

cloudflare-pages[bot] commented 3 days ago

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: 5e34f0f
Status:⚡️  Build in progress...

View logs

warner commented 3 days ago

Security Considerations

none

Scaling Considerations

none

Documentation Considerations

None needed, this is internal to the kernel. No userspace-reliant behavior is changing (userspace vats cannot sense GC anyways).

Testing Considerations

The swingset unit tests included in the PR should be sufficient.

Upgrade Considerations

This changes kernel behavior, to avoid leaking object state. Deployed kernels may have encountered this situation already, in which case they will still have non-retired abandoned non-reachable objects. This PR makes no attempt to remediate old state like that: those objects will continue to exist until the recognizing vat is terminated or chooses to stop recognizing the object (with syscall.retireImports, which is unlikely).

This is merely a storage consumption concern, not a correctness concern. The importing (recognizing) vat doesn't know or care why the object remains unretired: perhaps some other vat can still reach it, or the exporting vat itself. And the previously-exporting vat is either dead or knows nothing about the object, so it doesn't care either. The kernel is the only one in a position to retire the object. The consequence is a few extra DB entries, and whatever is being retained by the WeakMap or WeakMapStore in the importing vat.

According to the notes in the original issue (#7212), the mainnet kernel has only one object this this state (as of six months ago), and the state it keeps alive is not significant. However, when we delete the old price-feed vats, it will abandon many hundreds of thousands of non-reachable objects, so it is important that we land this fix before deleting those vats.

warner commented 3 days ago

abandoning this one in favor of the original #8695

mergify[bot] commented 3 days ago

⚠️ The sha of the head commit of this PR conflicts with #8695. Mergify cannot evaluate rules on this PR. ⚠️