I captured two different slogfiles of non-crashing runs, and they show different GC behavior, which explains why the failure is intermittent: V8 is being weird about GC (our old friend #3240), and some runs see different WeakRef probes or finalization notifications than others. But liveslots is supposed to be tolerant of that, as long as the engine is consistent, so this points to either a bug in liveslots, or some interesting failure mode of the V8 GC sensors and how liveslots is using them.
Possibly related:
I saw the same failure in CI under Node.js v18, ruling out that v20 is the only culprit:
----- RemoteChainFacade.5 2 making a RemoteChainFacade
error during syscall translation: (Error#7)
Error#7: syscall.retireImports but o-54 is still reachable
at eval (.../swingset-vat/src/kernel/vatTranslator.js:437:7)
at Array.map (<anonymous>)
at translateRetireImports (.../swingset-vat/src/kernel/vatTranslator.js:431:19)
at Object.vatSyscallToKernelSyscall (.../swingset-vat/src/kernel/vatTranslator.js:620:8)
at vatSyscallHandler (.../swingset-vat/src/kernel/kernel.js:1563:17)
at syscallHandler (.../swingset-vat/src/kernel/vat-warehouse.js:42:12)
at syscallFromWorker (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:164:11)
at doSyscall (.../swingset-vat/src/supervisors/supervisor-helper.js:72:3)
at Object.retireImports (.../swingset-vat/src/supervisors/supervisor-helper.js:104:24)
at scanForDeadObjects (.../swingset-liveslots/src/liveslots.js:338:9)
at async bringOutYourDead (.../swingset-liveslots/src/liveslots.js:1549:1)
at async Object.deliver (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:146:14)
at async Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:537:22)
at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:425:22)
at async processBringOutYourDead (.../swingset-vat/src/kernel/kernel.js:666:14)
at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:1298:15)
at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1337:20)
at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)
error during vat dispatch() of bringOutYourDead (Error#8)
Error#8: syscall.retireImports failed: syscall translation error: prepare to die
at doSyscall (.../swingset-vat/src/supervisors/supervisor-helper.js:87:7)
at Object.retireImports (.../swingset-vat/src/supervisors/supervisor-helper.js:104:24)
at scanForDeadObjects (.../swingset-liveslots/src/liveslots.js:338:9)
at async bringOutYourDead (.../swingset-liveslots/src/liveslots.js:1549:1)
at async Object.deliver (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:146:14)
at async Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:537:22)
at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:425:22)
at async processBringOutYourDead (.../swingset-vat/src/kernel/kernel.js:666:14)
at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:1298:15)
at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1337:20)
at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)
kernel terminating vat v15 (failure=true)
##### KERNEL PANIC: critical vat v15 failed #####
controller.run() failure (Error#9)
Error#9: #{"#error":"syscall translation error: prepare to die","name":"Error"}
at terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
at processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)
not ok 12 - orchestration › restart-contracts › basicFlows %ava-dur=17379ms
# start basicFlows
# making offer
# REJECTED from ava test.serial("basicFlows"): (Error#9)
# Error#9: #{"#error":"syscall translation error: prepare to die","name":"Error"}
# at terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
# at processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
# at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
# at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)
  ---
    name: AssertionError
    message: Rejected promise returned by test
    values:
      'Rejected promise returned by test. Reason:': |-
        Error {
          message: '#{"#error":"syscall translation error: prepare to die","name":"Error"}',
        }
      at: |-
        terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
        processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
        async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
        async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)
  ...
ok I think I see a pathway by which a `voAwareWeakMap` getting deleted could put us in a situation where a Presence vref (`o-54`) is in the `possiblyRetiredSet` despite there still being a vdata pillar.

The slog trace of the failing run shows v15 doing a BOYD which does three passes through the `scanForDeadObjects` loop. On the third pass, we do a recognizer check for `o-54` (`ir.o-54`) without first doing a refcount check, which I think is the `possiblyRetiredSet` scan (the second of the three clauses). It didn't show any previous recognizer checks.

For `o-54` to have gotten into `possiblyRetiredSet` after the second pass cleared the set, but before the third pass did the check, would require the `droppedCollectionRegistry` callback to fire during the `gcAndFinalize()` at the top of the third pass. That would call `finalizeDroppedCollection()`, which would be routed to `voAwareWeakMapDelete()`, which would walk all keys of the weakmap and add their vrefs to `possiblyRetiredSet`.
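Roughly, that pathway looks like this (a simplified sketch of the mechanism described above, with invented helper names, not the actual liveslots source):

```js
// Simplified sketch: how a collected voAwareWeakMap can feed its keys'
// vrefs into possiblyRetiredSet between passes. Names are illustrative.
const possiblyRetiredSet = new Set();

// the finalizer runs during some later gcAndFinalize(), whenever the engine
// finally notices the dropped weakmap; each voAwareWeakMap is registered
// when it is created, roughly:
//   droppedCollectionRegistry.register(weakmap, descriptor);
const droppedCollectionRegistry = new FinalizationRegistry(finalizeDroppedCollection);

function finalizeDroppedCollection(descriptor) {
  // routed to the per-collection deleter; for a voAwareWeakMap that is
  // voAwareWeakMapDelete-like behavior
  descriptor.collectionDeleter(descriptor);
}

function voAwareWeakMapDelete(descriptor) {
  for (const vref of descriptor.vrefKeys) {
    // no reachability check here, so a vref that still has a vdata pillar
    // (like o-54) can land in possiblyRetiredSet anyway
    possiblyRetiredSet.add(vref);
  }
  descriptor.vrefKeys.clear();
}
```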
The scenario I'm visualizing is where the vat has a `voAwareWeakMap`, this `o-54` Presence is added as a key, then userspace drops the weakmap. On the next BOYD, the first two passes do their `gcAndFinalize()`, but for some weird V8 reason, that's not enough to shake the thing loose. Only on the third pass does it finally get collected, and the finalizer runs. The third pass' `possiblyRetiredSet` scan notices that `o-54` indeed lacks any recognizers (neither in-RAM weakmap/weakset, nor durable `ir.o-54` keys), and adds it to `importsToRetire`, despite the fact that `o-54` is still reachable (from the `o+d11/19` VOM state).

If the first pass was able to shake it loose, I'd expect `finalizeDroppedCollection` to be called during the first `gcAndFinalize()`, which would cause a recognizer check in the first pass, just before we start deleting the first batch of VOMs. It would still be buggy (we'd be retiring something without first dropping it), but the syscall trace would be different than what I captured from the failing run.
So maybe this condition could be reached without V8 being weird, which raises the concerning possibility that it could be reached from XS. It would require a Presence to be reachable from virtual data and also recognizable from a WeakMap or WeakSet, but not have a RAM pillar. The RAM pillar should be dropped, then a BOYD should be done. Then the WeakMap should be deleted, then we do a second BOYD. (If a single BOYD both processed the RAM-pillar drop and the `possiblyRetiredSet` scan, then the `!deadSet.has(vref)` clause would skip the `isVrefRecognizable()` and the retirement.)

I think the fix will be for the `possiblyRetiredSet` scan to include a reachability check: if the vref is reachable, it must not be added to `importsToRetire`.
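Something like this guard (a sketch only, with stand-in names for the reachability/recognizability checks, not the final patch):

```js
// Sketch of the proposed guard in the possiblyRetiredSet scan; the
// isVrefReachable/isVrefRecognizable names are stand-ins, not the real API.
for (const vref of possiblyRetiredSet) {
  if (!deadSet.has(vref)) {
    if (isVrefReachable(vref)) {
      // still reachable (e.g. a vdata pillar): retiring it now would be the bug
      continue;
    }
    if (!isVrefRecognizable(vref)) {
      importsToRetire.push(vref); // safe: neither reachable nor recognizable
    }
  }
}
possiblyRetiredSet.clear();
```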
I don't understand the `!deadSet.has(vref)` clause in that loop. My vague memory is that it exists to avoid duplicating work, but it smells like it's also meant to guard against the not-dead-yet case.
I was able to write a unit test that triggers this... definitely a liveslots bug.
Ok I think the requirements to trigger the bug are:

- an imported Presence that is reachable from vdata but whose RAM pillar has already been dropped (and processed by an earlier BOYD, which did not emit a `dropImports` because of the vdata pillar)
- the Presence is a key of a `voAwareWeakMap`/`voAwareWeakSet`, which makes it recognizable
- the weak collection is then deleted, and a later BOYD processes the recognizability decref from the deleted weak collection, and emits the `retireImports` (which is bogus because the vref is still reachable, from vdata, which is why the vat did not previously emit a `dropImports`)

I don't yet know if using the vref in a (virtual/durable) WeakStore would trigger the problem.
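At the userspace level, the triggering sequence looks roughly like this (an illustration of the steps above, not the actual unit test; `getPresence()` is a hypothetical stand-in for receiving a Presence from another vat):

```js
import { makeScalarBigMapStore } from '@agoric/vat-data';

// Illustrative sequence only, not the actual unit test (which uses the
// liveslots test harness). getPresence() is hypothetical: some way of
// obtaining an imported remote object.
const holder = makeScalarBigMapStore('holder', { durable: true });

let presence = getPresence();
holder.init('p', presence); // vdata pillar: the vref stays reachable

let ws = new WeakSet(); // inside a vat this is liveslots' vref-aware WeakSet
ws.add(presence);       // the vref becomes recognizable

presence = null;
// ...BOYD #1: the RAM pillar is gone, but the vdata pillar keeps the vref
//    reachable, so liveslots correctly emits no dropImports

ws = null;
// ...BOYD #2: the collected weak collection pushes the vref into
//    possiblyRetiredSet; finding no recognizers for it, liveslots emits
//    retireImports even though the vref is still reachable (the bug)
```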
@michaelfig tells me that `@agoric/vow` uses short-lived `WeakSet`s to detect resolution cycles, and the new (as of upgrade16) vat-network uses `@agoric/vow` extensively. That probably explains why we didn't see this before.

I think it might also be exacerbated by the `reapDirtThreshold.gcKrefs` that I landed on trunk last week (for the benefit of slow-vat-deletion). That probably causes BOYDs to happen faster than before, which might increase the chance that the Presence is collected in a different BOYD than the WeakSet being collected.
More specifically, Vow `watch` uses a WeakMap of WeakSets, keyed on a promise watcher (an exo object, so virtualized), with each WeakSet containing presences or representatives of vows.

The "then, the WeakMap/Set is deleted" would be caused by liveslots collecting the watcher in the first BOYD, which would cause the WeakSet to subsequently become collectible by the engine.
So the way I want to fix this is to make the liveslots `scanForDeadObjects` loop be safely idempotent: it should be safe, just horribly inefficient, to put every vref (or at least every `type: object` vref, so every void) into both `possiblyDeadSet` and `possiblyRetiredSet`, and have `scanForDeadObjects` recognize the ones that aren't actually dead or retired. Part of the required behavior is that we only retire something if it was already dropped (or marked for drop by the end of the BOYD).
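As a property, that idempotency goal would look something like this test-style sketch (hypothetical helpers, not real harness code):

```js
import assert from 'node:assert';

// Test-style sketch of the "safely idempotent" property; runBOYD and the
// two sets are hypothetical stand-ins for harness plumbing.
async function checkIdempotent({ allObjectVrefs, possiblyDeadSet, possiblyRetiredSet, runBOYD }) {
  const baseline = await runBOYD(); // GC syscalls emitted by an ordinary BOYD

  for (const vref of allObjectVrefs) {
    possiblyDeadSet.add(vref);      // flood both sets with every object vref
    possiblyRetiredSet.add(vref);
  }

  const flooded = await runBOYD();
  // "safe, just horribly inefficient": flooding must not cause any extra
  // drops or retires for vrefs that are still reachable or recognizable
  assert.deepStrictEqual(flooded, baseline);
}
```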
For these short-lived WeakSets and/or WeakMaps as used by vows, what are the keys? What are the expected lifetimes of those keys? Do we expect such keys to have been used as keys in many short-lived WeakSets/WeakMaps?
I ask because that falls into the one case where XS weak gc is expensive, and is expected to remain expensive. Attn @phoddie @patrick-soquet
In general, when we know the WeakSet/WeakMap is expected to have a shorter lifetime than its keys, and its keys will be reused across several such WeakSets/WeakMaps, then we should ask if we should just use a short-lived Set/Map instead. If we can, we should. That's why I am relaxed about the remaining expense of that one case.
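When that's the case, the substitution is straightforward (a minimal sketch, not anything from the vow code):

```js
// Minimal sketch: when the collection is known to die before its keys, a
// plain Set that is dropped as a unit behaves the same as a WeakSet, but
// without engaging weak-collection GC machinery at all.
function detectCycle(startNode, getNext) {
  const seen = new Set(); // short-lived: dropped when this function returns
  for (let node = startNode; node !== undefined; node = getNext(node)) {
    if (seen.has(node)) {
      return true; // we revisited a node: cycle found
    }
    seen.add(node);
  }
  return false; // chain terminated without a repeat
}
```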
> So the way I want to fix this is to make the liveslots `scanForDeadObjects` loop be safely idempotent

I realized I can't do that, at least not for imports (Presences).
If we'd included an additional "import status" key (e.g. `vNN.vc.vom.is.o-123`), where we track the vref's state relative to the kernel (so reachable vs merely-recognizable vs nothing), then we could compare "has RAM pillar" (`slotToVal.get(vref)`) OR "has vdata pillar" (`vom.rc.${vref}`) against the import status. If the vref was merely-recognizable or missing entirely, but the import status says "reachable", then we emit a `syscall.dropImports` and change the import status to "merely-recognizable". If the pillars are missing and we have no recognizers for the vref, but the import status is "merely-recognizable", then we emit a `syscall.retireImports` and delete the import status.
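In sketch form, that hypothetical scheme would be (key name and vatstore helpers invented for illustration; nothing like this exists today):

```js
// Hypothetical "import status" scheme; key name and helpers are invented.
const statusKey = vref => `vom.is.${vref}`; // would live under vNN.vc.vom.is.o-123

function reconcileImport(vref, { syscall, vatstore, hasRamPillar, hasVdataPillar, hasRecognizers }) {
  const pillars = hasRamPillar(vref) || hasVdataPillar(vref);
  let status = vatstore.get(statusKey(vref)); // 'reachable' | 'recognizable' | undefined

  if (status === 'reachable' && !pillars) {
    // the kernel still thinks we can reach it, but we no longer can
    syscall.dropImports([vref]);
    vatstore.set(statusKey(vref), 'recognizable');
    status = 'recognizable';
  }
  if (status === 'recognizable' && !pillars && !hasRecognizers(vref)) {
    // nothing can reach or recognize it anymore
    syscall.retireImports([vref]);
    vatstore.delete(statusKey(vref));
  }
}
```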
We can't really retrofit that on now, and it would cost an extra DB key per vref, so I'm not positive we'd be willing to pay that cost anyways. So we've got a constraint to manage: you must not add a vref to `possiblyDeadSet` that is not currently reachable. When we arrive in `scanForDeadObjects()`, we can rely upon every vref in `possiblyDeadSet` having an (implicit) import status of "reachable", and if our reachability check says "no", then we can safely `dropImports` it without contradicting any earlier syscalls. Likewise, you must not add a vref to `possiblyRetiredSet` that is not at least recognizable, so a recognizable===false result can safely do a `syscall.retireImports`.
One other improvement we can make is to remove some of the duplicate refcount checking. For a virtual/durable object (whose RAM pillar is a Representative), the first phase does a `.rc.${vref}` and `.es.${vref}` check (to see if the object is really dead), and the third phase deletes the VOM (which starts with an assertion of unreachability, which does a second `.rc` and `.es` check).
So I'm trying to sketch out a clean way to organize `scanForDeadObjects` that will avoid some of the duplication and also lead to a cleaner code path. I'm wondering if I can process one vref at a time, instead of handling all of `possiblyDeadSet` in one phase, then all of `possiblyRetiredSet` in a second phase, then all of `deadSet` in the third phase. And I'd like to separate the codepaths for our three kinds of trackable objects (with RAM pillars of Remotables, Representatives, and Presences, which map to vref shapes of `o+NN`, `o+dKK/NN`, and `o-NN`).

It may help to cache the reachability/recognizability check, but we have to be careful about cache invalidation. We could share the check between the first and second phases, but the moment we delete a VOM, we might invalidate the refcounts.
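The per-vref dispatch might look roughly like this (a sketch with invented helper names, not the planned implementation):

```js
// Sketch of per-vref processing, dispatching on vref shape; the tools.*
// helpers are invented. Vref shapes follow the examples above: o-NN for
// Presences, o+dKK/NN (with a '/' subid) for Representatives, o+NN for Remotables.
function classifyVref(vref) {
  if (vref.startsWith('o-')) return 'presence';
  if (vref.includes('/')) return 'representative';
  return 'remotable';
}

function processOneVref(vref, tools) {
  switch (classifyVref(vref)) {
    case 'presence':
      // may emit dropImports/retireImports, per the constraints above
      return tools.maybeDropOrRetireImport(vref);
    case 'representative':
      // single .rc/.es check, then VOM deletion if truly unreachable
      return tools.maybeDeleteVirtualObject(vref);
    default:
      // plain Remotable: export-side bookkeeping only
      return tools.maybeRetireExport(vref);
  }
}
```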
> I ask because that falls into the one case where XS weak gc is expensive, and is expected to remain expensive.

These are liveslots-virtualized WeakSets, used most likely (but not always) with virtual or durable objects as entries; as such, the GC profile of XS does not come into play, but instead it's the liveslots GC behavior that does.
> If we'd included an additional "import status" key (e.g. `vNN.vc.vom.is.o-123`), where we track the vref's state relative to the kernel (so reachable vs merely-recognizable vs nothing)

This sounds related to my suggested fix for https://github.com/Agoric/agoric-sdk/issues/9338: track in durable storage the full status of presences and representatives.
@warner the above commit that landed on master (064ff1ad395856111b4d82bb68d8ab92f8d83f12) is the one I'd like you to revert while closing this issue (#9939).
Michael observed an intermittent CI failure with:

I was able to reproduce it locally (and capture slogfiles) by re-running `yarn test test/orchestration/restart-contracts.test.ts` in `packages/boot` on that PR's branch a dozen times.

We're still analyzing the slogs, but it looks like liveslots misbehaved, and did a `retireImports` without doing a `dropImports` first.

@mhofman suspects that our use of `WeakRef` probing (rather than relying upon the `FinalizationRegistry`'s notification) is causing problems under V8 which we wouldn't see under XS, and this test is using local workers, not xsnap.
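For reference, the two GC-observation styles being contrasted look roughly like this (a generic sketch, not the liveslots `gcAndFinalize` implementation; assumes an engine-provided gc hook such as `node --expose-gc`):

```js
// Generic sketch of WeakRef probing vs FinalizationRegistry notification;
// not the liveslots gcAndFinalize code. Needs node --expose-gc (or similar)
// for globalThis.gc to exist.
const probes = new Map(); // vref -> WeakRef on the imported Presence
const collectedByRegistry = new Set();
const registry = new FinalizationRegistry(vref => collectedByRegistry.add(vref));

function track(vref, presence) {
  probes.set(vref, new WeakRef(presence));
  registry.register(presence, vref);
}

// Style 1: WeakRef probing. Force a GC, give finalizers a turn, then ask
// each WeakRef directly whether its target survived.
async function probeWithWeakRefs() {
  globalThis.gc?.(); // no-op unless the engine exposes a gc() hook
  await new Promise(resolve => setImmediate(resolve));
  const gone = [];
  for (const [vref, ref] of probes) {
    if (ref.deref() === undefined) {
      gone.push(vref);
    }
  }
  return gone;
}

// Style 2: rely on collectedByRegistry, which fills in whenever the engine
// chooses to run finalization callbacks; that timing is where V8 and XS differ.
```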