Open mhofman opened 1 year ago
I looked into a few of these, here are some preliminary notes.
The first one ("XS: fuzzilli 38", vat v24, tnum=1547) is a `makeWantMintedInvitation` to o+19/1, in which the "expected" behavior has syscalls which allocate a new InvitationHandle (an instance of durable Kind 12), but the "got" is a state fetch for o+19/1 (the target object). This indicates that liveslots had no Representative or state for the target of the message (I guess the GC variance caused the Representative to get collected early), so when `makeWantMintedInvitation` needed the state, it had to be paged in.
@FUDCo and I considered this during design/implementation, but it looks like we missed something. Our intention was that the Representative can be built without doing a vatstore read (the Kind ID is parsed out of the vref), so simply delivering a message to a vobj shouldn't trigger a syscall (or diverge based on whether the Rep is or is not currently present in the `slotToVal` table's WeakRef). The state object is built at the same time, but we don't need the actual data until (and unless) the invoked method tries a property lookup, so simple behavior-less marker objects wouldn't need a syscall either. Only when a setter/getter fires do we fetch from the vatstore. And we made that fetch unconditional (even if the data is available in our LRU cache), specifically to make the syscalls a deterministic function of userspace behavior (knowing that the reads were redundant: the cache doesn't help).
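To make the intended design concrete, here is a minimal sketch (all names invented; this is not the actual liveslots code) of a state object whose accessors hit the vatstore on every access, so the syscall pattern depends only on userspace property reads and writes, never on cache contents:

```javascript
// Hypothetical sketch: lazy state whose getters/setters perform
// unconditional vatstore reads, for deterministic syscall traces.
function makeStateObject(vref, vatstoreGet, vatstoreSet, props) {
  const state = {};
  for (const prop of props) {
    Object.defineProperty(state, prop, {
      get() {
        // unconditional read: even if a cache holds the value, issue the
        // fetch anyway so the read trace doesn't depend on cache state
        return JSON.parse(vatstoreGet(`vom.${vref}`))[prop];
      },
      set(value) {
        const data = JSON.parse(vatstoreGet(`vom.${vref}`));
        data[prop] = value;
        vatstoreSet(`vom.${vref}`, JSON.stringify(data));
      },
    });
  }
  return state;
}
```

Under this scheme, merely building the state object issues no syscall; only property access does.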
Hm, I wonder if the write-back cache is involved: if userspace writes a state property and then reads it back, the read must use the dirty data from the cache, for correctness, but we thought we were doing an unconditional read from the vatstore, for determinism. I don't remember considering the conflict between those two, which means I'm not confident that the code does the right thing.
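A toy illustration of that conflict (invented names, not the real cache code): the dirty value lands only in the write-back cache, so a read that bypasses the cache for determinism would observe stale data.

```javascript
// Hypothetical write-back-cache vs. unconditional-read conflict:
// the vatstore is stale until the dirty cache entry is flushed.
function demoWriteBackConflict() {
  const vatstore = new Map([["vom.o+12/3", JSON.stringify({ count: 1 })]]);
  const cache = new Map(); // write-back: dirty entries flushed later

  // userspace write: the new value lands only in the cache, marked dirty
  cache.set("vom.o+12/3", JSON.stringify({ count: 2 }));

  // a subsequent read that bypasses the cache sees the stale value...
  const stale = JSON.parse(vatstore.get("vom.o+12/3")).count;
  // ...while correctness requires the dirty cached value
  const correct = JSON.parse(cache.get("vom.o+12/3")).count;
  return { stale, correct };
}
```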
(@FUDCo and I have talked about removing that LRU cache entirely, or making it write-through, in #6693, partly because of how it confuses issues like this)
The last one ("XS profile ID by function definition", vat v6, tnum=12258) is a BringOutYourDead, which (in the transcripts provided) starts with a pair of `vatstoreSet` calls that flush the dirty contents of the virtual-object LRU cache. Basically, a `vatstoreSet` of a `vom.o+NN/NN` key is an LRU cache write, a `vatstoreSet` of a `vom.dkind.NN` key is incrementing the `nextInstanceID` counter because a new virtual/durable object has just been allocated, and a `vatstoreGet` of a `vom.rc.o+NN/NN` key is checking the virtual reference counts for an object that has presumably just lost its RAM pillar, while GC decides whether to delete it entirely.
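The three key shapes above can be told apart mechanically. A small classifier, purely illustrative (key formats taken from the description above, including the `vom.dkind.NN` spelling, not from the liveslots source):

```javascript
// Illustrative classifier for the vatstore key shapes mentioned above.
function classifyVatstoreKey(key) {
  if (/^vom\.rc\.o\+\d+\/\d+$/.test(key)) return "refcount"; // GC refcount check
  if (/^vom\.dkind\.\d+$/.test(key)) return "kind-meta"; // e.g. nextInstanceID
  if (/^vom\.o\+\d+\/\d+$/.test(key)) return "state"; // LRU cache flush
  return "other";
}
```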
I concur that we have more sensitivity to GC order than we first thought (i.e. #6360 and #6760 do not form an exhaustive list).
Not sure if it can help, but I have generated new vat transcripts replaying chain activity using an updated version of XS (with different allocation behavior). Here is the diff vs the vat transcripts for the pismoC version of XS (as if that fixed version had been used from agd upgrade). I have the full transcripts if needed as well.
transcript-fixed-vs-updated-diff.tgz
Generated using:

```sh
for i in {1..54}; do diff -U 5 <(zcat replay-fixed/transcript-v$i.sst.gz | jq -r .) <(cat replay-updated-xs-from-transcript/transcript-v$i.sst | head -n $(zcat replay-fixed/transcript-v$i.sst.gz | wc -l) | jq -r .) > transcript-fixed-vs-updated-v$i.diff ; done
```
We'll use this ticket to track the sensitivities we discover, but we may or may not be able to fix all of them in time for the bulldozer release.
Update: Moddable landed a patch for us which treats WeakRefs as strong during organic GC and only empties them during forced GC (which happens during either BOYD or snapshot-making, both of which are in-consensus operations).
That effectively hides organic GC from liveslots, and thus from the kernel. The patch seems to have removed all the anachrophobia reported above. Testing methodology: new vat transcripts were generated using the chain-replay tool on pismoC with the WeakRef GC patch cherry-picked (see mhofman/6784-hide-organic-gc-pismo), then those transcripts were replayed on the latest XS, which includes the same WeakRef GC patch (see mhofman/6784-hide-organic-gc-updated-xs).
This should allow us to confidently update to XS versions that contain allocation differences, without triggering replay issues during node restart. Of course we should still fix the underlying issues in liveslots so we can disable this patch.
One possible manifestation may be https://github.com/Agoric/agoric-sdk/issues/7078, even though that issue seems fully non-deterministic, or at least a race between BOYD and vat termination.
Edit: that issue was happening in a Node-based worker, which explains the full non-determinism.
Not sure if it's relevant, but leaving a pointer just in case: I found that liveslots' `scanForDeadObjects()` was not sorting the list of finalizer-gathered vrefs (the `possiblyDeadSet`) before processing it. That caused the order of the refcount-checking `syscall.vatstoreGet` calls done during BOYD to depend on the order in which the engine decided to invoke the finalizer, which is something we had intended to hide. PR #9961 will fix the misbehavior by sorting the list of vrefs before doing the refcount checks.
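The shape of that fix can be sketched in a few lines (simplified; see PR #9961 for the real change): sort the gathered vrefs before issuing the refcount reads, so the syscall order is independent of finalizer invocation order.

```javascript
// Sketch: derive a deterministic refcount-check order from a
// possibly-engine-ordered set of finalizer-gathered vrefs.
function refcountCheckOrder(possiblyDeadSet) {
  // lexicographic sort makes the result independent of insertion order
  return [...possiblyDeadSet].sort();
}
```

Any two runs that collect the same set of vrefs, in whatever order the engine fired the finalizers, now produce the same sequence of refcount-check syscalls.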
Describe the bug
While validating an upgrade to Moddable SDK 3.6.0 from the fork we've been running, I attempted to validate that an upgraded XS version could still execute transcripts of mainnet vats.
Liveslots is supposed to hide any GC effect from user code, and to hide organic GC from the kernel (as captured by transcripts), only revealing the result of a full GC in a deterministic manner when explicitly requested by bringOutYourDead. As such, if the JavaScript behavior of the user program is the same, the transcript should be the same.
While we're aware of (and able to work around) a GC sensitivity in virtual collections, something else is causing execution differences.
I've narrowed down the triggers of those differences to 5 commits in the Moddable SDK, all of which have been confirmed to affect allocations.
To Reproduce
1. Check out `release-pismo` with the Moddable SDK updated to 3.6.0, and select reverts on top to make all transcripts run as they initially did.
2. Revert the relevant commit in the `moddable` submodule to trigger the corresponding error. Please note that `yarn build` in `packages/xsnap` automatically restores the moddable submodule commit, unless it's added to the index. In `packages/xsnap/moddable`:
   ```sh
   git reset --hard mhofman/sync-sdk-keep-behavior
   git revert $commitsha
   (cd .. && git add moddable && yarn build)
   ```
3. Replay the transcript:
   ```sh
   /path/to/agoric-sdk/packages/SwingSet/misc-tools/replay-transcript.js --no-forced-reload-from-snapshot transcript-v$i.sst
   ```
Observed behaviors
(partial results)