Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
326 stars 206 forks source link

Suspected liveslots sensitivity to engine allocations #6784

Open mhofman opened 1 year ago

mhofman commented 1 year ago

Describe the bug

While validating an upgrade to Moddable SDK 3.6.0 from the fork we've been running, I attempted to validate that an upgraded XS version could still execute transcripts of mainnet vats.

Liveslots is supposed to hide any gc effect from the user code, and hide organic gc from the kernel (as captured by transcripts), only revealing the result of a full gc in a deterministic manner when explicitly requested by bringOutYourDead. As such if the JavaScript behavior of the user program is the same, the transcript should be the same.

While we're aware of and are able to work around a gc sensitivity in virtual collections, something is causing other execution differences.

I've narrowed down the triggers of those differences to 5 commits in the Moddable SDK, which have all been confirmed to have an impact on allocations.

To Reproduce

Observed behaviors

warner commented 1 year ago

I looked into a few of these, here are some preliminary notes.

The first one ("XS: fuzzilli 38", vat v24, tnum=1547) is a makeWantMintedInvitation to o+19/1, in which the "expected" behavior has syscalls which allocate a new InvitationHandle (instance of durable Kind 12), but the "got" is a state fetch for o+19/1 (the target object). This indicates that liveslots had no Representative or state for the target of the message (I guess the GC variance caused the Representative to get collected early), and when makeWantMintedInvitation needs the state, it had to be paged in.

@FUDCo and I considered this during design/implementation, but it looks like we missed something. Our intention was that the Representative can be built without doing a vatstore read (the Kind ID is parsed out of the vref), so simply delivering a message to a vobj shouldn't trigger a syscall (or divergence if the Rep is/is-not currently present in the slotToVal table's WeakRef). The state object is built at the same time, but we don't need the actual data until (and unless) the method being invoked tries to do a property lookup, so simple behavior-less marker objects wouldn't need a syscall either. Only when the setter/getter fires, that's when we fetch from the vatstore. And we made it to perform that fetch unconditionally (even if the data is available in our LRU cache), specifically to make the syscalls a deterministic function of userspace behavior (knowing that the reads were redundant: the cache doesn't help).

Hm, I wonder if the write-back cache is involved: if userspace does:

then the 3-read must use the dirty data from the cache, for correctness, but we thought we were doing an unconditional read from the vatstore, for determinism. I don't remember considering the conflict between those two, which means I'm not confident that it does the right thing.

(@FUDCo and I have talked about removing that LRU cache entirely, or making it write-through, in #6693, partly because of how it confuses issues like this)

The last one ("XS profile ID by function definition", vat v6, tnum=12258) is a BringOutYourDead, which (in the transcripts provided) starts with a pair of vatstoreSet calls that flush the dirty contents of the virtual-object LRU cache.

Basically the vatstoreSet of a vom.o+NN/NN key is a LRU cache write, the vatstoreSet of a vom.dking.NN key is incrementing the nextInstanceID counter because a new virtual/durable object has just been allocated, and the vatstoreGet of a vom.rc.o+NN/NN key is checking the virtual reference counts for an object that has presumably just lost its RAM pillar and GC is deciding whether to delete it entirely.

I concur that we have more sensitivity to GC order than we first thought (i.e. #6360 and #6760 do not form an exhaustive list).

mhofman commented 1 year ago

Not sure if it can help, but I have generated new vat transcripts replaying chain activity using an updated version of XS (with different allocation behavior). Here is the diff vs the vat transcripts for the pismoC version of XS (as if that fixed version had been used from agd upgrade). I have the full transcripts if needed as well.

transcript-fixed-vs-updated-diff.tgz

Generated using

for i in {1..54}; do diff -U 5 <(zcat replay-fixed/transcript-v$i.sst.gz | jq -r .) <(cat replay-updated-xs-from-transcript/transcript-v$i.sst | head -n $(zcat replay-fixed/transcript-v$i.sst.gz | wc -l) | jq -r .) > transcript-fixed-vs-updated-v$i.diff ; done
warner commented 1 year ago

We'll use this ticket to track the sensitivies we discover, but we may or may not be able to fix all of them in time for the bulldozer release.

mhofman commented 1 year ago

Update: Moddable landed a patch for us which considers WeakRef strong during organic GC and only empties them during forced GC (which happens either during BYOD or snapshot making, which are both in consensus operations).

That effectively hides organic GC from liveslots, and thus from the kernel. That patch seem to have removed all the anachrophobia reported above. Testing methodology: new vat transcripts generated using chain-replay tool on pismoC + WeakRef GC patch cherry picked (see mhofman/6784-hide-organic-gc-pismo), replay transcripts on latest XS which includes the same WeakRef GC patch (see mhofman/6784-hide-organic-gc-updated-xs).

This should allow us to confidently update XS versions which contain allocation differences without triggering replay issues during node restart. Of course we should still fix the underlying issues in liveslots so we can disable this patch.

mhofman commented 1 year ago

One possible manifestation may be in https://github.com/Agoric/agoric-sdk/issues/7078, even though that issue seems fully non-deterministic, or at least a race between BOYD and vat termination

Edit: that issue was happening in a node based worker, explaining the full non-determinism.

warner commented 2 weeks ago

Not sure if it's relevant, but leaving a pointer just in case: I found that liveslots scanForDeadObjects() was not sorting a list of finalizer-gathered vrefs (the possiblyDeadSet) before processing it. That caused the order of refcount-checking syscall.vatstoreGet done during BOYD to depend on the order in which the engine decided to invoke the finalizer, which is something we had intended to hide. PR #9961 will fix the misbehavior by sorting the list of vrefs before doing the refcount checks.