Closed: warner closed this issue 1 year ago.
The artifact should contain the KV store trace, the xsnap traces, and all XS snapshots. I'm not sure if we include the full kernel DB in them, though.
Cool, thanks. The xsnap traces are probably the most important bit (for Moddable). I'll also see if I can reproduce the divergence locally from the transcript somehow.
The slogfiles diverge during a `dispatch.notify` delivery to v19 (a contract vat, not sure which) that resolved a `getUpdateSince()` Notifier promise to `1661180400`, which looks like a timestamp (exactly 8am 22-Aug-2022). The delivery was made at about 9:02:05am, so that looks like a TimerNotifier set to run every 2 or more hours.
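The timestamp interpretation is easy to sanity-check with plain JavaScript (nothing Agoric-specific):

```javascript
// Interpret the resolved Notifier value as seconds since the epoch.
const value = 1661180400n;
const when = new Date(Number(value) * 1000);
console.log(when.toISOString()); // 2022-08-22T15:00:00.000Z, i.e. 8:00am in US/Pacific (-0700)
```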
The very first thing that delivery provokes is a set of three `makeCollectFeesInvitation()` `syscall.send`s to three different previously-imported objects (A, B, C). But in the other validator, those messages are sent in a different order: B, C, A.
```diff
% diff -u val0.slog val1.slog |head -30
--- val0.slog 2022-08-22 12:58:31.000000000 -0700
+++ val1.slog 2022-08-22 12:59:15.000000000 -0700
@@ -34098,7 +34098,7 @@
 {"type":"clist","crankNum":2270,"mode":"drop","vatID":"v19","kobj":"kp249","vobj":"p+8"}
 {"type":"deliver","crankNum":2270,"vatID":"v19","deliveryNum":36,"replay":false,"kd":["notify",[["kp249",{"state":"fulfilled","data":{"body":"{\"updateCount\":{\"@qclass\":\"bigint\",\"digits\":\"461440\"},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1661180400\"}}","slots":[]}}]]],"vd":["notify",[["p+8",false,{"body":"{\"updateCount\":{\"@qclass\":\"bigint\",\"digits\":\"461440\"},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1661180400\"}}","slots":[]}]]]}
 {"type":"clist","crankNum":2270,"mode":"export","vatID":"v19","kobj":"kp623","vobj":"p+21"}
-{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":0,"replay":false,"ksc":["send","ko345",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp623"}],"vsc":["send","o-69",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+21"}]}
+{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":0,"replay":false,"ksc":["send","ko249",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp623"}],"vsc":["send","o-64",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+21"}]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":0,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":1,"replay":false,"ksc":["subscribe","v19","kp623"],"vsc":["subscribe","p+21"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":1,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
@@ -34113,7 +34113,7 @@
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":5,"replay":false,"ksc":["subscribe","v19","kp625"],"vsc":["subscribe","p+23"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":5,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"clist","crankNum":2270,"mode":"export","vatID":"v19","kobj":"kp626","vobj":"p+24"}
-{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":6,"replay":false,"ksc":["send","ko249",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp626"}],"vsc":["send","o-64",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+24"}]}
+{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":6,"replay":false,"ksc":["send","ko213",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp626"}],"vsc":["send","o-60",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+24"}]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":6,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":7,"replay":false,"ksc":["subscribe","v19","kp626"],"vsc":["subscribe","p+24"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":7,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
@@ -34128,7 +34128,7 @@
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":11,"replay":false,"ksc":["subscribe","v19","kp628"],"vsc":["subscribe","p+26"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":11,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"clist","crankNum":2270,"mode":"export","vatID":"v19","kobj":"kp629","vobj":"p+27"}
-{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":12,"replay":false,"ksc":["send","ko213",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp629"}],"vsc":["send","o-60",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+27"}]}
+{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":12,"replay":false,"ksc":["send","ko345",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp629"}],"vsc":["send","o-69",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+27"}]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":12,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":13,"replay":false,"ksc":["subscribe","v19","kp629"],"vsc":["subscribe","p+27"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":13,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
```
The wheels fall off shortly after that: the divergent delivery causes different computron counts in the two validators, and different `currentHeapCount` values. And then of course the difference in sent messages causes the `crankHash` to diverge, so the `activityHash` diverges (and never recovers).
Extracting the v19 transcript and replaying it locally results in divergence-from-transcript at the same point: delivery 36. This occurs on both x86 and aarch64, with both the XS/xsnap from trunk and the one from this CI run (PR #6011). And it occurs with both transcripts (the one extracted from the CI run's validator0, and the one from validator1). In both of my local replays (XS/xsnap from trunk, and from the PR) the vat's first syscall was to object C. The replay tool halts at the first sign of divergence, so I don't know what the second and third syscalls might have been.
| which | first | second | third |
|---|---|---|---|
| local | C (o-60) | ? | ? |
| CI val-0 | A (o-69) | B (o-64) | C (o-60) |
| CI val-1 | B (o-64) | C (o-60) | A (o-69) |
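The "halts at the first sign of divergence" behavior explains the `?` cells above: the replay check compares each syscall the vat makes against the recorded transcript entry and throws at the first mismatch, so nothing later is observed. A minimal sketch (function and field names here are hypothetical, not the actual swingset-vat API):

```javascript
// Hypothetical sketch of a transcript replay check: each syscall made
// during replay must match the one recorded at the same index, and the
// first mismatch halts the replay.
function checkSyscall(transcriptEntry, syscallIndex, actual) {
  const expected = transcriptEntry.syscalls[syscallIndex];
  if (JSON.stringify(expected) !== JSON.stringify(actual)) {
    throw Error(
      `divergence at syscall ${syscallIndex}: expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`,
    );
  }
}

const entry = { syscalls: [['send', 'o-69'], ['subscribe', 'p+21']] };
checkSyscall(entry, 0, ['send', 'o-69']); // matches: replay continues
try {
  checkSyscall(entry, 0, ['send', 'o-60']); // local replay sent to C first
} catch (e) {
  console.log(e.message); // divergence at syscall 0: ...
}
```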
Ugh, I suspect https://github.com/Agoric/agoric-sdk/runs/7962038882?check_suite_focus=true (in https://github.com/Agoric/agoric-sdk/pull/6029) is the same thing, but caught by a replaying unit test. That PR is not using the new XS at all. And of course it's not reproducing for me locally.
Oh, the #6029 failure was our old friend #5575, the GC problem (under v8, not XS). When I downloaded the CI logs, I found the following (in the middle of the log, hard to spot from the web UI):
```
2022-08-22T22:06:37.0330842Z anachrophobia strikes vat v1 on delivery 5
2022-08-22T22:06:37.0338666Z delivery completed with 4 expected syscalls remaining
2022-08-22T22:06:37.0342386Z expected: {"0":"vatstoreGet","1":"vom.rc.o-54","length":2}
2022-08-22T22:06:37.0343148Z expected: {"0":"vatstoreGetAfter","1":"","2":"vom.ir.o-54|","length":3}
2022-08-22T22:06:37.0343769Z expected: {"0":"dropImports","1":{"0":"o-54","length":1},"length":2}
2022-08-22T22:06:37.0348508Z expected: {"0":"retireImports","1":{"0":"o-54","length":1},"length":2}
2022-08-22T22:06:37.0354730Z REJECTED from ava test: (Error#1)
2022-08-22T22:06:37.0359850Z Error#1: historical inaccuracy in replay of v1
2022-08-22T22:06:37.0364131Z at Object.finishReplayDelivery (.../swingset-vat/src/kernel/vat-loader/transcript.js:81:13)
```
Not sure if it's related, but we got hit by another divergence in CI a couple of days ago: https://github.com/Agoric/agoric-sdk/pull/6001#issuecomment-1221569551, this time between validator and monitor nodes. It was a divergence in the swing-store trace, though, and didn't cause a consensus failure, so it's maybe not related (or maybe not a bug at all, just a wrong assumption in my checks).
OK, I'm able to provoke different behavior on my local machine (aarch64), using the new XS, while replaying the v19 transcript, by either A: performing a snapshot write after deliveryNum=2, or B: not. The behavior is (so far) deterministic within A or B, so I still don't have an explanation for why the two validators behaved differently from each other (but @mhofman, maybe you're deliberately restarting one of them to exercise just this sort of problem?).
I think the next step will be to gather the xsnap trace and hand it to the Moddable folks, along with our best guess about what the contract vat is doing. We might be able to speed things up for them by reducing it first, if we get lucky. It might be something like:
```js
const p = E(offvat).getUpdateSince();
p.then(val => E(A).fire());
p.then(val => E(B).fire());
p.then(val => E(C).fire());
```
and watch for A/B/C to happen in different orders. Except I want to find out exactly when the `getUpdateSince()` was sent, because it might be important that the snapshot write happen in between them.
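For context on why different orders would be so alarming: the ECMAScript spec requires fulfillment reactions registered on a single promise to run in registration order, so a correct engine must always fire A, B, C in that order. A standalone demonstration (plain promises, no Agoric `E()` involved):

```javascript
// Per spec, .then() reactions on one promise run in the order they were
// registered. Any cross-validator difference in this order points at
// engine-level nondeterminism, not at the contract code.
const order = [];
const p = Promise.resolve('value');
p.then(() => order.push('A'));
p.then(() => order.push('B'));
p.then(() => order.push('C'));
p.then(() => console.log(order.join(','))); // A,B,C
```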
@arirubinstein Let's discuss whether this is strictly necessary for PSM launch. This is blocking us from updating to the latest Moddable XS SDK.
Closing as likely fixed by some of the recent Moddable changes. There are 2 remaining suspicious differences in execution on the latest SDK version. The work to fix those and update is tracked at #6759.
I'm struggling to understand the diagnosis here. Does anyone have a brief explanation handy?
I'd like to promote it to the title so that rather than "behavior difference" we have "vanilla found when chocolate expected" or some such.
The thing is, I'm not sure which of the behaviors is expected, if any. The only thing we know is that all these workers ought to be doing the same thing as each other, and they did not; hence the observed divergence.
I will re-open, try to reproduce the divergence, then upgrade to the latest Moddable SDK with reverts of known issues (see #6759), and verify the observed divergence no longer exists.
no need to re-open on my account! Doing all that sounds like "no, it's not handy"
Not for you. I re-read the symptoms more closely, and I'm not sure it's related to the array-sort or Map behavior that are now fixed, so I'd prefer to confirm.
I have done the following to verify that this problem is indeed another manifestation of one of the divergences we've since resolved:
I have not tried to bisect which patch exactly fixes the issue, but I'm satisfied that this problem is fixed and won't be occurring in the upgrade performed by #6768.
**Describe the bug**
Yesterday's CI run (on https://github.com/Agoric/agoric-sdk/pull/6011 to pull in the latest XS release) experienced a failure inside the deployment-test, in which two validators behaved differently:
https://github.com/Agoric/agoric-sdk/runs/7955587167?check_suite_focus=true
The `diff` reads:

I think that indicates that one run observed an object getting garbage collected during a different delivery than the other, so the refcount-modifying syscalls they made appeared in a different order, as did their `syscall.dropImports`. This is not supposed to happen, of course.

@mhofman do you happen to know if the CI run produces GitHub Actions "artifacts" that we could download for further study? I know you've investigated divergences like this before; any ideas on how to proceed?