Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
327 stars 208 forks source link

XS behavior divergence during CI #6022

Closed warner closed 1 year ago

warner commented 2 years ago

Describe the bug

Yesterday's CI run (on https://github.com/Agoric/agoric-sdk/pull/6011 to pull in the latest XS release) experienced a failure inside the deployment-test, in which two validators behaved differently:

https://github.com/Agoric/agoric-sdk/runs/7955587167?check_suite_focus=true

The diff reads:

--- /dev/fd/63  2022-08-22 16:06:31.727630864 +0000
+++ /dev/fd/62  2022-08-22 16:06:31.723630598 +0000
@@ -48679,2 +48679,2 @@
-set ko345.refCount 4,4
-set acceptanceQueue.1159 {"type":"send","target":"ko345","msg":{"methargs":{"bo
+set ko249.refCount 4,4
+set acceptanceQueue.1159 {"type":"send","target":"ko249","msg":{"methargs":{"bo
@@ -48706,2 +48706,2 @@
-set ko249.refCount 4,4
-set acceptanceQueue.1162 {"type":"send","target":"ko249","msg":{"methargs":{"bo
+set ko213.refCount 4,4
+set acceptanceQueue.1162 {"type":"send","target":"ko213","msg":{"methargs":{"bo
@@ -48731,2 +48731,2 @@
-set ko213.refCount 4,4
-set acceptanceQueue.1165 {"type":"send","target":"ko213","msg":{"methargs":{"bo
+set ko345.refCount 4,4
+set acceptanceQueue.1165 {"type":"send","target":"ko345","msg":{"methargs":{"bo
@@ -48771 +48771 @@
-set activityhash c2336968be5f3bce3752df7e5d3df32727f8758896005123b0988ec26e233e
+set activityhash 388f6454f586010f36a755c2d86aedebc992e3681041c18e74398b03aec680
@@ -48773 +48773 @@
-set runQueue.1111 {"type":"send","target":"ko345","msg":{"methargs":{"body":"[\
+set runQueue.1111 {"type":"send","target":"ko249","msg":{"methargs":{"body":"[\
@@ -48780 +48780 @@
-set activityhash 68be83bebe01e953e65cea9299a0def4deaa465e89341f525c35e7f8fef215
+set activityhash 33fdf8b639e53a4dd92e9f70d3bfd696d6872de2e5492a29cc72486fb87099
@@ -48789 +48789 @@
-set activityhash a3e45b4df59c3dd99b9e6fafdbc6ef1ad3e86591445e4aa220e64c83aab6c6
+set activityhash 7d1cd73543f6623f21076cd2ac2b8239bece0f462022e719b8e6e57855e66d
@@ -48798 +48798 @@
-set activityhash bc2263926361f48c0be4e9fc9972480668e878fd1393ab9630ebf932fc0a59
+set activityhash 4118e92fc0a931b41c93eeeebad6124a074fb4a1b98cc6e14eb13982266b00
@@ -48800 +48800 @@
-set runQueue.1113 {"type":"send","target":"ko249","msg":{"methargs":{"body":"[\
+set runQueue.1113 {"type":"send","target":"ko213","msg":{"methargs":{"body":"[\
@@ -48807 +48807 @@
-set activityhash bd80d42ff4e22b53068a508a997fd7db3f2693e44984da49e5ed0c3e33cc51
+set activityhash 218c6e7afd3b6e214e6e1ca66566e3d58822de78a2a47e895c3e7344aa1555
@@ -48816 +48816 @@
-set activityhash bc41e6ea95eaa423b4e18bfcc51b7e2d6f05f05323d0ef4ead465d200319b0
+set activityhash 21748fd51b665940c452a8b10997f936588d81ce1d6e3cc7fcc34d1f0577d7
@@ -48825 +48825 @@
-set activityhash 3ed024c425174a9beeeec089d6b912617b4d9be8579ea28a7f165a34de051a
+set activityhash f5095e0bb3cc89d34fac1a7fe0dd733039adeb394090af224f041b54f5c69d
@@ -48827 +48827 @@
-set runQueue.1115 {"type":"send","target":"ko213","msg":{"methargs":{"body":"[\
+set runQueue.1115 {"type":"send","target":"ko345","msg":{"methargs":{"body":"[\
@@ -48834 +48834 @@
-set activityhash 709e45011738910d1d603baf1fec55da97e9474060cdbb566e76fe1f3896ea
+set activityhash b0656df169c002001d2c95fe47d1aa3f1eed383f21a495c5dbf680662a49ce
@@ -48843 +48843 @@
-set activityhash d44d5dffbdd57312106700562332683ca675b08f6eb4edcb55d0741613654c
+set activityhash dd5626ead6ad8a8763e80ab3e2874610db8f0bcbd6147d91064cf9d3b6229d
@@ -48852 +48852 @@
-set activityhash 64d53460be49746324eddb5e8ac21ac98f9995d4b1c1144572a852f7af3d96
+set activityhash 33809577489b45db9561e3c9441c6e02e016b176f0c1a6f6d747a1cebeb40a
@@ -48861 +48861 @@
-set activityhash 5f0967247f31459383f11c80cd5608fa6bf4732cd7103b68b590503e05ba06
+set activityhash 220660cffe0b11c158a061927ccfd751e207b8c8fe7144d6639761c546bca8
@@ -48890 +48890 @@
-set activityhash 0866dea0a88703c8832a47ba1ed2bd95b69a35e2c1a79195f98799f9646c69
+set activityhash 86ccd19f024aec632bed119d46d643286e54735be9507222d11fbe2484c2f8
@@ -48893 +48893 @@
-set host.chainSends [[[2,"{\"method\":\"bindPort\",\"packet\":{\"source_port\":
+set host.chainSends [[[2,"{\"method\":\"bindPort\",\"packet\":{\"source_port\":
Error: Swingstore trace mismatch between validators

I think that indicates that one run observed an object getting garbage collected during a different delivery than the other, so the refcount-modifying syscalls they made appeared in a different order, as well as their syscall.dropImports. This is not supposed to happen, of course.

@mhofman do you happen to know if the CI run products GitHub Actions "artifacts" that we could download, for further study? I know you've investigated divergences like this before, any ideas on how to proceed?

mhofman commented 2 years ago

The artifact should contain kV store trace, xsnap traces, all XS snapshots. I'm not sure if we have the full kernel db in them though

warner commented 2 years ago

cool, thanks, the xsnap traces are probably the most important bit (for Moddable), I'll also see if I can reproduce the divergence locally from the transcript somehow

warner commented 2 years ago

The slogfiles diverge during a dispatch.notify delivery to v19 (a contract vat, not sure which) that resolved a getUpdateSince() Notifier promise to 1661180400, which looks like a timestamp (exactly 8am 22-Aug-2022). The delivery was made about 9:02:05am, so that looks like a TimerNotifier set to run every 2 or more hours.

The very first thing that delivery provokes is a set of three makeCollectFeesInvitation() syscall.sends to three different previously-imported objects (A,B,C). But in the other validator, those messages are sent in a different order: B,C,A.

% diff -u val0.slog val1.slog |head -30
--- val0.slog   2022-08-22 12:58:31.000000000 -0700
+++ val1.slog   2022-08-22 12:59:15.000000000 -0700
@@ -34098,7 +34098,7 @@
 {"type":"clist","crankNum":2270,"mode":"drop","vatID":"v19","kobj":"kp249","vobj":"p+8"}
 {"type":"deliver","crankNum":2270,"vatID":"v19","deliveryNum":36,"replay":false,"kd":["notify",[["kp249",{"state":"fulfilled","data":{"body":"{\"updateCount\":{\"@qclass\":\"bigint\",\"digits\":\"461440\"},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1661180400\"}}","slots":[]}}]]],"vd":["notify",[["p+8",false,{"body":"{\"updateCount\":{\"@qclass\":\"bigint\",\"digits\":\"461440\"},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1661180400\"}}","slots":[]}]]]}
 {"type":"clist","crankNum":2270,"mode":"export","vatID":"v19","kobj":"kp623","vobj":"p+21"}
-{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":0,"replay":false,"ksc":["send","ko345",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp623"}],"vsc":["send","o-69",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+21"}]}
+{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":0,"replay":false,"ksc":["send","ko249",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp623"}],"vsc":["send","o-64",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+21"}]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":0,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":1,"replay":false,"ksc":["subscribe","v19","kp623"],"vsc":["subscribe","p+21"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":1,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
@@ -34113,7 +34113,7 @@
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":5,"replay":false,"ksc":["subscribe","v19","kp625"],"vsc":["subscribe","p+23"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":5,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"clist","crankNum":2270,"mode":"export","vatID":"v19","kobj":"kp626","vobj":"p+24"}
-{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":6,"replay":false,"ksc":["send","ko249",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp626"}],"vsc":["send","o-64",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+24"}]}
+{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":6,"replay":false,"ksc":["send","ko213",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp626"}],"vsc":["send","o-60",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+24"}]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":6,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":7,"replay":false,"ksc":["subscribe","v19","kp626"],"vsc":["subscribe","p+24"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":7,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
@@ -34128,7 +34128,7 @@
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":11,"replay":false,"ksc":["subscribe","v19","kp628"],"vsc":["subscribe","p+26"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":11,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"clist","crankNum":2270,"mode":"export","vatID":"v19","kobj":"kp629","vobj":"p+27"}
-{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":12,"replay":false,"ksc":["send","ko213",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp629"}],"vsc":["send","o-60",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+27"}]}
+{"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":12,"replay":false,"ksc":["send","ko345",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"kp629"}],"vsc":["send","o-69",{"methargs":{"body":"[\"makeCollectFeesInvitation\",[]]","slots":[]},"result":"p+27"}]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":12,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}
 {"type":"syscall","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":13,"replay":false,"ksc":["subscribe","v19","kp629"],"vsc":["subscribe","p+27"]}
 {"type":"syscall-result","crankNum":2270,"vatID":"v19","deliveryNum":36,"syscallNum":13,"replay":false,"ksr":["ok",null],"vsr":["ok",null]}

The wheels fall off shortly after that: the divergent delivery causes different computron counts in the two validators, and different currentHeapCount values. And then of course the difference in sent messages causes the crankHash to diverge, so the activityHash diverges (and never recovers).

warner commented 2 years ago

Extracting the v19 transcript and replaying it locally results in divergence-from-transcript at the same point: delivery 36. This occurs on both x86 and aarch64, with both the XS/xsnap from trunk and the one from this CI run (PR #6011). And it occurs with both transcripts (the one extracted from the CI run's validator0, and the one from validator1). In both of my local replays (XS/xsnap from trunk, and from the PR) the vat's first syscall was to object C. The replay tool halts at the first sign of divergence, so I don't know what the second and third syscall might have been.

which first second third
local C (o-60) ? ?
CI val-0 A (o-69) B (o-64) C (o-60)
CI val-1 B (o-64) C (o-60) A (o-69)
warner commented 2 years ago

Ugh I suspect https://github.com/Agoric/agoric-sdk/runs/7962038882?check_suite_focus=true (in https://github.com/Agoric/agoric-sdk/pull/6029) is the same thing, but caught by a replaying unit test. And that PR is not using the new XS at all. And it's not reproducing for me locally, of course.

warner commented 2 years ago

Oh, the #6029 failure was our old friend #5575, the GC problem (under v8, not XS), when I downloaded the CI logs, I found the following (in the middle of the log, hard to spot from the web UI)

2022-08-22T22:06:37.0330842Z anachrophobia strikes vat v1 on delivery 5
2022-08-22T22:06:37.0338666Z delivery completed with 4 expected syscalls remaining
2022-08-22T22:06:37.0342386Z expected: {"0":"vatstoreGet","1":"vom.rc.o-54","length":2}
2022-08-22T22:06:37.0343148Z expected: {"0":"vatstoreGetAfter","1":"","2":"vom.ir.o-54|","length":3}
2022-08-22T22:06:37.0343769Z expected: {"0":"dropImports","1":{"0":"o-54","length":1},"length":2}
2022-08-22T22:06:37.0348508Z expected: {"0":"retireImports","1":{"0":"o-54","length":1},"length":2}
2022-08-22T22:06:37.0354730Z REJECTED from ava test: (Error#1)
2022-08-22T22:06:37.0359850Z Error#1: historical inaccuracy in replay of v1
2022-08-22T22:06:37.0364131Z   at Object.finishReplayDelivery (.../swingset-vat/src/kernel/vat-loader/transcript.js:81:13)
mhofman commented 2 years ago

Not sure if it's related but we got hit by another divergence in CI a couple days ago: https://github.com/Agoric/agoric-sdk/pull/6001#issuecomment-1221569551, this time between validator and monitor nodes. It was a divergence in the swing store trace though, and didn't cause a consensus failure, so maybe not related (or maybe not a bug at all, just a wrong assumption in my checks)

warner commented 2 years ago

Ok I'm able to provoke different behavior, on my local machine (aarch64), using the new XS, while replaying the v19 transcript, by either A: performing a snapshot write after deliveryNum=2, or B: not. The behavior (so far) is deterministic within A or B, so I still don't have an explanation for why the two validators behaved differently from each other (but @mhofman maybe you're deliberately restarting one of them to exercise just this sort of problem?).

I think the next step will be to gather the xsnap trace and hand it to the Moddable folks, along with our best guess about what the contract vat is doing. We might be able to speed things up for them by reducing it first, if we get lucky. It might be something like:

const p = E(offvat).getUpdateSince();
p.then(val => E(A).fire());
p.then(val => E(B).fire());
p.then(val => E(C).fire());

and watch for A/B/C to happen in different orders. Except I want to find out exactly when the getUpdateSince was sent, because it might be important that the snapshot write happen in between them.

Tartuffo commented 2 years ago

@arirubinstein Let's discuss whether this is strictly necessary for PSM launch. This is blocking us from updating to the latest Moddable XS SDK.

mhofman commented 1 year ago

Closing as likely fixed by some of the recent Moddable changes. There are 2 remaining suspicious differences in execution on the latest SDK version. The work to fix those and update is tracked at #6759.

dckc commented 1 year ago

I'm struggling to understand the diagnosis here. Does anyone have a brief explanation handy?

I'd like to promote it to the title so that rather than "behavior difference" we have "vanilla found when chocolate expected" or some such.

mhofman commented 1 year ago

The thing is I'm not sure which of the behavior is expected, if any. The only thing we know is that all these workers ought to be doing the same thing when compared to each other, and they did not, hence the divergence observed.

mhofman commented 1 year ago

I will re-open, try to reproduce the divergence, then upgrade to the latest Moddable SDK with reverts of known issues (see #6759), and verify the observed divergence no longer exists.

dckc commented 1 year ago

no need to re-open on my account! Doing all that sounds like "no, it's not handy"

mhofman commented 1 year ago

Not for you. I re-read more closely the symptoms and I'm not sure it's related to the array sort or Map behavior that are now fixed so I'd prefer confirm

mhofman commented 1 year ago

I have done the following to verify this problem is indeed another representation of one of the divergence we've since resolved:

I have not tried to bisect which patch exactly fixes the issue, but I'm satisfied that this problem is fixed and won't be occurring in the upgrade performed by #6768.