warner opened this issue 3 years ago
On branch warner-phase3-loadgen (ac4cb3983dbd38354e048ca500deeffd4ce85274), which is trunk at f052169eedcdbaae017e62e9ea5a0a79c37026c7 plus some logging fixes (and running all non-comms vats on XS), I saw the load generator halt when the chain stopped accepting messages. Looking further, the comms vat had rejected a message whose internal sequence number was not one greater than that of the previously-received message. By examining the slogfile, we found the comms vat was handed the following sequence of deliveries:
$ cat comms-halt/chain-1.slog| \
jq -c 'select(.type == "deliver" and .vatID == "v13" and .kd[0] == "message" and .kd[2].method == "receive") | .kd[2].args.body|fromjson | .[0]' | cut -c-500
"717:644:resolve:fulfill:rp-752:ro+237;{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}"
"718:648:deliver:ro+24:getId:rp-757:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"719:648:deliver:ro+71:deposit:rp-758:ro+240;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}]"
"720:648:deliver:ro+70:deposit:rp-759:ro+239;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}]"
"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]"
"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]"
"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]"
"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]"
"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1858053\"}}]"
"730:657:deliver:ro+24:getId:rp-771:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"731:657:deliver:ro+24:getId:rp-772:ro+61;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0}]"
"732:657:deliver:ro+70:getCurrentAmount:rp-773;[]"
"733:657:deliver:ro+77:getUpdateSince:rp-774;[48]"
"734:657:resolve:fulfill:rp-768:ro+241;{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}"
"735:657:resolve:reject:rp-767;{\"@qclass\":\"error\",\"errorId\":\"error:liveSlots:v2#70001\",\"message\":\"\\\"unknown\\\" connection closed\",\"name\":\"Error\"}"
The gap between 725 and 730 is the proximate cause: comms was expecting 726 but received 730. All further messages will be rejected until it sees 726.
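The in-order requirement behaves like the following minimal sketch (the names here are invented for illustration; this is not the actual comms-vat source): each inbound message must carry exactly the next expected sequence number, so once a gap appears, every subsequent message is rejected until the missing one arrives.

```javascript
// Minimal illustration (not the real comms vat) of the strict in-order
// seqnum check described above: expecting 726 but receiving 730 causes
// this message, and every later one, to be rejected.
function makeInboundChecker(firstExpected) {
  let expected = firstExpected;
  return (seqnum) => {
    if (seqnum !== expected) {
      throw new Error(`expected seqnum ${expected}, got ${seqnum}`);
    }
    expected += 1;
  };
}

const check = makeInboundChecker(725);
check(725); // ok, now expecting 726
// check(730) would throw: expected seqnum 726, got 730
```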
Looking at the vat-tp deliveries in the same slogfile, we see it received deliverInboundMessages calls with batches of the following envelope (outer) sequence numbers:
$ cat comms-halt/chain-1.slog| \
jq -c 'select(.type == "deliver" and .vatID == "v14" and .kd[0] == "message" and .kd[2].method == "deliverInboundMessages")' | \
jq -c '.kd[2].args.body|fromjson | .[1] | map(.[0].digits)'
["709","710","711","712","713"]
["714","715","716","717"]
["718","719","720"]
["721","722","723","724","725","726","727","728","729"]
["730","731","732","733","734"]
["735"]
The crank which delivered messages 721-729 emitted the following syscalls:
$ cat comms-halt/chain-1.slog |jq -c 'select(.crankNum == 5863 and .type == "syscall") | .vsc'
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"721\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"722\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"723\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"724\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"725\"}]","slots":[]}]
["send","o-50",{"method":"receive","args":{"body":"[\"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]\"]","slots":[]},"result":"p+725"}]
["subscribe","p+725"]
["send","o-50",{"method":"receive","args":{"body":"[\"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]\"]","slots":[]},"result":"p+726"}]
["subscribe","p+726"]
["send","o-50",{"method":"receive","args":{"body":"[\"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]\"]","slots":[]},"result":"p+727"}]
["subscribe","p+727"]
["send","o-50",{"method":"receive","args":{"body":"[\"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]\"]","slots":[]},"result":"p+728"}]
["subscribe","p+728"]
["send","o-50",{"method":"receive","args":{"body":"[\"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\\\"brand\\\":{\\\"@qclass\\\":\\\"slot\\\",\\\"iface\\\":\\\"Alleged: RUN brand\\\",\\\"index\\\":0},\\\"value\\\":{\\\"@qclass\\\":\\\"bigint\\\",\\\"digits\\\":\\\"1858053\\\"}}]\"]","slots":[]},"result":"p+729"}]
["subscribe","p+729"]
The code which executes this crank is:
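A rough sketch of that delivery loop, reconstructed from the syscall trace and the discussion below (the E()/D() stubs are illustrative stand-ins, not the real liveslots implementations), which also shows why all the callNow syscalls precede the sends in the trace:

```javascript
// Sketch of the vattp delivery loop, reconstructed from the discussion;
// E() and D() here are illustrative stubs, not the real liveslots code.
const syscalls = [];

// D() performs an immediate device call (syscall.callNow).
const D = (_device) => ({
  ackInbound: (num) => syscalls.push(['callNow', 'ackInbound', num]),
});

// E() queues the message; the syscall.send happens on a later turn.
const E = (_target) => ({
  receive: (_body, num) =>
    Promise.resolve().then(() => syscalls.push(['send', 'receive', num])),
});

function deliverInboundMessages(i, mailbox, newMessages) {
  newMessages.forEach(([num, body]) => {
    if (num > i.highestDelivered) {
      E(i.receiver).receive(body, num); // deferred: send emitted later
      i.highestDelivered = num;
      D(mailbox).ackInbound(num);       // immediate: callNow emitted now
    }
  });
}

const state = { highestDelivered: 720n, receiver: {} };
deliverInboundMessages(state, {}, [[721n, 'a'], [722n, 'b']]);
// Synchronously, only the two callNow syscalls are present; the sends
// appear on subsequent promise-queue turns, matching the trace above.
```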
The syscalls suggest that the loop was interrupted somewhere after the D(mailbox).ackInbound on the loop pass for message number 725, and before the E(i.receiver) on the following pass for message number 726. If it had made it past that E(i.receiver), we'd expect to see a syscall.send() for 726 (although if the worker were somehow interrupted, or the crank ended early, that syscall might not make it out). If it had made it even further, to the D(mailbox) for 726, we'd expect to see the additional callNow, because these are executed right away (whereas E() schedules a syscall.send on some subsequent promise-queue turn).
@FUDCo analyzed a similar trace whose deliverInboundMessages was given a batch of size 1, and it failed to emit any syscalls. This suggests the fault happened before the E() call, rather than after the D() call.
We're struggling to imagine something that could cause this. We've brainstormed:
- Duplicate seqnums in the batch: the (num > i.highestDelivered) check would bypass some of them (we rely upon the sender to send internally ordered batches, and to send those batches in order, but we tolerate replays), but if so we would expect to see the syscalls happening in a different order.
- Something mutating i.highestDelivered during the loop, causing the later messages to be skipped. There are no awaits in this code, so there's no obvious reentrancy hazard: E() queues behavior for later, and D() causes an immediate blocking syscall.
- The syscall.callNow invoked by D(), as implemented by xsnap (it turns into a blocking read on the message pipe), is interacting badly with the setImmediate we use inside liveslots to sense end-of-crank, making liveslots think the crank is done early and losing the syscalls for the remaining four messages.
- Meter exhaustion ending the crank early, but deliver-results showed a normal crank result, not a meter exhaustion error.
- If num or i.highestDelivered weren't both BigInts, I could imagine a lexicographic (string) comparison breaking things, but the comparison should have been between 726 and 725, which is not e.g. 9 vs 10.

@FUDCo and I are copying the kernelDB from a chain run that hit this failure, and are attempting to replay the swingset transcript for the vat in question, to see if it does the same thing upon replay. If so, it suggests that, whatever is going on, it's at least a function of the state of that vat.
The particular failure we're investigating is in a run we've named bug3021-05may. It occurs in a deliverInboundMessages call to vattp which receives a batch of seqnums 9239+9240+9241, but only emits syscalls for 9239 before finishing.
The "in-vivo" replay I did (by just restarting the chain, after adding a log message to show when syscalls are being made during replay, and replacing the kernelBundle key of the DB so the addition would take effect) shows that the missing syscalls are also missing during replay. Yay! This suggests that we might be able to reproduce it "in-vitro": @FUDCo is working on that with swingset-runner. That'd be great, because restarting the chain with its current DB takes 18 minutes, and swingset-runner ought to be somewhat faster. Especially if we modify the DB a little more and make it forget about the vats we don't care about.
We're able to reproduce the problem in a somewhat-reduced test environment, where we copy the kernelDB from the failing chain, strip out all the vats except for vattp, and re-run it under swingset-runner. This takes about 2.5min to reproduce, whereas the original chain needed 12 hours of once-a-minute traffic to break.
The problem manifests at a simple forEach loop, working on an Array of either 1 or 3 entries (which came from marshal, so ultimately from JSON.parse), which, sometimes, exits early for no good reason.
The fun part is that the crank which exhibits the error changes as a function of the code that xsnap is given (supervisor, liveslots, and vat code). We updated our transcript replay code to sense not only whether a vat makes a syscall which differs from the ones recorded in the transcript, but also whether it fails to make one of the recorded ones. If we run the original code, we see no transcript mismatches, because the crank that showed the error managed to show exactly the same error in the replay. If we modify the code a bit, and the bug happens during some different (earlier) crank, then we see fewer syscalls than we expected, and the replay terminates early with an error.
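The replay check described here amounts to comparing the syscalls a replayed crank emits against the recorded transcript, flagging both divergent entries and a crank that stops early. A simplified sketch of that comparison (not the actual swingset replay code):

```javascript
// Simplified sketch of transcript syscall checking (not the real
// swingset code): compare each emitted syscall against the transcript,
// and flag a crank that ends before emitting all recorded syscalls.
function checkCrankSyscalls(recorded, emitted) {
  const n = Math.min(recorded.length, emitted.length);
  for (let idx = 0; idx < n; idx += 1) {
    if (JSON.stringify(recorded[idx]) !== JSON.stringify(emitted[idx])) {
      return { ok: false, reason: `syscall ${idx} differs from transcript` };
    }
  }
  if (emitted.length < recorded.length) {
    // The newly-added case: the vat stopped making syscalls early.
    return {
      ok: false,
      reason: `only ${emitted.length} of ${recorded.length} recorded syscalls emitted`,
    };
  }
  return { ok: true };
}
```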
The good news is that, from what I can tell, it's a deterministic function of that code, so at least the bug is stable, but it tends to move or go away if you add debugging code. It seems to be insensitive to whitespace.
Our current theory is some sort of memory allocation bug, maybe during GC. Maybe something is causing the short array to get truncated, or is corrupting the variable holding i.highestDelivered. It took a long time to manifest (12 hours, tens of thousands of messages), and happened in the one XS vat that performs the most work, which means there was more allocation (and more of everything) there.
The most concise way I have found for packaging up such things is a stand-alone Makefile that requires only that node and yarn are installed; for example:
https://gist.github.com/dckc/19451649697bb7cb231ab5a776fde0d1#file-makefile
p.s. This assumes the bug will show up on machines other than your own. We've seen bugs where the symptoms were hard to reproduce on another machine; the culprit seemed to be that paths were different between machines and the paths made their way into the runtime heap.
Yeah, that's where I'll aim. At @FUDCo's advice, I'm going to start big+fast (à la "I regret this letter is so long; I did not have the time to make it shorter"), and defer refinement/reduction until after we've delivered something sufficient for them to reproduce it.
When I tried on Monday to reproduce this, I got "anachrophobia strikes vat v14"; I learned that this means I did not manage to reproduce it.
@warner Moddable sent us a WeakMap GC fix; I checked that in as https://github.com/agoric-labs/moddable/commit/c533fb7f6031c0e79285fd51e89f3478aad3f8f1 (on a https://github.com/agoric-labs/moddable/tree/weak-gc-fix branch), and I created a https://github.com/Agoric/agoric-sdk/tree/3039-vattp-xs-gc branch that is master as of May 5, noted above (f052169), with the moddable submodule set to the GC fix.
Care to try it out? Or maybe I could figure out how...
Which weakmap gc problem is this supposed to fix?
> Which weakmap gc problem is this supposed to fix?
Er... The one where vattp sometimes drops messages, as detailed at length above.
In today's testnet meeting:
BW: we have a change that's supposed to fix this but we haven't verified it.
DC: right; I think it's fixed but I'm not certain.
BW: probably ok to release without further work on this.
Deprioritizing for now
@warner do I have access to the bug3021-05may logs? If not, please use the 1st key in https://github.com/dckc.keys to grant me access. And please sketch how to use swingset-runner to reproduce the problem "in vitro"?
Describe the bug
#3021 seems to be caused by vattp sometimes dropping some messages. We're investigating.