
vattp sometimes drops messages? #3039

Open warner opened 3 years ago

warner commented 3 years ago

Describe the bug

#3021 seems to be caused by vattp sometimes dropping some messages. We're investigating.

warner commented 3 years ago

On branch warner-phase3-loadgen (ac4cb3983dbd38354e048ca500deeffd4ce85274), which is trunk at f052169eedcdbaae017e62e9ea5a0a79c37026c7 plus some logging fixes (and with all non-comms vats running on XS), I saw the load generator halt when the chain stopped accepting messages. Looking further, the comms vat had rejected a message whose internal sequence number was not one greater than that of the previously received message. By examining the slogfile, we found the comms vat was handed the following sequence of deliveries:

$ cat comms-halt/chain-1.slog| \
 jq -c 'select(.type == "deliver" and .vatID == "v13" and .kd[0] == "message" and .kd[2].method == "receive") | .kd[2].args.body|fromjson | .[0]' | cut -c-500

"717:644:resolve:fulfill:rp-752:ro+237;{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}"
"718:648:deliver:ro+24:getId:rp-757:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"719:648:deliver:ro+71:deposit:rp-758:ro+240;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}]"
"720:648:deliver:ro+70:deposit:rp-759:ro+239;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}]"
"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]"
"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]"
"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]"
"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]"
"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1858053\"}}]"
"730:657:deliver:ro+24:getId:rp-771:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"731:657:deliver:ro+24:getId:rp-772:ro+61;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0}]"
"732:657:deliver:ro+70:getCurrentAmount:rp-773;[]"
"733:657:deliver:ro+77:getUpdateSince:rp-774;[48]"
"734:657:resolve:fulfill:rp-768:ro+241;{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}"
"735:657:resolve:reject:rp-767;{\"@qclass\":\"error\",\"errorId\":\"error:liveSlots:v2#70001\",\"message\":\"\\\"unknown\\\" connection closed\",\"name\":\"Error\"}"

The gap between 725 and 730 is the proximate cause: comms was expecting 726 but received 730. All further messages will be rejected until it sees 726.
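For context, comms requires each incoming message's leading (inner) sequence number to be exactly one greater than the last one it accepted. A minimal sketch of that kind of check, with hypothetical names rather than the actual comms code:

// Hypothetical sketch of an in-order check like the one comms performs;
// names and message parsing are illustrative, not the real comms code.
function makeInboundChecker(firstExpectedSeqNum) {
  let expected = firstExpectedSeqNum;
  return message => {
    // messages look like "726:653:deliver:..."; the leading field is the
    // sender-assigned sequence number for this direction
    const seqNum = Number(message.split(':')[0]);
    if (seqNum !== expected) {
      // e.g. expecting 726 but handed 730: reject this and everything after
      throw Error(`out of order: expected ${expected}, got ${seqNum}`);
    }
    expected += 1;
  };
}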

Looking at the vat-tp deliveries in the same slogfile, we see that it received deliverInboundMessages calls with batches of the following envelope (outer) sequence numbers:

$ cat comms-halt/chain-1.slog| \
 jq -c 'select(.type == "deliver" and .vatID == "v14" and .kd[0] == "message" and .kd[2].method == "deliverInboundMessages")' | \
 jq -c '.kd[2].args.body|fromjson | .[1] | map(.[0].digits)'

["709","710","711","712","713"]
["714","715","716","717"]
["718","719","720"]
["721","722","723","724","725","726","727","728","729"]
["730","731","732","733","734"]
["735"]

The crank which delivered messages 721-729 emitted the following syscalls:

$ cat comms-halt/chain-1.slog |jq -c 'select(.crankNum == 5863 and .type == "syscall") | .vsc'

["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"721\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"722\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"723\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"724\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"725\"}]","slots":[]}]
["send","o-50",{"method":"receive","args":{"body":"[\"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]\"]","slots":[]},"result":"p+725"}]
["subscribe","p+725"]
["send","o-50",{"method":"receive","args":{"body":"[\"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]\"]","slots":[]},"result":"p+726"}]
["subscribe","p+726"]
["send","o-50",{"method":"receive","args":{"body":"[\"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]\"]","slots":[]},"result":"p+727"}]
["subscribe","p+727"]
["send","o-50",{"method":"receive","args":{"body":"[\"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]\"]","slots":[]},"result":"p+728"}]
["subscribe","p+728"]
["send","o-50",{"method":"receive","args":{"body":"[\"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\\\"brand\\\":{\\\"@qclass\\\":\\\"slot\\\",\\\"iface\\\":\\\"Alleged: RUN brand\\\",\\\"index\\\":0},\\\"value\\\":{\\\"@qclass\\\":\\\"bigint\\\",\\\"digits\\\":\\\"1858053\\\"}}]\"]","slots":[]},"result":"p+729"}]
["subscribe","p+729"]

The code which executes this crank is:

https://github.com/Agoric/agoric-sdk/blob/f052169eedcdbaae017e62e9ea5a0a79c37026c7/packages/SwingSet/src/vats/vat-tp.js#L170-L180

The syscalls suggest that the loop was interrupted somewhere after the D(mailbox).ackInbound on the loop pass for message number 725, and before the E(i.receiver) on the following pass for message number 726. If it had made it past that E(i.receiver) we'd expect to see a syscall.send() for 726 (although if the worker were somehow interrupted, or the crank ended early, that syscall might not make it out). If it had made it even further, to the D(mailbox) for 726, we'd expect to see the additional callNow, because these are executed right away (whereas E() schedules a syscall.send on some subsequent promise-queue turn).
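For reference, here is a rough paraphrase of the loop at that link, using the names that appear in this thread; it is a sketch, not a verbatim copy of the file. Each pass queues the delivery to comms with E(), records the new high-water mark, and then immediately acks the envelope with D():

// Rough paraphrase of the deliverInboundMessages loop linked above; E, D,
// mailbox, and getRemote come from vat-tp's own scope, and details may
// differ from the real file.
function deliverInboundMessages(name, newMessages) {
  const i = getRemote(name); // per-remote record: { receiver, highestDelivered, ... }
  newMessages.forEach(([num, body]) => {
    if (num > i.highestDelivered) {
      E(i.receiver).receive(body); // queued; syscall.send shows up on a later turn
      i.highestDelivered = num;
      D(mailbox).ackInbound(name, num); // device call; syscall.callNow shows up immediately
    }
  });
}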

@FUDCo analyzed a similar trace in which deliverInboundMessages was given a batch of size 1 and failed to emit any syscalls. This suggests the fault happened before the E() call rather than after the D().

We're struggling to imagine something that could cause this. We've brainstormed:

warner commented 3 years ago

@FUDCo and I are copying the kernelDB from a chain run that hit this failure, and are attempting to replay the swingset transcript for the vat in question, to see if it does the same thing upon replay. If so, it suggests that, whatever is going on, it's at least a function of the state of that vat.

The particular failure we're investigating is in a run we've named bug3021-05may. It occurs in a deliverInboundMessages call to vattp which receives a batch of seqnums 9239+9240+9241, but only emits syscalls for 9239 before finishing.

The "in-vivo" replace I did (by just restarting the chain, after adding a log message to show when syscalls are being made during replay, and replacing the kernelBundle key of the DB so the addition would take effect) shows that the missing syscalls are also missing during replay. Yay! This suggests that we might be able to reproduce it "in-vitro": @FUDCo is working on that with swingset-runner. That'd be great, because restarting the chain with it's current DB takes 18 minutes, and swingset-runner ought to be somewhat faster. Especially if we modify the DB a little more and make it forget about the vats we don't care about.

warner commented 3 years ago

We're able to reproduce the problem in a somewhat-reduced test environment, where we copy the kernelDB from the failing chain, strip out all the vats except for vattp, and re-run it under swingset-runner. This takes about 2.5min to reproduce, whereas the original chain needed 12 hours of once-a-minute traffic to break.

The problem manifests in a simple forEach loop over an Array of either 1 or 3 entries (which came from marshal, so ultimately from JSON.parse); the loop sometimes exits early for no good reason.

The fun part is that the crank which exhibits the error changes as a function of the code that xsnap is given (supervisor, liveslots, and vat code). We updated our transcript-replay code to sense not only when a vat makes a syscall which differs from the ones recorded in the transcript, but also when it fails to make one of the recorded ones. If we run the original code, we see no transcript mismatches, because the crank that showed the error shows exactly the same error in the replay. If we modify the code a bit, and the bug happens during some different (earlier) crank, then we see fewer syscalls than expected, and the replay terminates early with an error.
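A minimal sketch of the kind of per-crank comparison that check performs (a hypothetical helper, not the actual SwingSet transcript-replay code):

// Hypothetical sketch: compare the syscalls a replayed crank actually makes
// against the ones recorded in the transcript, flagging both mismatches and
// syscalls that the replay failed to make at all.
function compareCrankSyscalls(recordedSyscalls, replayedSyscalls) {
  const max = Math.max(recordedSyscalls.length, replayedSyscalls.length);
  for (let i = 0; i < max; i += 1) {
    const recorded = JSON.stringify(recordedSyscalls[i]);
    const replayed = JSON.stringify(replayedSyscalls[i]);
    if (replayed === undefined) {
      // the replayed vat stopped early: fewer syscalls than the transcript recorded
      throw Error(`crank made too few syscalls: missing #${i} (${recorded})`);
    }
    if (recorded !== replayed) {
      throw Error(`syscall #${i} mismatch: expected ${recorded}, got ${replayed}`);
    }
  }
}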

The good news is that, from what I can tell, it's a deterministic function of that code, so at least the bug is stable, but it tends to move or go away if you add debugging code. It seems to be insensitive to whitespace.

Our current theory is some sort of memory-allocation bug, maybe during GC: something that causes the short array to get truncated, or corrupts the variable holding i.highestDelivered. It took a long time to manifest (12 hours, tens of thousands of messages), and it happened in the one XS vat that performs the most work, which means there was more allocation (and more of everything) there.

dckc commented 3 years ago

The most concise way I have found for packaging up such things is a stand-alone Makefile that requires only that node and yarn are installed; for example:

https://gist.github.com/dckc/19451649697bb7cb231ab5a776fde0d1#file-makefile

p.s. This assumes the bug will show up on machines other than your own. We've seen bugs where the symptoms were hard to reproduce on another machine; the culprit seemed to be that paths were different between machines and the paths made their way into the runtime heap.

warner commented 3 years ago

Yeah, that's where I'll aim. On @FUDCo's advice, I'm going to start big+fast (à la "I regret this letter is so long; I did not have the time to make it shorter"), and defer refinement/reduction until after we've delivered something sufficient for them to reproduce it.

dckc commented 3 years ago

When I tried on Monday to reproduce this, I got "anachrophobia strikes vat v14"; I learned that this means I did not manage to reproduce it.

dckc commented 3 years ago

@warner Moddable sent us a WeakMap GC fix; I checked that in as https://github.com/agoric-labs/moddable/commit/c533fb7f6031c0e79285fd51e89f3478aad3f8f1 (on the https://github.com/agoric-labs/moddable/tree/weak-gc-fix branch), and I created a https://github.com/Agoric/agoric-sdk/tree/3039-vattp-xs-gc branch that is master as of May 5, noted above (f052169), with the moddable submodule set to the GC fix.

Care to try it out? Or maybe I could figure out how...

erights commented 3 years ago

Which weakmap gc problem is this supposed to fix?

dckc commented 3 years ago

> Which weakmap gc problem is this supposed to fix?

Er... The one where vattp sometimes drops messages, as detailed at length above.

rowgraus commented 3 years ago

In today's testnet meeting:

BW: we have a change that's supposed to fix this but we haven't verified it.
DC: right; I think it's fixed but I'm not certain.
BW: probably ok to release without further work on this.

Deprioritizing for now

dckc commented 3 years ago

@warner do I have access to the bug3021-05may logs? If not, please use the 1st key in https://github.com/dckc.keys to grant me access?

And please sketch how to use swingset-runner to reproduce the problem "in vitro"?