Closed zenhack closed 2 years ago
Observations:
TestRecvCancel
only ever hangs when run in parallel with other tests. It passes consistently if you either do go test -run TestRecvCancel
or delete the t.Parallel()
at the top of the function. I don't yet understand why that would happen; it doesn't look like there are any shared resources with other tests that might interfere, and there's no timeouts or anything...It seems likely that some of these are race conditions that are sufficiently timing dependent that the noise from other tests is needed to trigger them.
Also, I managed to trigger a failure other than the above:
--- FAIL: TestBootstrapReceiverAnswerRpc (0.00s)
panic: test.capnp:CapArgsTest.call: rpc: connection closed [recovered]
panic: test.capnp:CapArgsTest.call: rpc: connection closed
goroutine 27 [running]:
testing.tRunner.func1.2({0x667820, 0xc000586060})
/usr/lib/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
/usr/lib/go/src/testing/testing.go:1399 +0x39f
panic({0x667820, 0xc000586060})
/usr/lib/go/src/runtime/panic.go:884 +0x212
capnproto.org/go/capnp/v3/rpc.chkfatal(...)
/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:51
capnproto.org/go/capnp/v3/rpc.TestBootstrapReceiverAnswerRpc(0xc0000eed00)
/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:87 +0x4f3
testing.tRunner(0xc0000eed00, 0x6ca630)
/usr/lib/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
/usr/lib/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL capnproto.org/go/capnp/v3/rpc 0.008s
TestHandleReturn_regression
sometimes also fails (without hanging), with:
--- FAIL: TestHandleReturn_regression (0.00s)
--- FAIL: TestHandleReturn_regression/BootstrapWithExpiredContext (0.00s)
level0_test.go:1820:
Error Trace: level0_test.go:1820
Error: Target error should be in err chain:
expected: "context canceled"
in chain: "test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
"test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
"rpc: send message: rpc: connection closed"
"send message: rpc: connection closed"
"rpc: connection closed"
"connection closed"
Test: TestHandleReturn_regression/BootstrapWithExpiredContext
FAIL
exit status 1
FAIL capnproto.org/go/capnp/v3/rpc 0.107s
When it hangs instead, it's on line 1816, which is a call to .Struct() on a result.
@lthibault, #320 fixes both failure modes for TestHandleReturn_regression
.
A trace for the TestBootstrapReceiverAnswerRpc
is below. It looks like the client is sending a finish message for the bootstrap it's done with, and then getting confused when it finally gets the return message because it's forgotten about the question entirely. Probably there's somewhere where we're being too aggressive about removing stuff from the questions table and need to leave some record of this. I think we should only ever remove stuff from the table in two places:
I'll pick this back up tomorrow, and start by auditing places where we remove stuff from questions.
( "Send"
, Message
{ union' =
Message'bootstrap
Bootstrap { questionId = 0 , deprecatedObjectId = Nothing }
}
)
( "Send"
, Message
{ union' =
Message'call
Call
{ questionId = 1
, target =
MessageTarget
{ union' =
MessageTarget'promisedAnswer
PromisedAnswer { questionId = 0 , transform = [] }
}
, interfaceId = 13288942172782596300
, methodId = 0
, params =
Payload
{ content = Nothing
, capTable =
[ CapDescriptor
{ attachedFd = 255
, union' =
CapDescriptor'receiverAnswer
PromisedAnswer { questionId = 0 , transform = [] }
}
]
}
, sendResultsTo =
Call'sendResultsTo' { union' = Call'sendResultsTo'caller }
, allowThirdPartyTailCall = False
}
}
)
( "Send"
, Message
{ union' =
Message'finish Finish { questionId = 0 , releaseResultCaps = True }
}
)
( "Recv"
, Message
{ union' =
Message'return
Return
{ answerId = 0
, releaseParamCaps = False
, union' =
Return'results
Payload
{ content = Nothing
, capTable =
[ CapDescriptor
{ attachedFd = 255 , union' = CapDescriptor'senderHosted 0 }
]
}
}
}
)
( "Send"
, Message
{ union' =
Message'abort
Exception
{ reason = "rpc: incoming return: question 0 does not exist"
, obsoleteIsCallersFault = False
, obsoleteDurability = 0
, type_ = Exception'Type'failed
}
}
)
TestSendCancel
failed in CI: https://github.com/capnproto/go-capnproto2/actions/runs/3238937452/jobs/5307722709#step:5:1483
...I think I've seen that before, so we should investigate that here as well.
TestBootstrapReceiverAnswerRpc
.TestRecvCancel
. I'll tackle the rest tomorrow.TestRecvAbort
.Closing, since all the relevant prs have been merged.
There are a few tests in the RPC package that usually pass, but fail intermittently:
TestRecvCancel
(hangs)TestSendCancel
(maybe; see https://github.com/capnproto/go-capnproto2/issues/318#issuecomment-1276920615)TestHandleReturn_regression
(Sometimes hangs, sometimes actually fails, see https://github.com/capnproto/go-capnproto2/issues/318#issuecomment-1276866593 )TestBootstrapReceiverAnswerRpc
: see https://github.com/capnproto/go-capnproto2/issues/318#issuecomment-1276839048TestRecvAbort
:There may be one or two more. These crop up running the tests locally often enough to be easily reproducible but not every time, so at some point they managed to slip past CI and get merged. We should fix these; I don't want to keep having to wonder when working on something else whether I've broken something or a test is just flaky; this slows down development across the board.