capnproto / go-capnp

Cap'n Proto library and code generator for Go
https://capnproto.org
Other
1.23k stars 111 forks source link

Flaky RPC tests #318

Closed zenhack closed 2 years ago

zenhack commented 2 years ago

There are a few tests in the RPC package that usually pass, but fail intermittently:

level0_test.go:148: 
            Error Trace:    level0_test.go:148
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestRecvAbort
            Messages:       should resolve bootstrap capability

There may be one or two more. These crop up running the tests locally often enough to be easily reproducible but not every time, so at some point they managed to slip past CI and get merged. We should fix these; I don't want to keep having to wonder when working on something else whether I've broken something or a test is just flaky; this slows down development across the board.

zenhack commented 2 years ago

Observations:

It seems likely that some of these are race conditions that are sufficiently timing dependent that the noise from other tests is needed to trigger them.


Also, I managed to trigger a failure other than the above:

--- FAIL: TestBootstrapReceiverAnswerRpc (0.00s)
panic: test.capnp:CapArgsTest.call: rpc: connection closed [recovered]
    panic: test.capnp:CapArgsTest.call: rpc: connection closed

goroutine 27 [running]:
testing.tRunner.func1.2({0x667820, 0xc000586060})
    /usr/lib/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
    /usr/lib/go/src/testing/testing.go:1399 +0x39f
panic({0x667820, 0xc000586060})
    /usr/lib/go/src/runtime/panic.go:884 +0x212
capnproto.org/go/capnp/v3/rpc.chkfatal(...)
    /home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:51
capnproto.org/go/capnp/v3/rpc.TestBootstrapReceiverAnswerRpc(0xc0000eed00)
    /home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:87 +0x4f3
testing.tRunner(0xc0000eed00, 0x6ca630)
    /usr/lib/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
    /usr/lib/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL    capnproto.org/go/capnp/v3/rpc   0.008s
zenhack commented 2 years ago

TestHandleReturn_regression sometimes also fails (without hanging), with:

--- FAIL: TestHandleReturn_regression (0.00s)
    --- FAIL: TestHandleReturn_regression/BootstrapWithExpiredContext (0.00s)
        level0_test.go:1820: 
                Error Trace:    level0_test.go:1820
                Error:          Target error should be in err chain:
                                expected: "context canceled"
                                in chain: "test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
                                    "test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
                                    "rpc: send message: rpc: connection closed"
                                    "send message: rpc: connection closed"
                                    "rpc: connection closed"
                                    "connection closed"
                Test:           TestHandleReturn_regression/BootstrapWithExpiredContext
FAIL
exit status 1
FAIL    capnproto.org/go/capnp/v3/rpc   0.107s

When it hangs instead, it's on line 1816, which is a call to .Struct() on a result.

zenhack commented 2 years ago

@lthibault, #320 fixes both failure modes for TestHandleReturn_regression.

zenhack commented 2 years ago

A trace for the TestBootstrapReceiverAnswerRpc is below. It looks like the client is sending a finish message for the bootstrap it's done with, and then getting confused when it finally gets the return message because it's forgotten about the question entirely. Probably there's somewhere where we're being too aggressive about removing stuff from the questions table and need to leave some record of this. I think we should only ever remove stuff from the table in two places:

  1. handleReturn
  2. when sending the initial question fails (and thus we won't receive a return)

I'll pick this back up tomorrow, and start by auditing places where we remove stuff from questions.

( "Send"
, Message
    { union' =
        Message'bootstrap
          Bootstrap { questionId = 0 , deprecatedObjectId = Nothing }
    }
)
( "Send"
, Message
    { union' =
        Message'call
          Call
            { questionId = 1
            , target =
                MessageTarget
                  { union' =
                      MessageTarget'promisedAnswer
                        PromisedAnswer { questionId = 0 , transform = [] }
                  }
            , interfaceId = 13288942172782596300
            , methodId = 0
            , params =
                Payload
                  { content = Nothing
                  , capTable =
                      [ CapDescriptor
                          { attachedFd = 255
                          , union' =
                              CapDescriptor'receiverAnswer
                                PromisedAnswer { questionId = 0 , transform = [] }
                          }
                      ]
                  }
            , sendResultsTo =
                Call'sendResultsTo' { union' = Call'sendResultsTo'caller }
            , allowThirdPartyTailCall = False
            }
    }
)
( "Send"
, Message
    { union' =
        Message'finish Finish { questionId = 0 , releaseResultCaps = True }
    }
)
( "Recv"
, Message
    { union' =
        Message'return
          Return
            { answerId = 0
            , releaseParamCaps = False
            , union' =
                Return'results
                  Payload
                    { content = Nothing
                    , capTable =
                        [ CapDescriptor
                            { attachedFd = 255 , union' = CapDescriptor'senderHosted 0 }
                        ]
                    }
            }
    }
)
( "Send"
, Message
    { union' =
        Message'abort
          Exception
            { reason = "rpc: incoming return: question 0 does not exist"
            , obsoleteIsCallersFault = False
            , obsoleteDurability = 0
            , type_ = Exception'Type'failed
            }
    }
)
zenhack commented 2 years ago

TestSendCancel failed in CI: https://github.com/capnproto/go-capnproto2/actions/runs/3238937452/jobs/5307722709#step:5:1483

...I think I've seen that before, so we should investigate that here as well.

zenhack commented 2 years ago

322 fixes TestBootstrapReceiverAnswerRpc.

zenhack commented 2 years ago

324 fixes TestRecvCancel. I'll tackle the rest tomorrow.

zenhack commented 2 years ago

327 fixes TestRecvAbort.

zenhack commented 2 years ago

Closing, since all the relevant prs have been merged.