cashapp / misk

Microservice Kontainer
https://cashapp.github.io/misk/
Apache License 2.0
397 stars 169 forks source link

GrpcClientProviderTest is flaky #1853

Open swankjesse opened 3 years ago

swankjesse commented 3 years ago
GrpcClientProviderTest > happyPath() FAILED
    java.io.IOException: gRPC transport failure (HTTP status=200, grpc-status=null, grpc-message=null)
        at com.squareup.wire.internal.GrpcKt.grpcResponseToException(grpc.kt:166)
        at com.squareup.wire.internal.RealGrpcCall.readExactlyOneAndClose(RealGrpcCall.kt:104)
        at com.squareup.wire.internal.RealGrpcCall.executeBlocking(RealGrpcCall.kt:77)
        at misk.client.GrpcClientProviderTest.happyPath(GrpcClientProviderTest.kt:80)

        Caused by:
        okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL
            at okhttp3.internal.http2.Http2Stream$FramingSource.read(Http2Stream.kt:358)
            at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.kt:276)
            at okio.RealBufferedSource.exhausted(RealBufferedSource.kt:197)
            at com.squareup.wire.internal.GrpcMessageSource.read(GrpcMessageSource.kt:37)
            at com.squareup.wire.internal.GrpcMessageSource.readExactlyOneAndClose(GrpcMessageSource.kt:66)
            at com.squareup.wire.internal.RealGrpcCall.readExactlyOneAndClose(RealGrpcCall.kt:102)
            ... 2 more
carterhudson commented 3 years ago

I'm encountering this in a real environment; any ideas what could be causing it? My current theory is that it's a keep-alive configuration issue.

java.io.IOException: gRPC transport failure (HTTP status=200, grpc-status=null, grpc-message=null)

r3mariano commented 3 years ago

Some more logs from the CI build with extra okhttp logging added:

INFO: [0 ms] callStart: Request{method=POST, url=https://127.0.0.1:39961/test/GetDestinationWarehouse, headers=[te:trailers, grpc-trace-bin:, grpc-accept-encoding:gzip, grpc-encoding:gzip], tags={class com.squareup.wire.GrpcMethod=com.squareup.wire.GrpcMethod@20376e8d}}
INFO: [0 ms] proxySelectStart: https://127.0.0.1:39961/
INFO: [0 ms] proxySelectEnd: [DIRECT]
INFO: [0 ms] dnsStart: 127.0.0.1
INFO: [0 ms] dnsEnd: [/127.0.0.1]
INFO: [0 ms] connectStart: /127.0.0.1:39961 DIRECT
INFO: [1 ms] secureConnectStart
INFO: [17 ms] secureConnectEnd: Handshake{tlsVersion=TLS_1_2 cipherSuite=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 peerCertificates=[CN=misk-server, OU=Server, O=Misk, L=San Francisco, ST=CA, C=US] localCertificates=[]}
INFO: [23 ms] connectEnd: h2
INFO: [23 ms] connectionAcquired: Connection{127.0.0.1:39961, proxy=DIRECT hostAddress=/127.0.0.1:39961 cipherSuite=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 protocol=h2}
INFO: [24 ms] requestHeadersStart
INFO: [24 ms] requestHeadersEnd
INFO: [24 ms] requestBodyStart
INFO: [27 ms] requestBodyEnd: byteCount=30
INFO: [27 ms] responseHeadersStart
INFO: [27 ms] responseHeadersEnd: Response{protocol=h2, code=200, message=, url=https://127.0.0.1:39961/test/GetDestinationWarehouse}
INFO: [27 ms] responseBodyStart
INFO: [27 ms] responseFailed: okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL
INFO: [30 ms] connectionReleased
INFO: [30 ms] callFailed: okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL
INFO: [0 ms] callStart: Request{method=POST, url=http://127.0.0.1:43027/RobotLocator/SayHello, headers=[te:trailers, grpc-trace-bin:, grpc-accept-encoding:gzip, grpc-encoding:gzip], tags={class com.squareup.wire.GrpcMethod=com.squareup.wire.GrpcMethod@2421a0a5}}
INFO: [5 ms] proxySelectStart: http://127.0.0.1:43027/
INFO: [10 ms] proxySelectEnd: [DIRECT]
INFO: [11 ms] dnsStart: 127.0.0.1
INFO: [11 ms] dnsEnd: [/127.0.0.1]
INFO: [12 ms] connectStart: /127.0.0.1:43027 DIRECT
[2021-05-05 05:37:30] >> CONNECTION 505249202a20485454502f322e300d0a0d0a534d0d0a0d0a 
[2021-05-05 05:37:30] >> 0x00000000     6 SETTINGS       
[2021-05-05 05:37:30] >> 0x00000000     4 WINDOW_UPDATE  
[2021-05-05 05:37:30] Q10035 scheduled after   0 µs: OkHttp 127.0.0.1 
[2021-05-05 05:37:30] Q10035 starting              : OkHttp 127.0.0.1 
INFO: [14 ms] connectEnd: h2_prior_knowledge
[2021-05-05 05:37:30] Q10029 scheduled after   0 µs: OkHttp ConnectionPool 
[2021-05-05 05:37:30] Q10029 starting              : OkHttp ConnectionPool 
[2021-05-05 05:37:30] Q10029 run again after 300 s : OkHttp ConnectionPool 
[2021-05-05 05:37:30] Q10029 finished run in 489 µs: OkHttp ConnectionPool 
INFO: [18 ms] connectionAcquired: Connection{127.0.0.1:43027, proxy=DIRECT hostAddress=/127.0.0.1:43027 cipherSuite=none protocol=h2_prior_knowledge}
INFO: [19 ms] requestHeadersStart
[2021-05-05 05:37:30] >> 0x00000003   122 HEADERS       END_HEADERS 
INFO: [20 ms] requestHeadersEnd
[2021-05-05 05:37:30] << 0x00000000    24 SETTINGS       
INFO: [21 ms] requestBodyStart
[2021-05-05 05:37:30] Q10032 scheduled after   0 µs: OkHttp 127.0.0.1 applyAndAckSettings 
[2021-05-05 05:37:30] << 0x00000000     4 WINDOW_UPDATE  
[2021-05-05 05:37:30] Q10032 starting              : OkHttp 127.0.0.1 applyAndAckSettings 
[2021-05-05 05:37:30] << 0x00000000     0 SETTINGS      ACK 
[2021-05-05 05:37:30] Q10034 scheduled after   0 µs: OkHttp 127.0.0.1 onSettings 
[2021-05-05 05:37:30] >> 0x00000000     0 SETTINGS      ACK 
[2021-05-05 05:37:30] Q10034 starting              : OkHttp 127.0.0.1 onSettings 
[2021-05-05 05:37:30] Q10034 finished run in  84 µs: OkHttp 127.0.0.1 onSettings 
[2021-05-05 05:37:30] Q10032 finished run in   5 ms: OkHttp 127.0.0.1 applyAndAckSettings 
[2021-05-05 05:37:30] >> 0x00000003    31 DATA           
[2021-05-05 05:37:30] << 0x00000003    77 HEADERS       END_HEADERS 
[2021-05-05 05:37:30] << 0x00000003    16 DATA           
[2021-05-05 05:37:30] << 0x00000003    12 HEADERS       END_STREAM|END_HEADERS 
[2021-05-05 05:37:30] << 0x00000003     4 RST_STREAM     
INFO: [41 ms] requestBodyEnd: byteCount=31
INFO: [41 ms] responseHeadersStart
INFO: [42 ms] responseHeadersEnd: Response{protocol=h2_prior_knowledge, code=200, message=, url=http://127.0.0.1:43027/RobotLocator/SayHello}
INFO: [46 ms] responseBodyStart
INFO: [51 ms] responseFailed: okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL
[2021-05-05 05:37:30] Q10029 canceled              : OkHttp ConnectionPool 
[2021-05-05 05:37:30] >> 0x00000000     8 GOAWAY         
[2021-05-05 05:37:30] Q10035 finished run in  44 ms: OkHttp 127.0.0.1 
INFO: [59 ms] connectionReleased
INFO: [59 ms] callFailed: okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL

I don't see anything useful here, unfortunately, but I'll put the logs in anyway.

ewolak-sq commented 3 years ago

I looked into this error a bit, and when the tests flake, it's because the server explicitly cancelled the stream. This is happening in Http2Stream#cancelStreamIfNecessary(). Sometimes it thinks it needs to cancel the stream, but when it does, the client gets an unexpected CANCEL and fails the test. The cancel happens because source.finished == false but source.closed == true. When the tests pass, we get source.finished == false && source.closed == false then source.finished == true && source.closed == true, neither of which triggers the CANCEL.

My current theory is that there's a race between FramingSource#close() and Http2Stream#receiveHeaders(). If I so much as add a non-suspending breakpoint to log some state at the top of FramingSource#close(), then it loses the race and everything is fine. In the failure case, #cancelStreamIfNecessary() executes before receiveHeaders(), so closed == true but finished == false and then we try to CANCEL.

This is super-subtle, and I'm getting tired, so I'll revisit later in the week.

Success:

GrpcConnectivityTest > serviceThrowsException() STARTED
Breakpoint reached at okhttp3.internal.http2.Http2Stream.cancelStreamIfNecessary$okhttp(Http2Stream.kt:502)
cancel=false source.finished=false source.closed=false source=okhttp3.internal.http2.Http2Stream$FramingSource@67b5e90e
Breakpoint reached at okhttp3.internal.http2.Http2Stream$FramingSource.close(Http2Stream.kt:478)
source=okhttp3.internal.http2.Http2Stream$FramingSource@67b5e90e
Breakpoint reached at okhttp3.internal.http2.Http2Stream.receiveHeaders(Http2Stream.kt:295)
inFinished=true source=okhttp3.internal.http2.Http2Stream$FramingSource@67b5e90e
Breakpoint reached at okhttp3.internal.http2.Http2Stream.cancelStreamIfNecessary$okhttp(Http2Stream.kt:502)
cancel=false source.finished=true source.closed=true source=okhttp3.internal.http2.Http2Stream$FramingSource@67b5e90e
Disconnected from the target VM, address: 'localhost:55361', transport: 'socket'
GrpcConnectivityTest > serviceThrowsException() PASSED

Failure:

GrpcConnectivityTest > serviceThrowsException() STARTED
Breakpoint reached at okhttp3.internal.http2.Http2Stream.cancelStreamIfNecessary$okhttp(Http2Stream.kt:502)
cancel=false source.finished=false source.closed=false source=okhttp3.internal.http2.Http2Stream$FramingSource@690cfa45
Breakpoint reached at okhttp3.internal.http2.Http2Stream.receiveHeaders(Http2Stream.kt:295)
inFinished=true source=okhttp3.internal.http2.Http2Stream$FramingSource@690cfa45
Breakpoint reached at okhttp3.internal.http2.Http2Stream.cancelStreamIfNecessary$okhttp(Http2Stream.kt:502)
cancel=true source.finished=false source.closed=true source=okhttp3.internal.http2.Http2Stream$FramingSource@690cfa45
Disconnected from the target VM, address: 'localhost:55713', transport: 'socket'
GrpcConnectivityTest > serviceThrowsException() FAILED

7f4249625960b7f10dc61aae4d1896c2e126f538 introduced an optimization that may be connected to this issue.

Stack trace:

close:240, Http2Stream (okhttp3.internal.http2)
cancelStreamIfNecessary$okhttp:506, Http2Stream (okhttp3.internal.http2)
close:488, Http2Stream$FramingSource (okhttp3.internal.http2)
close:34, ForwardingSource (okio)
close:309, Exchange$ResponseBodySource (okhttp3.internal.connection)
close:392, RealBufferedSource (okio)
closeQuietly:491, Util (okhttp3.internal)
close:192, ResponseBody (okhttp3)
serviceThrowsException:125, GrpcConnectivityTest (misk.grpc)
invoke0:-1, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:62, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:566, Method (java.lang.reflect)
invokeMethod:686, ReflectionUtils (org.junit.platform.commons.util)
proceed:60, MethodInvocation (org.junit.jupiter.engine.execution)
proceed:131, InvocationInterceptorChain$ValidatingInvocation (org.junit.jupiter.engine.execution)
intercept:149, TimeoutExtension (org.junit.jupiter.engine.extension)
interceptTestableMethod:140, TimeoutExtension (org.junit.jupiter.engine.extension)
interceptTestMethod:84, TimeoutExtension (org.junit.jupiter.engine.extension)
apply:-1, 134748208 (org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor$$Lambda$157)
lambda$ofVoidMethod$0:115, ExecutableInvoker$ReflectiveInterceptorCall (org.junit.jupiter.engine.execution)
apply:-1, 302813371 (org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall$$Lambda$158)
lambda$invoke$0:105, ExecutableInvoker (org.junit.jupiter.engine.execution)
apply:-1, 1743561709 (org.junit.jupiter.engine.execution.ExecutableInvoker$$Lambda$298)
proceed:106, InvocationInterceptorChain$InterceptedInvocation (org.junit.jupiter.engine.execution)
proceed:64, InvocationInterceptorChain (org.junit.jupiter.engine.execution)
chainAndInvoke:45, InvocationInterceptorChain (org.junit.jupiter.engine.execution)
invoke:37, InvocationInterceptorChain (org.junit.jupiter.engine.execution)
invoke:104, ExecutableInvoker (org.junit.jupiter.engine.execution)
invoke:98, ExecutableInvoker (org.junit.jupiter.engine.execution)
lambda$invokeTestMethod$6:205, TestMethodTestDescriptor (org.junit.jupiter.engine.descriptor)
execute:-1, 224454165 (org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor$$Lambda$444)
execute:73, ThrowableCollector (org.junit.platform.engine.support.hierarchical)
invokeTestMethod:201, TestMethodTestDescriptor (org.junit.jupiter.engine.descriptor)
execute:137, TestMethodTestDescriptor (org.junit.jupiter.engine.descriptor)
execute:71, TestMethodTestDescriptor (org.junit.jupiter.engine.descriptor)
lambda$executeRecursively$5:135, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:-1, 1332801233 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$230)
execute:73, ThrowableCollector (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$7:125, NodeTestTask (org.junit.platform.engine.support.hierarchical)
invoke:-1, 1146843363 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$229)
around:135, Node (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$8:123, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:-1, 270036154 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$228)
execute:73, ThrowableCollector (org.junit.platform.engine.support.hierarchical)
executeRecursively:122, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:80, NodeTestTask (org.junit.platform.engine.support.hierarchical)
accept:-1, 756519367 (org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService$$Lambda$234)
forEach:1540, ArrayList (java.util)
invokeAll:38, SameThreadHierarchicalTestExecutorService (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$5:139, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:-1, 1332801233 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$230)
execute:73, ThrowableCollector (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$7:125, NodeTestTask (org.junit.platform.engine.support.hierarchical)
invoke:-1, 1146843363 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$229)
around:135, Node (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$8:123, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:-1, 270036154 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$228)
execute:73, ThrowableCollector (org.junit.platform.engine.support.hierarchical)
executeRecursively:122, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:80, NodeTestTask (org.junit.platform.engine.support.hierarchical)
accept:-1, 756519367 (org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService$$Lambda$234)
forEach:1540, ArrayList (java.util)
invokeAll:38, SameThreadHierarchicalTestExecutorService (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$5:139, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:-1, 1332801233 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$230)
execute:73, ThrowableCollector (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$7:125, NodeTestTask (org.junit.platform.engine.support.hierarchical)
invoke:-1, 1146843363 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$229)
around:135, Node (org.junit.platform.engine.support.hierarchical)
lambda$executeRecursively$8:123, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:-1, 270036154 (org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$228)
execute:73, ThrowableCollector (org.junit.platform.engine.support.hierarchical)
executeRecursively:122, NodeTestTask (org.junit.platform.engine.support.hierarchical)
execute:80, NodeTestTask (org.junit.platform.engine.support.hierarchical)
submit:32, SameThreadHierarchicalTestExecutorService (org.junit.platform.engine.support.hierarchical)
execute:57, HierarchicalTestExecutor (org.junit.platform.engine.support.hierarchical)
execute:51, HierarchicalTestEngine (org.junit.platform.engine.support.hierarchical)
execute:108, EngineExecutionOrchestrator (org.junit.platform.launcher.core)
execute:88, EngineExecutionOrchestrator (org.junit.platform.launcher.core)
lambda$execute$0:54, EngineExecutionOrchestrator (org.junit.platform.launcher.core)
accept:-1, 430960721 (org.junit.platform.launcher.core.EngineExecutionOrchestrator$$Lambda$194)
withInterceptedStreams:67, EngineExecutionOrchestrator (org.junit.platform.launcher.core)
execute:52, EngineExecutionOrchestrator (org.junit.platform.launcher.core)
execute:96, DefaultLauncher (org.junit.platform.launcher.core)
execute:75, DefaultLauncher (org.junit.platform.launcher.core)
processAllTestClasses:99, JUnitPlatformTestClassProcessor$CollectAllTestClassesExecutor (org.gradle.api.internal.tasks.testing.junitplatform)
access$000:79, JUnitPlatformTestClassProcessor$CollectAllTestClassesExecutor (org.gradle.api.internal.tasks.testing.junitplatform)
stop:75, JUnitPlatformTestClassProcessor (org.gradle.api.internal.tasks.testing.junitplatform)
stop:61, SuiteTestClassProcessor (org.gradle.api.internal.tasks.testing)
invoke0:-1, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:62, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:566, Method (java.lang.reflect)
dispatch:36, ReflectionDispatch (org.gradle.internal.dispatch)
dispatch:24, ReflectionDispatch (org.gradle.internal.dispatch)
dispatch:33, ContextClassLoaderDispatch (org.gradle.internal.dispatch)
invoke:94, ProxyDispatchAdapter$DispatchingInvocationHandler (org.gradle.internal.dispatch)
stop:-1, $Proxy2 (com.sun.proxy)
stop:133, TestWorker (org.gradle.api.internal.tasks.testing.worker)
invoke0:-1, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:62, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:566, Method (java.lang.reflect)
dispatch:36, ReflectionDispatch (org.gradle.internal.dispatch)
dispatch:24, ReflectionDispatch (org.gradle.internal.dispatch)
dispatch:182, MessageHubBackedObjectConnection$DispatchWrapper (org.gradle.internal.remote.internal.hub)
dispatch:164, MessageHubBackedObjectConnection$DispatchWrapper (org.gradle.internal.remote.internal.hub)
run:414, MessageHub$Handler (org.gradle.internal.remote.internal.hub)
onExecute:64, ExecutorPolicy$CatchAndRecordFailures (org.gradle.internal.concurrent)
run:48, ManagedExecutorImpl$1 (org.gradle.internal.concurrent)
runWorker:1128, ThreadPoolExecutor (java.util.concurrent)
run:628, ThreadPoolExecutor$Worker (java.util.concurrent)
run:56, ThreadFactoryImpl$ManagedThreadRunnable (org.gradle.internal.concurrent)
run:834, Thread (java.lang)