aws / aws-sdk-java-v2

The official AWS SDK for Java - Version 2
Apache License 2.0

Child channel count was caught attempting to be increased over max concurrency #1803

Open lbourdages opened 4 years ago

lbourdages commented 4 years ago

I found a log at WARN level that told me to contact the Java SDK team, so here's the bug report.

Describe the bug

We run a Kinesis Client Library application. During what appears to be normal operation, a log message was printed telling me to contact the Java SDK team.

Expected Behavior

N/A

Current Behavior

Here's the log in its entirety:

Child channel count was caught attempting to be increased over max concurrency. Please report this issue to the AWS SDK for Java team.

The logger name is "software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord"

Steps to Reproduce

It happened out of nowhere; I don't know how to reproduce it.

Possible Solution

N/A

Context

KCL github: https://github.com/awslabs/amazon-kinesis-client

Your Environment

debora-ito commented 4 years ago

Hi @lbourdages, thank you for being diligent and reporting this to us. I understand the issue is happening out of nowhere, but can you answer some questions to help us isolate the case so we can try to reproduce it on our side?

lbourdages commented 4 years ago

Hi!

Here are a few minutes of logs. Note that I've scrubbed out logs from our business logic, but they should not have any impact on the operation of the Netty internals anyway.

logs.txt

The service had been running for about 25 hours at the time the log happened.

Several hours before, we had two instances of this:

Apr 30, 2020 @ 01:43:18.302 WARN software.amazon.awssdk.http.nio.netty.internal.ResponseHandler Subscriber cancelled before all events were published
Apr 29, 2020 @ 22:47:53.531 WARN software.amazon.awssdk.http.nio.netty.internal.ResponseHandler Subscriber cancelled before all events were published

That's all I can find.

Bas83 commented 3 years ago

We just experienced this too on version 2.3.3 of the amazon-kinesis-client, after upgrading from 2.2.5. We're also seeing fairly frequent logs saying "An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception." with the following stack trace:

software.amazon.awssdk.http.nio.netty.internal.http2.Http2ConnectionTerminatingException: An I/O error occurred on an associated Http2 stream [id: 0x16f5249b, L:/10.0.2.231:44260 ! R:kinesis.eu-west-1.amazonaws.com/99.80.34.137:443](H2 - 8233)
    at software.amazon.awssdk.http.nio.netty.internal.http2.Http2StreamExceptionHandler.exceptionCaught(Http2StreamExceptionHandler.java:50)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at software.amazon.awssdk.http.nio.netty.internal.UnusedChannelExceptionHandler.exceptionCaught(UnusedChannelExceptionHandler.java:52)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
    at io.netty.handler.codec.http2.AbstractHttp2StreamChannel.windowUpdateFrameWriteComplete(AbstractHttp2StreamChannel.java:118)
    at io.netty.handler.codec.http2.AbstractHttp2StreamChannel.access$000(AbstractHttp2StreamChannel.java:57)
    at io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe.updateLocalWindowIfNeeded(AbstractHttp2StreamChannel.java:850)
    at io.netty.handler.codec.http2.AbstractHttp2StreamChannel$Http2ChannelUnsafe.beginRead(AbstractHttp2StreamChannel.java:780)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.read(DefaultChannelPipeline.java:1362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeRead(AbstractChannelHandlerContext.java:686)
    at io.netty.channel.AbstractChannelHandlerContext.read(AbstractChannelHandlerContext.java:671)
    at software.amazon.awssdk.http.nio.netty.internal.http2.FlushOnReadHandler.read(FlushOnReadHandler.java:40)
    at io.netty.channel.AbstractChannelHandlerContext.invokeRead(AbstractChannelHandlerContext.java:686)
    at io.netty.channel.AbstractChannelHandlerContext.read(AbstractChannelHandlerContext.java:671)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.requestDemand(HandlerPublisher.java:130)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HttpStreamsHandler$1.requestDemand(HttpStreamsHandler.java:191)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.flushBuffer(HandlerPublisher.java:345)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.receivedDemand(HandlerPublisher.java:291)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.access$200(HandlerPublisher.java:61)
    at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher$ChannelSubscription$1.run(HandlerPublisher.java:495)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Thread.java:832)

This happened 4 times in the last day. These don't seem to occur at the same time as the other error, though.

We're going to roll back to the old version now. Would appreciate any update on this.

debora-ito commented 3 years ago

@Bas83 thank you for reporting.

Regarding the Http2ConnectionTerminatingException, there's an open issue in the amazon-kinesis-client repository; I suggest you upvote it there too: https://github.com/awslabs/amazon-kinesis-client/issues/755

mosche commented 2 years ago

I'm facing the same issue when writing a stream of records (batch API) to Kinesis using KinesisAsyncClient (v 2.17.106/127). The scenario is as follows: the stream is processed by x threads in parallel (Apache Spark tasks), and each thread triggers up to y concurrent requests. All threads share one instance of the async client. Currently I'm still testing locally, not even on a cluster.

Setting x in particular, but also y, too high will repeatedly fail the stream within 1 minute. Even setting numRetries unreasonably high does not prevent the issue.

Initially I'm seeing this periodically:

22/02/11 18:28:48 WARN software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord: [Channel: 11fa7015] Child channel count was caught attempting to be increased over max concurrency. Please report this issue to the AWS SDK for Java team.

soon after, followed by:

java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: The connection was closed during the request. The request will usually succeed on a retry, but if it does not: consider disabling any proxies you have configured, enabling debug logging, or performing a TCP dump to identify the root cause. If this is a streaming operation, validate that data is being read or written in a timely manner. Channel Information: ChannelDiagnostics(channel=[id: 0x551a2dfd, L:/192.168.178.20:50484 - R:kinesis.eu-central-1.amazonaws.com/3.123.12.193:443](H2 - 7), channelAge=PT0.001S, requestCount=1) Parent Channel Information: ChannelDiagnostics(channel=[id: 0x551a2dfd, L:/192.168.178.20:50484 - R:kinesis.eu-central-1.amazonaws.com/3.123.12.193:443], channelAge=PT10.166S, requestCount=0)
    at software.amazon.awssdk.utils.CompletableFutureUtils.errorAsCompletionException(CompletableFutureUtils.java:62)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncExecutionFailureExceptionReportingStage.lambda$execute$0(AsyncExecutionFailureExceptionReportingStage.java:51)
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)

An interesting observation is that for x<=2 even a high y works well. A higher x, on the other hand, quickly leads to the issue above if y>=2.
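
For reference, here is a minimal sketch of the sharing pattern described above, not the exact reproduction code: one KinesisAsyncClient over the SDK's Netty async HTTP client, shared by x worker threads that each keep y PutRecords batches in flight. The stream name, payloads, region, concurrency limit and the x/y values are placeholders.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.PutRecordsRequestEntry;
import software.amazon.awssdk.services.kinesis.model.PutRecordsResponse;

public class SharedKinesisClientSketch {

    public static void main(String[] args) {
        int x = 8; // parallel worker threads (Spark tasks in the report above)
        int y = 4; // concurrent PutRecords requests per thread

        // One async client shared by all threads, as in the scenario above.
        KinesisAsyncClient kinesis = KinesisAsyncClient.builder()
                .region(Region.EU_CENTRAL_1)
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        .maxConcurrency(50)) // HTTP-level concurrency limit (placeholder value)
                .build();

        ExecutorService pool = Executors.newFixedThreadPool(x);
        for (int t = 0; t < x; t++) {
            pool.submit(() -> {
                for (int iteration = 0; iteration < 1_000; iteration++) {
                    // Fire y batch requests concurrently, then wait for all of them.
                    List<CompletableFuture<PutRecordsResponse>> inFlight = IntStream.range(0, y)
                            .mapToObj(i -> kinesis.putRecords(batch()))
                            .collect(Collectors.toList());
                    CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
                }
            });
        }
        pool.shutdown();
    }

    private static PutRecordsRequest batch() {
        // A dummy 100-record batch; real payloads and partition keys will differ.
        List<PutRecordsRequestEntry> entries = IntStream.range(0, 100)
                .mapToObj(i -> PutRecordsRequestEntry.builder()
                        .partitionKey("pk-" + i)
                        .data(SdkBytes.fromUtf8String("payload-" + i))
                        .build())
                .collect(Collectors.toList());
        return PutRecordsRequest.builder()
                .streamName("my-test-stream") // placeholder stream name
                .records(entries)
                .build();
    }
}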

mosche commented 2 years ago

@debora-ito I tried looking into the issue. The strange thing is, when using a custom build of netty-nio-client (tag 2.17.127, openjdk 1.8.0_322) everything works fine. As soon as I use the jar published to Maven, I'm reliably running into the issue above. But I don't see any difference except for some build metadata.

debora-ito commented 2 years ago

@mosche was the custom build of netty-nio-client built by you/your team? How are you checking the differences?

mosche commented 2 years ago

@debora-ito All I did was run mvn clean install -pl :netty-nio-client -P quick -am on the original code, though it turns out I didn't pay close enough attention to transitive dependencies.

Digging further, this turns out to be a classpath issue with different transitive versions of Netty. In my case Spark 2.4.7 uses io.netty:netty-all:4.1.47.Final, which is a fat jar containing the classes of all Netty submodules. netty-all:4.1.72.Final, on the other hand, has become an uber jar that only declares dependencies on the submodules. It looks like this mix causes incompatible classes to be loaded at runtime, which triggers the issue.

In my case forcing any consistent version of netty solves the issue (a Maven dependencyManagement sketch follows the diff below).

 io.netty:netty:3.9.9.Final
-io.netty:netty-all:4.1.47.Final
+io.netty:netty-all:4.1.72.Final
 io.netty:netty-buffer:4.1.72.Final
 io.netty:netty-codec:4.1.72.Final
+io.netty:netty-codec-dns:4.1.72.Final
+io.netty:netty-codec-haproxy:4.1.72.Final
 io.netty:netty-codec-http:4.1.72.Final
 io.netty:netty-codec-http2:4.1.72.Final
+io.netty:netty-codec-memcache:4.1.72.Final
+io.netty:netty-codec-mqtt:4.1.72.Final
+io.netty:netty-codec-redis:4.1.72.Final
+io.netty:netty-codec-smtp:4.1.72.Final
+io.netty:netty-codec-socks:4.1.72.Final
+io.netty:netty-codec-stomp:4.1.72.Final
+io.netty:netty-codec-xml:4.1.72.Final
 io.netty:netty-common:4.1.72.Final
 io.netty:netty-handler:4.1.72.Final
+io.netty:netty-handler-proxy:4.1.72.Final
 io.netty:netty-resolver:4.1.72.Final
+io.netty:netty-resolver-dns:4.1.72.Final
+io.netty:netty-resolver-dns-classes-macos:4.1.72.Final
+io.netty:netty-resolver-dns-native-macos:4.1.72.Final
 io.netty:netty-tcnative-classes:2.0.46.Final
 io.netty:netty-transport:4.1.72.Final
 io.netty:netty-transport-classes-epoll:4.1.72.Final
+io.netty:netty-transport-classes-kqueue:4.1.72.Final
+io.netty:netty-transport-native-epoll:4.1.72.Final
+io.netty:netty-transport-native-kqueue:4.1.72.Final
 io.netty:netty-transport-native-unix-common:4.1.72.Final
+io.netty:netty-transport-rxtx:4.1.72.Final
+io.netty:netty-transport-sctp:4.1.72.Final
+io.netty:netty-transport-udt:4.1.72.Final
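
For anyone hitting the same mix of Netty versions, one way to force a consistent set in Maven is to import the published io.netty:netty-bom in dependencyManagement, so every io.netty artifact (including netty-all) resolves to the same version. The snippet below is a sketch rather than something taken from this thread; the version shown is just the one discussed above.

<dependencyManagement>
  <dependencies>
    <!-- Pins every transitively pulled-in io.netty artifact to one version. -->
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-bom</artifactId>
      <version>4.1.72.Final</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

The resolved versions can then be checked with mvn dependency:tree -Dincludes=io.netty.
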
mosche commented 2 years ago

Looks like 4.1.69.Final is the first correct uber jar that doesn't contain classes itself ...

mosche commented 2 years ago

@debora-ito Also wondering, is com.typesafe.netty:netty-reactive-streams-http an intentional dependency of netty-nio-client?

debora-ito commented 2 years ago

Yes.

https://github.com/aws/aws-sdk-java-v2/blob/76aef121ca8a5abfe261bf24acbee06efeeed668/http-clients/netty-nio-client/pom.xml#L90-L93

debora-ito commented 1 year ago

@lbourdages @Bas83 we want to investigate this error further, but we need sample code so we can reproduce it on our side.

Do you have self-contained repro code that reliably shows the error, using the latest version of the SDK?

Bas83 commented 1 year ago

@lbourdages @Bas83 we want to investigate this error further, but we need sample code so we can reproduce it on our side.

Do you have self-contained repro code that reliably shows the error, using the latest version of the SDK?

I'm afraid we don't. It's been a while, and we've been running version 2.4.3 for some time now. The only problem we have with that version (or rather with 2.4.1 before it, but we haven't tried again) is that updating to Java 17 (from 16) mysteriously makes Kinesis skip part of the records while consuming.