Closed: oripwk closed this issue 4 years ago
observed the same using getMetricData
1) How often are you seeing this issue? 2) Do you have test code to consistently repro the issue?
As discussed in #452, we made some fixes which reduced this error occurrence a lot, but there is still a small chance of seeing this error. Code to repro will help us debug the issue further. I would also suggest trying the latest version to see if there is any improvement.
We have the same issue as well. We upgraded the AWS SDK to the latest version 2.7.32 in the hope that the issue would be fixed, but unfortunately it didn't help. This issue happens when the application runs for a long time without a restart; as soon as the application is restarted, we don't see the issue anymore until the next prolonged run. It usually happens about 24 hours after the application starts. Unfortunately I do not have reproducible code; we just use the default metrics publishing configuration.
@varunnvs92
I think that because it's reproducing for other people and it's very similar to https://github.com/aws/aws-sdk-java-v2/issues/452, it deserves more attention.
We've found that the only solution for now to this issue and #452 is to use the sync clients, which use the Apache HTTP client instead of Netty.
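For us that means building the client roughly like this (a minimal sketch; the Apache-based client is the default HTTP client for the sync builder, shown explicitly here for clarity):

import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;

CloudWatchClient cloudWatch = CloudWatchClient.builder()
        .httpClientBuilder(ApacheHttpClient.builder())
        .build();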
Apologies for the delayed response. Unfortunately, I'm not able to reproduce the issue. Can you provide more details on how to reproduce it? For example, the APIs you are using and the number of requests per second.
I can see "Server failed to send complete response" errors with retry disabled, because it's possible that the connection gets closed remotely mid-request; normally the next retry should pick up a new connection and succeed. Is there any chance you disabled retries on the SDK client? Can you provide a code snippet of how you create the client?
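For reference, retries would only be off if they had been disabled explicitly, which would look roughly like this (a sketch; RetryPolicy.none() is what turns retries off):

import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

// With no retries, a single remotely-closed connection surfaces directly as
// "Server failed to send complete response" instead of being retried.
CloudWatchAsyncClient noRetryClient = CloudWatchAsyncClient.builder()
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                .retryPolicy(RetryPolicy.none())
                .build())
        .build();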
We are using the default config and default connection parameters:
CloudWatchAsyncClient.builder().build()
And it's not single or random requests that fail. Like @oripwk said, it's a constant failure; all requests fail. We are not triggering it very heavily either: bursts of about 5 requests per second, once a minute. And it's surely related to the HTTP client, as there are zero problems with the Apache HTTP client.
@trautonen Thank you for sharing the info. Unfortunately, still no luck reproducing the issue.
I tried sending 5 concurrent requests every minute for a total of 20 minutes and all requests succeeded; I did notice that some requests failed on the first attempt but succeeded on the next retry.
Is there any way you can provide debug logs? I just wanted to check whether all retry attempts failed for the same reason in your use case.
StabilityTestRunner.newRunner()
.testName("CloudWatchAsyncStabilityTest." + testName)
.futureFactory(() -> cloudWatchAsyncClient.putMetricData(b -> b.namespace(namespace)
.metricData(metrics)))
.totalRuns(20)
.requestCountPerRun(5)
.delaysBetweenEachRun(Duration.ofMinutes(1))
.run();
Could you try setting connectionMaxIdleTime to 5 seconds? In my tests, reducing the max idle time largely reduced the chances of an idle connection getting closed mid-request on the service side.
CloudWatchAsyncClient.builder()
.httpClientBuilder(NettyNioAsyncHttpClient.builder()
.connectionMaxIdleTime(Duration.ofSeconds(5)))
.build();
We have pushed out the change to update the default connectionMaxIdleTime for the CloudWatch client via #1450. The change has been released as part of 2.9.14.
I got a similar issue (exactly the same exception) and I wonder if it will also be fixed by this.
I create a CloudWatch client like this:
private static final AwsCredentialsProvider CREDENTIALS_PROVIDER = AwsCredentialsProviderChain.of(
ProfileCredentialsProvider.create(),
EnvironmentVariableCredentialsProvider.create(),
ContainerCredentialsProvider.builder().asyncCredentialUpdateEnabled(true).build(),
InstanceProfileCredentialsProvider.create()
);
private static CloudWatchAsyncClient initializeCloudWatchClient() {
return CloudWatchAsyncClient.builder()
.credentialsProvider(CREDENTIALS_PROVIDER)
.build();
}
My metrics are sent properly for a few days and then suddenly, in the middle of the night, most of the requests start failing with the same exception the author mentioned. The traffic levels are actually lower during the night, so it seems to be a slightly different scenario.
I will try the new version either way, but if you see anything obvious that I'm missing let me know.
Hi @sycyhy, have you tried the new version? Are you still experiencing the issue?
Hi @zoewangg - yep, it's in production right now, but I will only be able to confirm after the weekend, because during the week there are a number of deployments and the issue pops up when you leave the app without any restarts for a few days :)
Seeing the same issue when using it with KCL, under no particular load.
@kaffepanna Are you still experiencing the issue with the latest version?
@zoewangg using version 2.9.20
Yep, my issue is still present - updating the library to 2.9.14 sadly did not help
@sycyhy Thank you for reporting back. Looks like the issue occurs after letting the app run for a while. Did it ever recover after the errors, or did all subsequent requests fail?
I'll continue investigating.
@zoewangg nope, it does not recover; the app needs to be restarted. I also just noticed that the IOException changed slightly with the new version:
Caused by: s.a.a.c.e.SdkClientException: null
at s.a.a.c.e.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
at s.a.a.c.i.u.ThrowableUtils.asSdkException(ThrowableUtils.java:98)
at s.a.a.c.i.h.p.s.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:125)
... 35 common frames omitted
Caused by: java.io.IOException: Channel was closed before it could be written to.
... 28 common frames omitted
Were all errors "IOException: Channel was closed before it could be written to" or just some of them?
They were caused by the same underlying reason: the server closed the connection remotely in the middle of the request. But for this error, it seems the connection was closed before the SDK was able to write the request.
Yep, you are right, only some of them (actually only 2 of those, compared to hundreds of the "Server failed to send complete response" errors).
Still not able to reproduce the exact all-requests-fail issue, but I found one issue in the SDK: in certain cases when Connection: close is present in the response header, the SDK does not close the connection as expected and will try to reuse that connection for other requests. Thus, if the next request picks up that bad connection right before the service closes it, an IOException will be thrown once the connection is closed remotely.
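Until that fix is out, one way to lower the chances of reusing a stale connection is to cap how long pooled connections are kept, along the lines of the earlier connectionMaxIdleTime suggestion (a sketch; the exact durations are arbitrary):

import java.time.Duration;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

CloudWatchAsyncClient client = CloudWatchAsyncClient.builder()
        .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                // drop idle connections quickly so a connection the service is about
                // to close is less likely to be picked up for a new request
                .connectionMaxIdleTime(Duration.ofSeconds(5))
                // also cap the total lifetime of pooled connections
                .connectionTimeToLive(Duration.ofSeconds(30)))
        .build();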
@sycyhy, thanks for raising this; we're seeing a very similar issue, and thanks to all who are working to resolve it. Lukasz, I'm wondering whether you are seeing any associated CPU spikes when this problem arises?
Hi @sycyhy @robinspollak @kaffepanna,
The fix has been released as part of 2.9.24. Could you try the latest version and see if the issue is resolved for your use case? I was not able to reproduce the exact issues you were seeing, so it would be great if you could help verify the fix.
@zoewangg, 2.9.24 is in production now and I'll report back! Thanks very much.
@zoewangg, unfortunately we're still seeing the following error:
java.lang.RuntimeException: software.amazon.awssdk.core.exception.SdkClientException
at software.amazon.kinesis.retrieval.AWSExceptionManager.apply(AWSExceptionManager.java:65)
at software.amazon.kinesis.metrics.CloudWatchMetricsPublisher.blockingExecute(CloudWatchMetricsPublisher.java:89)
at software.amazon.kinesis.metrics.CloudWatchMetricsPublisher.publishMetrics(CloudWatchMetricsPublisher.java:74)
at software.amazon.kinesis.metrics.CloudWatchPublisherRunnable.runOnce(CloudWatchPublisherRunnable.java:138)
at software.amazon.kinesis.metrics.CloudWatchPublisherRunnable.run(CloudWatchPublisherRunnable.java:84)
at java.lang.Thread.run(Thread.java:748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: null
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
at software.amazon.awssdk.core.internal.util.ThrowableUtils.asSdkException(ThrowableUtils.java:98)
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:125)
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java:107)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.onError(MakeAsyncHttpRequestStage.java:249)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.lambda$notifyIfResponseNotCompleted$2(ResponseHandler.java:391)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:174)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.notifyIfResponseNotCompleted(ResponseHandler.java:391)
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.channelInactive(ResponseHandler.java:142)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
...
It is the same as the one we were seeing before the new release; let me know if a full stack trace or a sample of how we are setting up our Kinesis client would help.
@robinspollak Thanks for reporting back. Yes, it'd be helpful if you could share the sample code of how you set up the CloudWatch client. Did all requests fail, or just some of them? Are you running it on an EC2 instance? If so, could you share what type that instance is?
@zoewangg We set up our (Scala) KCL integration as follows:
class KinesisConfigurationProvider @Inject()(
configuration: Configuration,
recordProcessorFactory: ShardRecordProcessorFactory
) {
def apply(streamName: String): ConfigsBuilder = {
val workerId = InetAddress.getLocalHost.getCanonicalHostName + ":" + UUID
.randomUUID()
val kinesisAsyncClient = KinesisAsyncClient.builder().region(Region.EU_WEST_1).build()
val dynamoDBClient = DynamoDbAsyncClient.builder().region(Region.EU_WEST_1).build()
val cloudWatchClient = CloudWatchAsyncClient.builder().region(Region.EU_WEST_1).build()
new ConfigsBuilder(
streamName,
s"app-$streamName-processor",
kinesisAsyncClient,
dynamoDBClient,
cloudWatchClient,
workerId,
recordProcessorFactory
)
}
}
It seems that only some requests fail. Upon deploy or reboot we won't see the log lines for about 3 hours. They then escalate in frequency for around 24 hours and, eventually, the server stops responding and needs to be rebooted. Our servers run on m4.xlarge EC2 instances. Hope this helps; I'm happy to provide more information.
@robinspollak Thanks for the info. A few follow-up questions: did you see any other non-CloudWatch errors before the server stopped responding? When the server stopped responding, did all requests fail with "Server failed to send complete response", or did it just get stuck somehow? What KCL version are you using?
@sycyhy are you also using KCL or just using cloudwatch async client?
Hi @zoewangg, sorry for the delayed response. We don't really see any other non-CloudWatch errors, but we're also trying our best to determine with certainty that CloudWatch is causing the problems. Thus far, we haven't been able to rule out another source of the problem.
In terms of the server behavior when it stops responding: the entire server simply ceases to respond to any requests, so there are no error responses at all. We're using KCL 2.2.4.
Thanks again for all your help.
@zoewangg I use it via Micrometer, not KCL, which is a layer of abstraction over the async client; I pass the client in manually when I configure the reporter.
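The wiring is roughly this (a simplified sketch, assuming the micrometer-registry-cloudwatch2 module; the namespace is made up):

import io.micrometer.cloudwatch2.CloudWatchConfig;
import io.micrometer.cloudwatch2.CloudWatchMeterRegistry;
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

CloudWatchAsyncClient cloudWatchAsyncClient = CloudWatchAsyncClient.builder().build();

CloudWatchConfig config = new CloudWatchConfig() {
    @Override
    public String get(String key) {
        return null; // fall back to defaults for step interval, batch size, etc.
    }

    @Override
    public String namespace() {
        return "my-app"; // made-up namespace
    }
};

// The registry publishes metrics through the async client passed in here, so any
// Netty-level connection problem surfaces as the errors described in this issue.
MeterRegistry registry = new CloudWatchMeterRegistry(config, Clock.SYSTEM, cloudWatchAsyncClient);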
btw -> I needed to revert to the legacy client due to this (I could not wake up more people during the night sadly haha).
Hi @robinspollak, for the server-not-responding issue, my guess is that it might not be related to the CloudWatch errors. I'd recommend taking a thread dump or heap dump to see if there's a blocked thread or increased resource usage causing the server to stop responding.
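For example, the JDK's jstack tool can capture a thread dump, or it can be done in-process with the standard management API (a minimal sketch):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Print a stack trace for every live thread; useful for spotting threads that are
// permanently blocked when the server stops responding.
for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
    System.out.print(info);
}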
We have pushed out a couple of fixes, #1481 and #1479 (available in 2.10.0), for issues that could cause the "Server failed to send complete response" errors. KCL has also released a new version, 2.2.5. Could you try the new versions?
@sycyhy Sorry to hear that you had to revert the client code 😞 We are currently working on setting up more test scenarios that cover more real-world use cases to catch these issues.
Closing the issue due to lack of activity. Please create a new issue if you continue to see this error with the latest SDK.
@oripwk Hi, I have read this post and others like it, and I notice that you never confirmed whether the updates to aws-sdk-java-v2 fixed your bug. I am currently using version 2.10.41 and the error persists in this version. Did you find out how to fix this bug?
This issue is exactly the same as #452, but it happens in CloudWatch instead of S3.
Expected Behavior
putMetricData(…) should not throw an exception.
Current Behavior
java.io.IOException: Server failed to send complete response is thrown several times a minute. Most of the requests seem to succeed.
Possible Solution
I have no idea.
Steps to Reproduce (for bugs)
Create a CloudWatch client that sends putMetricData requests about 100 times a minute, where each request has 20 metrics, and each metric contains 150 values.
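A rough sketch of such a client (the namespace and metric names are made up; the schedule only approximates the real traffic):

import java.time.Instant;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;

CloudWatchAsyncClient client = CloudWatchAsyncClient.builder().build();

// Roughly 100 putMetricData calls per minute, each carrying 20 metrics with 150 values.
Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
    List<MetricDatum> metrics = IntStream.range(0, 20)
            .mapToObj(i -> MetricDatum.builder()
                    .metricName("TestMetric" + i)
                    .timestamp(Instant.now())
                    .values(IntStream.range(0, 150)
                            .mapToObj(v -> (double) v)
                            .collect(Collectors.toList()))
                    .build())
            .collect(Collectors.toList());

    client.putMetricData(b -> b.namespace("ReproTest").metricData(metrics))
          .whenComplete((response, error) -> {
              if (error != null) {
                  error.printStackTrace(); // "Server failed to send complete response" shows up here
              }
          });
}, 0, 600, TimeUnit.MILLISECONDS);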
Context
Monitoring a real-time streaming application.
Your Environment
Linux 4.14.62-65.117.amzn1.x86_64 x86_64 GNU/Linux