aws / aws-sdk-java

The official AWS SDK for Java 1.x (In Maintenance Mode, End-of-Life on 12/31/2025). The AWS SDK for Java 2.x is available here: https://github.com/aws/aws-sdk-java-v2/
https://aws.amazon.com/sdkforjava
Apache License 2.0
4.13k stars 2.83k forks source link

What causes this spike in "socket closed" SslExceptions? #2540

Closed erodewald closed 3 years ago

erodewald commented 3 years ago

On a daily basis, my service makes DynamoDB requests, and occasionally sees latency spikes where the requests are lasting anywhere from 10 to 20 seconds—ultimately resulting in socket closure.

Describe the issue

In my AWS-hosted EKS containerized (on Debian Slim) Java (JDK11) application, I am using AWS SDK v1.11.837 to interact with our DynamoDB instance. Typical behavior is fine, but at least once per day, a container will begin seeing latency spikes which my JFR (datadog) tracing indicates the culprit as javax.net.ssl.SslException: Socket closed. Overall, my requests appear to succeed 🤔 but these long-duration DynamoDB operations cause service degradation.

I would like to better understand what the likely culprit of these socket closures are (I have not seen many of these in the Issues reported here which are attributed to SslException).

Steps to Reproduce

I cannot provide a code snippet to reproduce this, but I can provide traces and logs (below).

Current Behavior

DynamoDB requests through the SDK take a very long time and likely fail to update the item (unable to confirm). I do not have any data about connections available at the time, but I am curious if perhaps they are being exhausted.

javax.net.ssl.SSLException: Socket closed
at java.base/sun.security.ssl.Alert.createSSLException(Unknown Source)
at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
at java.base/sun.security.ssl.SSLTransport.decode(Unknown Source)
at java.base/sun.security.ssl.SSLSocketImpl.decode(Unknown Source)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(Unknown Source)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(Unknown Source)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
at jdk.internal.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy28.connect(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1330)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:5136)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:5103)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeUpdateItem(AmazonDynamoDBClient.java:4722)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.updateItem(AmazonDynamoDBClient.java:4688)
at com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper$SaveObjectHandler.doUpdateItem(DynamoDBMapper.java:871)
at com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper$2.executeLowLevelRequest(DynamoDBMapper.java:611)
at com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper$SaveObjectHandler.execute(DynamoDBMapper.java:750)
at com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper.save(DynamoDBMapper.java:640)
at com.amazonaws.services.dynamodbv2.datamodeling.AbstractDynamoDBMapper.save(AbstractDynamoDBMapper.java:123)
at com.smartthings.dynamodb.DefaultDynamoDBMapperService.lambda$save$30(DefaultDynamoDBMapperService.java:175)
at ratpack.exec.Blocking.lambda$op$7(Blocking.java:233)
at ratpack.exec.Blocking$1.lambda$get$0(Blocking.java:70)
at ratpack.exec.Blocking.intercept(Blocking.java:246)
at ratpack.exec.Blocking.lambda$intercept$8(Blocking.java:244)
at ratpack.dropwizard.metrics.internal.DefaultBlockingExecTimingInterceptor.intercept(DefaultBlockingExecTimingInterceptor.java:54)
at ratpack.exec.Blocking.intercept(Blocking.java:244)
at ratpack.exec.Blocking.lambda$intercept$8(Blocking.java:244)
at ratpack.logging.MDCInterceptor.intercept(MDCInterceptor.java:196)
at ratpack.exec.Blocking.intercept(Blocking.java:244)
at ratpack.exec.Blocking.access$000(Blocking.java:35)
at ratpack.exec.Blocking$1.get(Blocking.java:68)
at ratpack.exec.Blocking$1.get(Blocking.java:61)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.ru

There was a slight spike in volume of traffic (and as a result, a correlating spike in network Rx and Tx indicating we made more calls out to DynamoDB during this time).

Your Environment

Any guidance is appreciated. If I should direct this to AWS support I can also do that.

debora-ito commented 3 years ago

Seeing some occasional Socket closed errors is common, but seeing a spike of occurrences is not.

Can you share a sample code showing how you are creating the client?

I see the issues you are experiencing are in production but would it be possible to enable the client side metrics? https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/generating-sdk-metrics.html The Request Metrics will show number of connections, request time and other metrics that can help in toubleshooting.

erodewald commented 3 years ago

Thanks for taking a look. Here's the client creation, trimmed to the relevant parts:

@Provides
@Singleton
public static AmazonDynamoDB dynamoDBClient(DynamoDBConfig config) {
  AmazonDynamoDBClientBuilder builder = AmazonDynamoDBClientBuilder.standard()
    .withClientConfiguration(getClientConfiguration(config));
  builder.withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(config.getEndpointUrl(), config.getRegion())); 
  return builder.build();
}
private static ClientConfiguration getClientConfiguration(DynamoDBConfig config) {
  ClientConfiguration clientConfiguration = new ClientConfigurationFactory().getConfig()
    .withRetryPolicy(PredefinedRetryPolicies.DYNAMODB_DEFAULT);
  setTimeouts(clientConfiguration, config.getTimeouts());
  return clientConfiguration;
}

I can enable client metrics for this in production (it's one microservice among many—should be alright) but I'd like to know what SDK metrics I should look for in CloudWatch to make it easier to justify enabling it. It will take me some time to get that configured, deployed, and I'll report back when I have enough to share for debugging.

debora-ito commented 3 years ago

Thank you for the code snippets.

but I'd like to know what SDK metrics I should look for in CloudWatch to make it easier to justify enabling it.

It's hard to say at this point, as we don't have indication of what's closing the connections. I believe once you enable client side metrics you get all the available SDK metrics - for reference, a list of metrics can be found here: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/index.html?com/amazonaws/metrics/package-summary.html

github-actions[bot] commented 3 years ago

It looks like this issue hasn’t been active in longer than a week. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.