aws / aws-sdk-java

The official AWS SDK for Java 1.x (In Maintenance Mode, End-of-Life on 12/31/2025). The AWS SDK for Java 2.x is available here: https://github.com/aws/aws-sdk-java-v2/
https://aws.amazon.com/sdkforjava
Apache License 2.0

Call to AWSApplicationAutoScalingClient.putScalingPolicy hangs indefinitely #3141

Closed lordpengwin closed 2 months ago

lordpengwin commented 2 months ago


Describe the bug

When deploying a service to ECS using the Java SDK, I've seen many instances where a call to AWSApplicationAutoScalingClient.putScalingPolicy() never returns. This happens about 10% of the time that I use this call. I've even set withSdkRequestTimeout() on the PutScalingPolicyRequest and it still hangs. Note: the policy does get applied to the service, but the call never returns.

Is this a known problem? Is there a way that I can debug or work around it?

Expected Behavior

The SDK call should return or time out.

Current Behavior

The call hangs forever.

Reproduction Steps

This is my code:

    autoScalingClient.putScalingPolicy(new PutScalingPolicyRequest()
        .withResourceId(resourceID)
        .withServiceNamespace(ServiceNamespace.Ecs)
        .withPolicyName(String.format(APPLICATION_SCALING_POLICY, deployedServiceName))
        .withScalableDimension(ScalableDimension.EcsServiceDesiredCount)
        .withPolicyType(PolicyType.TargetTrackingScaling)
        .withTargetTrackingScalingPolicyConfiguration(new TargetTrackingScalingPolicyConfiguration()
            .withPredefinedMetricSpecification(new PredefinedMetricSpecification()
                .withPredefinedMetricType(autoScaleConfig.getScaleUpMetric())
                .withResourceLabel(loadBalancerArn.substring(loadBalancerArn.indexOf("app/"))
                    + "/" + targetGroupARN.substring(targetGroupARN.indexOf("targetgroup/"))))
            .withTargetValue(autoScaleConfig.getScaleUpThreshold())
            .withScaleOutCooldown(autoScaleConfig.getScaleUpCooldown())
            .withScaleInCooldown(autoScaleConfig.getScaleDownCooldown())
        )
        .withSdkRequestTimeout(30000)
    );

Possible Solution

No response

Additional Information/Context

I've also seen this when doing the same call against SageMaker.

AWS Java SDK version used

1.12.435

JDK version used

17.0.6

Operating System and version

container ubi9-minimal:latest

debora-ito commented 2 months ago

It's unusual for the SDK client to hang indefinitely; I would expect the request to time out at some point. Since you see it across different clients, I wonder if the issue is related to ECS.

Have you tried to reproduce in a different environment outside a container? Are you setting any custom ClientConfiguration when creating the autoScalingClient? Can you generate the verbose wirelogs? Instructions here. Make sure to redact any sensitive information like access keys.

lordpengwin commented 2 months ago

I've seen this with both ECS and SageMaker, though in either case I'm making the call to an Auto Scaling Group. I believe that I've seen this both from the container and from an Amazon Linux development machine. I'm not setting a custom ClientConfiguration on the autoScalingClient. I've also had this happen in multiple AWS accounts. I will try to do some experiments today to see if I can recreate the problem consistently; it has happened randomly in the past. If I can, I will try to enable the wire logs as described above. I will also try to get a Java thread dump.
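For reference, a thread dump can be taken externally with `jstack <pid>`, or approximated from inside the JVM. The following is a minimal sketch using only the standard library; the class and method names (`NonDaemonThreads`, `liveNonDaemonThreadNames`) are hypothetical and not part of the SDK. It lists the live non-daemon threads, which are the ones that can keep a JVM from exiting after `main` returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NonDaemonThreads {

    // Returns the names of all live threads that are NOT daemon threads.
    // Any such thread (other than the one about to finish) keeps the JVM alive.
    static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Print each non-daemon thread with its state and stack, similar to a thread dump.
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.isDaemon()) {
                System.out.println("\"" + t.getName() + "\" state=" + t.getState());
                for (StackTraceElement frame : e.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }
}
```

Calling something like this just before the application is expected to exit may help show whether a leftover worker thread, rather than a blocked SDK call, is what keeps the process running.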

lordpengwin commented 2 months ago

So I might have been wrong here. I managed to get my application to hang again and it does not appear to be stuck where I thought it was. It appears that it is simply not exiting. A thread dump shows this running:

    "s3-transfer-manager-worker-1" #40 prio=5 os_prio=0 cpu=135446.59ms elapsed=8370.90s allocated=4078M defined_classes=95 tid=0x00007f327202d0d0 nid=0x70 waiting on condition [0x00007f323a4fe000]
       java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.6/Native Method)

I suspect that the problem is that an S3 transfer manager is not being cleaned up correctly.
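The general mechanism can be illustrated without the SDK. The sketch below is an assumption-laden stand-in, not the SDK's implementation: `PoolShutdownSketch`, `newWorkerPool`, and `shutDownAndWait` are hypothetical names, and a plain `ExecutorService` plays the role of the transfer manager's worker pool. It shows that a pool of non-daemon threads has to be shut down explicitly before the JVM can exit. In SDK 1.x, the corresponding cleanup call on `TransferManager` is, as far as I know, `shutdownNow()`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolShutdownSketch {

    // Stand-in for a client that owns a worker pool. The default thread
    // factory creates non-daemon threads, so idle workers keep the JVM alive.
    static ExecutorService newWorkerPool() {
        return Executors.newFixedThreadPool(2);
    }

    // Stops the pool and waits up to timeoutMillis for its threads to finish.
    // Returns true if the pool terminated within the timeout.
    static boolean shutDownAndWait(ExecutorService pool, long timeoutMillis)
            throws InterruptedException {
        pool.shutdownNow();
        return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = newWorkerPool();
        try {
            // Do some work on the pool and wait for it to complete.
            pool.submit(() -> System.out.println("work done")).get();
        } finally {
            // Without this call the idle worker thread stays parked
            // and the process does not exit after main returns.
            boolean terminated = shutDownAndWait(pool, 5000);
            System.out.println("pool terminated: " + terminated);
        }
    }
}
```

Putting the shutdown in a `finally` block means the cleanup runs even if the work throws, which is the usual way to avoid this kind of hang at exit.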

github-actions[bot] commented 2 months ago

This issue is now closed.

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

lordpengwin commented 2 months ago

I'm pretty sure that this was my problem. Thanks for the help.