aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Too long sleep time on http400 error #501

Open mackyanyan opened 1 year ago

mackyanyan commented 1 year ago

I oversee the operation of over 200 on-premises machines with SSM, and we update the agent frequently. After I upgraded the agent from version 3.1.1927 to 3.2.286, persistent disconnections began. The following log shows one occurrence.

WARN [CredentialRefresher] Status code %!s(int=400) returned from AWS API. RequestId: c9f7f596-2b20-4b0c-bcee-XXXXXXXXXXXX Message: The connection was closed by the server at some point between the client sending the request and the client receiving the entire response
INFO [CredentialRefresher] Sleeping for 25h17m57s before retrying retrieve credentials

This happens daily on one of our servers. I don't believe a network fault is the root cause, because restarting the SSM Agent service restores the connection. The source code shows that the agent is told to wait 24 hours after an HTTP 400 error, which seems excessive. A few additional retry attempts, or similar measures, could prevent such long disconnections. We are now on version 3.2.419.0, and it still happens.

sluggard76 commented 1 year ago

We have deprecated the previous 3.2 versions. Please update to the latest SSM Agent v3.2.582.0. Please let us know if the problem persists.

Seantonomous commented 1 year ago

The logic to sleep for 24 hours is still present in v3.2.582.0

https://github.com/aws/amazon-ssm-agent/blob/7d0a6c29e6a44004830adb2d4052e2f4f63fa9f8/core/app/credentialrefresher/credentialrefresher.go#L54
https://github.com/aws/amazon-ssm-agent/blob/7d0a6c29e6a44004830adb2d4052e2f4f63fa9f8/core/app/credentialrefresher/credentialrefresher.go#L231

sluggard76 commented 1 year ago

@Seantonomous Got it. We'll be working on this issue.

mackyanyan commented 1 year ago

@sluggard76 @Seantonomous Thank you for your response.

Unfortunately, the situation is unchanged even with the latest version. The root cause is that HTTP 400 errors occur often, and each time one occurs we have to recover by manually restarting the SSM Agent, which we end up doing daily. It would help if the agent retried when the error occurs, or retried after a short sleep.

yysu commented 1 year ago

same issue +1

OkDoYun commented 1 year ago

The same problem occurs with 3.2.582.0 as well.