aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0
1.04k stars 323 forks

ssm-agent-worker max CPU usage at boot (infrequent) #353

Closed sapphirecat closed 3 months ago

sapphirecat commented 3 years ago

Sometimes, the ssm-agent-worker gets stuck consuming all CPU resources (e.g. 179%+ on a 2-vCPU instance) after reboot.

I'm not sure what's helpful, but I'm attaching the logs I have, except that I cut journalctl output down to the lines containing amazon-ssm only. The instance is run in America/New_York (GMT -05:00 currently) after initial configuration, and timestamps appear to be local.

I used top to send signal 15, then signal 9, to the worker (the first did not work) and the service did not appear to notice, so I restarted the whole snap.amazon-ssm-agent.amazon-ssm-agent.service service after a few more seconds (plus time it took to even find that name.)
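
For anyone who wants to script the manual recovery described above, here is a minimal sketch of the TERM-then-KILL escalation. The function name `terminate_with_escalation` and the 5-second default grace period are illustrative assumptions, not part of the agent or of any AWS tooling.

```shell
# Send SIGTERM to a PID, wait up to GRACE seconds, then fall back to SIGKILL.
# $1 = PID, $2 = grace period in seconds (default 5, an arbitrary choice).
terminate_with_escalation() {
  pid="$1"
  grace="${2:-5}"
  # Ask politely first; fail if the PID does not exist or cannot be signalled.
  kill -TERM "$pid" 2>/dev/null || return 1
  i=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$i" -ge "$grace" ]; then
      # Still present after the grace period: force it.
      kill -KILL "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 0
}
```

A possible invocation as root, assuming `pgrep -f ssm-agent-worker` matches exactly one process: `terminate_with_escalation "$(pgrep -f ssm-agent-worker)" 5`, followed by `systemctl restart snap.amazon-ssm-agent.amazon-ssm-agent.service` if the service does not notice the worker is gone.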

This AMI is customized, of course, but ultimately derives from the current Ubuntu EC2 releases listing, us-east-1 20.04 amd64.

Attachment: logs.zip

svoeller99 commented 3 years ago

I'm experiencing the same symptom, although much more frequently of late. This is impacting roughly 20-25% of newly launched instances.

kchitalia-amzn commented 3 years ago

@svoeller99 You are seeing the issue on Ubuntu as well?

svoeller99 commented 3 years ago

yes - we're running Ubuntu 18.04

saraiyakush commented 3 years ago

I am seeing the same issue on Ubuntu 20.04

voltuer commented 3 years ago

I have this same problem, really high CPU usage by amazon-ssm-agent, the machine gets totally unstable and there is no explanation.

What are we supposed to do ???

rjayanthi-prod commented 3 years ago

Hi @tr4g, we had the same issue; I am keeping an eye on this thread and saw your comment. Could you please let us know which OS and kernel version you are seeing the problem on? Thank you!

vb8448 commented 3 years ago

+1 same issue ...

iam-sysop commented 3 years ago

Windows Server 2019 - 4vCPU - SAME ISSUE.

VishnuKarthikRavindran commented 3 years ago

@tr4g @sapphirecat @svoeller99 @thecarnie @saraiyakush Thanks for reaching out, and sorry for the delay in responding. Are you still seeing this issue with the latest agent?

sapphirecat commented 3 years ago

@VishnuKarthikRavindran It was somewhat rare, happening maybe once every month or two, launching on average 1.2 instances per day. (Just infrequently enough that I never built a script to automatically handle the situation.) It hasn't happened again for me since I filed the issue, but I can't say with confidence that it's fixed.

We have continued to track the latest Ubuntu 20.04 AMI, so we should be getting both agent and kernel updates accordingly.
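
A check of the kind mentioned above could start from something like the sketch below. It only flags candidate worker PIDs and takes no action. The function name `hot_workers` and the 150% default threshold are assumptions made for illustration, and the `%CPU` value reported by `ps` is an average over the process lifetime, so it suits a worker that has been spinning since boot better than a brief spike.

```shell
# Print PIDs of ssm-agent-worker processes whose %CPU exceeds a threshold.
# Reads "PID %CPU COMMAND [ARGS...]" lines on stdin,
# e.g. from: ps -eo pid=,pcpu=,args=
# $1 = threshold in percent (default 150, an arbitrary choice).
hot_workers() {
  awk -v limit="${1:-150}" \
    '$3 ~ /ssm-agent-worker$/ && $2 + 0 > limit + 0 { print $1 }'
}
```

Possible usage: `ps -eo pid=,pcpu=,args= | hot_workers 150`.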

radykal-com commented 3 years ago

@VishnuKarthikRavindran for me it is still happening with version 3.0.1124.0 on Ubuntu 20.04:

snap info amazon-ssm-agent
name:      amazon-ssm-agent
summary:   Agent to enable remote management of your Amazon EC2 instance configuration
publisher: Amazon Web Services (aws✓)
store-url: https://snapcraft.io/amazon-ssm-agent
contact:   https://aws.amazon.com/contact-us/
license:   unset
description: |
  The SSM Agent runs on EC2 instances and enables you to quickly and easily
  execute remote commands or scripts against one or more instances. The agent
  uses SSM documents. When you execute a command, the agent on the instance
  processes the document and configures the instance as specified. Currently,
  the SSM Agent and Run Command enable you to quickly run Shell scripts on an
  instance using the AWS-RunShellScript SSM document.
commands:
  - amazon-ssm-agent.ssm-cli
services:
  amazon-ssm-agent: simple, enabled, inactive
snap-id:      T09mpujiTnzSdSCuqNkE7YXXTWDq13tC
tracking:     latest/stable/ubuntu-20.04
refresh-date: yesterday at 18:01 UTC
channels:
  latest/stable:    3.0.1124.0 2021-07-29 (4046) 26MB classic
  latest/candidate: 3.1.192.0  2021-08-19 (4662) 27MB classic
  latest/beta:      ↑
  latest/edge:      ↑
installed:          3.0.1124.0            (4046) 26MB classic

VishnuKarthikRavindran commented 3 years ago

Hi @radykal-com, Is this issue reproducible on your end? If possible, could you please check whether you are seeing this with the latest version? Thanks

radykal-com commented 3 years ago

Well, it's not easy to reproduce, as it happens randomly and with very low frequency. It happened to 6 or 7 instances out of a total of 100+. When it happens, it happens from the moment the instance starts. I decided to just uninstall it from our AMIs.

VishnuKarthikRavindran commented 3 years ago

Thanks @radykal-com for reaching out. We have made many improvements in the latest SSM Agent versions. Please let us know if the issue persists with the latest one, should you decide to use the agent again.

ghost commented 3 years ago

+1 here, Ubuntu 20.04. It happens every 10 minutes or so while running only a simple website in an nginx Docker container on a t2.micro. It locks the entire system at 100% CPU for about 5 minutes. Tried rebooting via the console and on the CLI.

This is pretty unacceptable and am interested in possibly receiving refund on my 3 reserved instances, how would I start that process so I can move to a more stable cloud server?

VishnuKarthikRavindran commented 3 years ago

Hi @WinterTFG, sorry to hear about that. Could you please share the repro steps with us if it is reproducible on your end?

As mentioned above, we have made many improvements in the latest SSM Agent versions. If possible, could you run with the latest one? Thanks.

mkdotam commented 2 years ago

I'm seeing similar behaviour on the latest:

summary:   Agent to enable remote management of your Amazon EC2 instance configuration
publisher: Amazon Web Services (aws✓)
store-url: https://snapcraft.io/amazon-ssm-agent
license:   unset
description: |
  The SSM Agent runs on EC2 instances and enables you to quickly and easily
  execute remote commands or scripts against one or more instances. The agent
  uses SSM documents. When you execute a command, the agent on the instance
  processes the document and configures the instance as specified. Currently,
  the SSM Agent and Run Command enable you to quickly run Shell scripts on an
  instance using the AWS-RunShellScript SSM document.
commands:
  - amazon-ssm-agent.ssm-cli
services:
  amazon-ssm-agent: simple, enabled, active
snap-id:      T09mpujiTnzSdSCuqNkE7YXXTWDq13tC
tracking:     latest/stable/ubuntu-20.04
refresh-date: 18 days ago, at 01:03 CEST
channels:
  latest/stable:    3.0.1124.0 2021-07-29 (4046) 26MB classic
  latest/candidate: 3.1.282.0  2021-09-09 (4750) 27MB classic
  latest/beta:      ↑
  latest/edge:      ↑
installed:          3.0.1124.0            (4046) 26MB classic

It was stale for 156 hours, and was eating 300% CPU.

VishnuKarthikRavindran commented 2 years ago

Hi @mkdotam, It looks like the installed agent version is 3.0.1124.0. Could you please check whether you are seeing this with latest version - 3.1.282.0? Thanks

Whale-Observer-App commented 2 years ago

Still happening using snap version 3.1.338.0. I'm running ubuntu-focal-20.04-arm64. Happened twice just today

VishnuKarthikRavindran commented 2 years ago

Hi @Whale-Observer-App, may I ask how you reproduced this? Also, could you please attach the logs if possible? Thanks.

tomaskovacik commented 2 years ago

just rebooted 2nd time today, amazon-ssm-agent, revision 4046

lilmidnit commented 2 years ago

This has started happening to me today. 50%+ CPU, consistently. New Windows servers, created by Elastic Beanstalk. I created a dump file of the process if that can help. IIS 10.0 running on 64bit Windows Server 2016/2.8.0

gcstr commented 2 years ago

I have the same here. For months now, ssm-agent on one of my instances has been randomly spiking the CPU to the point that the instance is not even accessible anymore.

AWS Ubuntu 20.04 amazon-ssm-agent 3.0.1124.0

tomaskovacik commented 2 years ago

Add swap as the first step after creating the instance:

https://aws.amazon.com/premiumsupport/knowledge-center/ec2-memory-swap-file/

Since doing this I have never had an issue with the agent.

VishnuKarthikRavindran commented 2 years ago

Thanks for reaching out again. We were able to reproduce this issue on our end. The fix shipped in the following agent release: https://github.com/aws/amazon-ssm-agent/releases/tag/3.1.426.0. Could you all please try updating to the latest one?

gcstr commented 2 years ago

Thanks! I just updated it. Given that the issue is pretty random, I can't immediately test. But I'll keep monitoring in the upcoming days.

Pxeba commented 2 years ago

I was dealing with this problem imagining it was some reaction to my code. However, after 3 days without much success I decided to try to send my code to another machine with the same operating system. It worked. My code stopped dying with this CPU spike coming from the ssm agent.

Ubuntu 20.04 Amazon ssm agent 4046

play-station commented 2 years ago

I don't know why this issue was closed if no solution was given. I'm experiencing it as well

baturinivan commented 1 year ago

Same situation today. I can't even log in to the console normally due to excessive load. The load average is over 15.

shashikachamod1992 commented 1 year ago

You can simply solve this problem by running sudo snap remove amazon-ssm-agent

You can find the full answer here. https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-uninstall-agent.html

lvlira commented 1 year ago

I don't know why this issue was closed if no solution was given. I'm experiencing it as well

Yes. Now, the same problem with me.

MarkBone commented 1 year ago

You can simply solve this problem by running sudo snap remove amazon-ssm-agent

You can find the full answer here. https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-uninstall-agent.html

The tutorial is ok, but when I remove amazon-ssm-agent, do I lose all aws monitoring functionality?

play-station commented 1 year ago

sudo snap remove amazon-ssm-agent

Seems like removing the amazon-ssm-agent will make us lose the monitoring functionality. So I think this is not an option

shashikachamod1992 commented 1 year ago

You can simply solve this problem by running sudo snap remove amazon-ssm-agent. You can find the full answer here. https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-uninstall-agent.html

The tutorial is ok, but when I remove amazon-ssm-agent, do I lose all aws monitoring functionality?

Agreed. This answer has a downside. Sometimes running a server 24/7 is more important than having monitoring reports.

VishnuKarthikRavindran commented 1 year ago

Thanks for reaching out. @shashikachamod1992 @architect-aimonkey Which agent version are you using? As mentioned in the previous comments, this issue was fixed in agent version 3.1.426.0: https://github.com/aws/amazon-ssm-agent/releases/tag/3.1.426.0. Could you please confirm?

forsythg commented 1 year ago

I can confirm that we appear to be impacted by this issue also, however, the version running is way higher, 3.1.1732.0 (have I missed something, should I be downgrading?)

This is similar to others, Ubuntu 20.04.5 (ARM64)

VishnuKarthikRavindran commented 1 year ago

@forsythg Could you please check whether multiple versions of the agent are running on the host? There should be only 1 amazon-ssm-agent and 1 ssm-agent-worker running on the instance. This can also be confirmed by looking at the agent logs in /var/log/amazon/ssm (Linux).
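
One way to run that check is sketched below. It counts processes by executable basename from `ps` output rather than using `pgrep -x`, because the 16-character name `amazon-ssm-agent` is longer than the 15 characters Linux keeps in the short process name. The helper name `count_by_basename` is hypothetical.

```shell
# Count processes whose executable basename equals $1.
# Reads "PID COMMAND [ARGS...]" lines on stdin, e.g. from: ps -eo pid=,args=
count_by_basename() {
  awk -v name="$1" '
    { n = split($2, parts, "/"); if (parts[n] == name) count++ }
    END { print count + 0 }
  '
}
```

Possible usage: `ps -eo pid=,args= | count_by_basename amazon-ssm-agent` and `ps -eo pid=,args= | count_by_basename ssm-agent-worker`. Each should print 1 on a healthy instance, per the comment above.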

forsythg commented 1 year ago

@forsythg Could you please check whether multiple versions of the agent are running on the host? There should be only 1 amazon-ssm-agent and 1 ssm-agent-worker running on the instance. This can also be confirmed by looking at the agent logs in /var/log/amazon/ssm (Linux).

I don't think so. These are the logs for the time period. Prior to this point the logs simply repeat; this is the first occurrence of the unexpected EOF, and the high CPU starts from here and continues until we killed PID 915 ourselves:

2022-10-31 15:14:27 INFO [ssm-agent-worker] [MessageService] [Association] Schedule manager refreshed with 0 associations, 0 new associations associated
2022-10-31 15:20:16 WARN [ssm-agent-worker] [MessageService] [MGSInteractor] Reach the retry limit 5 for receive messages. Error: websocket: close 1006 (abnormal closure): unexpected EOF
2022-10-31 15:20:18 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] send failed reply thread started
2022-10-31 15:20:18 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] send failed reply thread done
2022-10-31 15:20:18 INFO [ssm-agent-worker] [HealthCheck] HealthCheck reporting agent health.
2022-10-31 15:20:20 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Closing websocket channel connection to: wss://ssmmessages.eu-north-1.amazonaws.com/v1/control-channel/i-08a585ba48c526cfc?role=subscribe&stream=input
2022-10-31 15:20:20 WARN [ssm-agent-worker] [MessageService] [MGSInteractor] Failed to close websocket: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.171.16.46:50134->52.46.200.123:443: write: broken pipe
2022-10-31 15:20:20 WARN [ssm-agent-worker] [MessageService] [MGSInteractor] closing controlchannel failed with error: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.171.16.46:50134->52.46.200.123:443: write: broken pipe
2022-10-31 15:20:20 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Opening websocket connection to: wss://ssmmessages.eu-north-1.amazonaws.com/v1/control-channel/i-08a585ba48c526cfc?role=subscribe&stream=input
2022-10-31 15:20:20 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Ending websocket pinger
2022-10-31 15:20:20 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Successfully opened websocket connection to: wss://ssmmessages.eu-north-1.amazonaws.com/v1/control-channel/i-08a585ba48c526cfc?role=subscribe&stream=input
2022-10-31 15:20:20 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Ending websocket listener
2022-10-31 15:20:20 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Starting websocket pinger
2022-10-31 15:20:20 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Starting websocket listener
2022-10-31 15:25:10 INFO [ssm-agent-worker] [MessageService] [MessageHandler] started idempotency deletion thread
2022-10-31 15:25:12 INFO [ssm-agent-worker] [LongRunningPluginsManager] There are no long running plugins currently getting executed - skipping their healthcheck
2022-10-31 15:25:13 WARN [ssm-agent-worker] [MessageService] [MessageHandler] [Idempotency] encountered error open /var/lib/amazon/ssm/i-08a585ba48c526cfc/idempotency: no such file or directory while listing directories in /var/lib/amazon/ssm/i-08a585ba48c526cfc/idempotency
2022-10-31 15:25:13 INFO [ssm-agent-worker] [MessageService] [MessageHandler] ended idempotency deletion thread
2022-10-31 15:25:13 INFO [ssm-agent-worker] [HealthCheck] HealthCheck reporting agent health.
2022-10-31 15:25:16 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] send failed reply thread started
2022-10-31 15:25:16 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] send failed reply thread done
2022-10-31 15:25:17 INFO [ssm-agent-worker] [MessageService] [Association] Schedule manager refreshed with 0 associations, 0 new associations associated
2022-10-31 15:31:06 INFO [ssm-agent-worker] [HealthCheck] HealthCheck reporting agent health.
2022-10-31 15:31:06 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] send failed reply thread started
2022-10-31 15:31:06 WARN [ssm-agent-worker] [MessageService] [MGSInteractor] Reach the retry limit 5 for receive messages. Error: websocket: close 1006 (abnormal closure): unexpected EOF
2022-10-31 15:31:07 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] send failed reply thread done
2022-10-31 15:31:09 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Closing websocket channel connection to: wss://ssmmessages.eu-north-1.amazonaws.com/v1/control-channel/i-08a585ba48c526cfc?role=subscribe&stream=input
2022-10-31 15:31:09 WARN [ssm-agent-worker] [MessageService] [MGSInteractor] Failed to close websocket: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.171.16.46:60122->52.46.192.163:443: write: broken pipe
2022-10-31 15:31:09 WARN [ssm-agent-worker] [MessageService] [MGSInteractor] closing controlchannel failed with error: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.171.16.46:60122->52.46.192.163:443: write: broken pipe
2022-10-31 15:31:09 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Opening websocket connection to: wss://ssmmessages.eu-north-1.amazonaws.com/v1/control-channel/i-08a585ba48c526cfc?role=subscribe&stream=input
2022-10-31 15:31:09 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Ending websocket pinger
2022-10-31 15:31:09 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Successfully opened websocket connection to: wss://ssmmessages.eu-north-1.amazonaws.com/v1/control-channel/i-08a585ba48c526cfc?role=subscribe&stream=input
2022-10-31 15:31:09 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Ending websocket listener
2022-10-31 15:31:09 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Starting websocket pinger
2022-10-31 15:31:09 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Starting websocket listener
2022-10-31 15:39:31 INFO [amazon-ssm-agent] amazon-ssm-agent got signal:terminated value:0xffffb5c71280
2022-10-31 15:39:32 INFO [amazon-ssm-agent] Stopping Core Agent
2022-10-31 15:39:34 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Process ssm-agent-worker (pid:915) has been terminated, remove from worker pool
2022-10-31 15:39:34 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
2022-10-31 15:39:35 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:2474876) started
2022-10-31 15:39:35 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Receiving stop signal, stop worker monitor
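
To see how often the websocket retry limit is hit in a log like the one above, a small filter along these lines may help. It is only a sketch: the function name `retry_limit_times` is made up, and it assumes the timestamp occupies the first two whitespace-separated fields, as in the excerpt.

```shell
# Print the date and time of every "Reach the retry limit" warning.
# Reads agent log lines on stdin.
retry_limit_times() {
  awk '/Reach the retry limit/ { print $1, $2 }'
}
```

Possible usage, assuming the log file under /var/log/amazon/ssm is named amazon-ssm-agent.log: `retry_limit_times < /var/log/amazon/ssm/amazon-ssm-agent.log`.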

chandrakantmkarale commented 1 year ago

We are hit by the same issue. The CPU usage goes very high. On some instances we have seen the OOM error getting triggered and processes being killed.

10:44:23 ip-10-0-3-115 kernel: sshd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Jan 10 10:44:23 ip-10-0-3-115 kernel: CPU: 0 PID: 5052 Comm: sshd Not tainted 5.10.157-139.675.amzn2.x86_64 #1
Jan 10 10:44:23 ip-10-0-3-115 kernel: Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
Jan 10 10:44:23 ip-10-0-3-115 kernel: Call Trace:
Jan 10 10:44:23 ip-10-0-3-115 kernel: dump_stack+0x57/0x70
Jan 10 10:44:23 ip-10-0-3-115 kernel: dump_header+0x4a/0x1f4
Jan 10 10:44:23 ip-10-0-3-115 kernel: oom_kill_process.cold+0xb/0x10
Jan 10 10:44:23 ip-10-0-3-115 kernel: out_of_memory+0xed/0x2d0

SSM Agent Details
Installed Packages
Name : amazon-ssm-agent
Arch : x86_64
Version : 3.2.419.0
Release : 1
Size : 99 M

Will be trying the suggestion by @tomaskovacik to create a swap space.
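
To find OOM events like the one above in a longer kernel log, a filter such as the following could be used. It is a sketch under the assumption that the lines have the "<command> invoked oom-killer" shape shown in the excerpt; the name `oom_invokers` is hypothetical. Note that the process that invoked the OOM killer is not necessarily the one using the most memory.

```shell
# Print the command name from every "<command> invoked oom-killer" line.
# Reads kernel log lines on stdin, e.g. from: journalctl -k  or  dmesg
oom_invokers() {
  awk '/invoked oom-killer/ {
    for (i = 2; i <= NF; i++)
      if ($i == "invoked") print $(i - 1)
  }'
}
```

Possible usage: `journalctl -k | oom_invokers | sort | uniq -c`.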

jgfoster commented 1 year ago

I had an instance become unresponsive today (see here).

tomaskovacik commented 1 year ago

@chandrakantmkarale did it help?

thimslugga commented 1 year ago

Having some sort of swap space e.g. 1GiB swap file is necessary for the kernel to be able to perform proper memory management i.e. memory reclaim, etc.
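
A rough sketch of checking for swap and adding a 1 GiB swap file follows. The helper names are hypothetical, `/swapfile` is an assumed location, and the creation step needs root and is only defined here, not run.

```shell
# Print the configured swap size in KiB (0 means no swap).
# Reads /proc/meminfo-formatted text on stdin.
swap_total_kib() {
  awk '$1 == "SwapTotal:" { print $2 }'
}

# Create and enable a swap file. Requires root privileges via sudo.
# $1 = path (e.g. /swapfile, an assumed location), $2 = size in MiB.
create_swapfile() {
  sudo dd if=/dev/zero of="$1" bs=1M count="$2" &&
  sudo chmod 600 "$1" &&
  sudo mkswap "$1" &&
  sudo swapon "$1"
}
```

Possible usage: run `swap_total_kib < /proc/meminfo` and, if it prints 0, `create_swapfile /swapfile 1024`. A swap file enabled this way does not persist across reboots unless it is also listed in /etc/fstab.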

MarkBone commented 1 year ago

I fought a lot to solve this and finally got it! In my case it was a memory leak in my application: a function opened numerous threads and never killed them, until the system ran out of memory.

The fact that amazon-ssm-agent appears as the process consuming the most memory has absolutely nothing to do with it!

The problem is in fact your application: some function or library is leaking memory.

How to identify it?

Start by turning off services, or parts of your code, until you can tell which part or service is the culprit. Keep narrowing the funnel until you reach the line of code responsible.

Try this approach; I am 99.99% sure that's it.
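
As a starting point for that narrowing-down, the sketch below ranks processes by resident memory so that growth can be compared between runs. The function name `top_rss` and the default of five rows are assumptions made for illustration.

```shell
# Print the N processes with the largest resident set size (RSS).
# Reads "PID RSS COMMAND" lines on stdin, e.g. from: ps -eo pid=,rss=,comm=
# $1 = number of rows to print (default 5).
top_rss() {
  sort -k2,2nr | head -n "${1:-5}"
}
```

Possible usage: `ps -eo pid=,rss=,comm= | top_rss 10`, repeated a few minutes apart while services are switched off one at a time.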

VishnuKarthikRavindran commented 3 months ago

Hi all, we have made some improvements in the latest version. We also addressed a tight loop in version 3.2.1542.0: https://github.com/aws/amazon-ssm-agent/releases/tag/3.2.1542.0. Please reopen if the issue persists.

jgfoster commented 3 months ago

@VishnuKarthikRavindran – I've run apt update && apt list --upgradable and don't see ssm-agent-worker in the list. How do I get this fix?

thimslugga commented 3 months ago

@VishnuKarthikRavindran – I've run apt update && apt list --upgradable and don't see ssm-agent-worker in the list. How do I get this fix?

Are you running Ubuntu? If so, you likely are using the Snapped version of the SSM Agent and you would need to get the update via the Snap store.

snap list

snap refresh --list

snap info amazon-ssm-agent

sudo snap refresh

https://snapcraft.io/amazon-ssm-agent

https://snapcraft.io/docs/getting-started#heading--refreshing

https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-ubuntu-64-snap.html
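
To tell whether an installed version already includes the 3.2.1542.0 release mentioned above, the version strings can be compared as in this sketch. The helper name `version_ge` is hypothetical, and it relies on the `-V` (version sort) option of GNU `sort`.

```shell
# Succeed (exit 0) if version $1 is greater than or equal to version $2.
# Uses GNU sort's version ordering: the smaller of the two sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Possible usage: take the installed version from the second column of `snap list amazon-ssm-agent`, then run `version_ge "$installed" 3.2.1542.0 && echo "fix included"`.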

jgfoster commented 3 months ago

Ah, I see. I didn't realize that the fix was implemented nine (9) months ago!