aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

Feature: OOM killer protection similar to sshd #580

Open erpel opened 3 months ago

erpel commented 3 months ago

I initially opened this against amazonlinux, but it makes more sense in this project:

When the system is under memory pressure, I've repeatedly seen ssm-agent get killed by the OOM killer. Once ssm-agent is gone we can no longer log in and observe the system, which makes the situation hard to debug.

I'd like ssm-agent to run with the same OOM killer protection that sshd applies to its own process (an OOM score adjustment of -1000).

The alternative would be to stop using SSM for login and switch to SSH, but that puts additional overhead on us: administering user accounts and SSH keys. SSM Session Manager is a useful feature that would really benefit from extra effort to increase its stability.

This old bug https://bugzilla.redhat.com/show_bug.cgi?id=1010429#c0 contains some details about how it used to work with sshd - especially making sure that user processes spawned by the "protected" server don't inherit the strict protection of oom_score_adj -1000.
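
To illustrate the pattern described above (protect the server process, but reset the score in anything it spawns), here is a minimal sketch. It is written in Python for brevity and is not the agent's or sshd's actual code; the function names `read_oom_score_adj`, `write_oom_score_adj` and `spawn_unprotected` are hypothetical. The only real interface it relies on is the Linux file `/proc/self/oom_score_adj`, which accepts values from -1000 to 1000. Lowering the value normally requires root (CAP_SYS_RESOURCE).

```python
import subprocess

# Valid range for oom_score_adj on Linux; -1000 exempts the process
# from the OOM killer entirely.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000

# Real procfs path; the `path` parameters below exist only so the
# helpers can be exercised against an ordinary file.
SELF_OOM_PATH = "/proc/self/oom_score_adj"


def read_oom_score_adj(path=SELF_OOM_PATH):
    """Return the current OOM score adjustment as an int."""
    with open(path) as f:
        return int(f.read().strip())


def write_oom_score_adj(value, path=SELF_OOM_PATH):
    """Set the OOM score adjustment after validating the range."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    with open(path, "w") as f:
        f.write(f"{value}\n")


def spawn_unprotected(argv, child_score=0):
    """Start a child that does NOT inherit the parent's protection.

    The hook runs in the child between fork and exec, so it changes
    only the child's own score. Raising the score needs no privileges.
    """
    def reset_score():
        write_oom_score_adj(child_score)

    return subprocess.Popen(argv, preexec_fn=reset_score)
```

A protected daemon would call `write_oom_score_adj(-1000)` once at startup (as root) and then launch user sessions through something like `spawn_unprotected(["/bin/sh"])`, so that a runaway user process can still be selected by the OOM killer.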

CuriousDolphin commented 2 months ago

I have the same problem: when the machine runs out of RAM, the ssm agent is killed and the only way to recover is to restart it. Is there any news?

gianniLesl commented 1 month ago

What OS is this on? The amazon-ssm-agent process should always restart if it crashes or is killed.

erpel commented 1 month ago

We've seen this on up-to-date versions of Amazon Linux 2023. Under high load, restarting does not seem to work reliably or takes a very long time, which makes instances hard to reach.

Manually adjusting the OOM killer score in the systemd unit file (using an override for example) does help, so I feel that ssm-agent setting this automatically on the important process(es) is a good solution.
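
For anyone who wants to try the workaround, a drop-in override along these lines should do it. This is a sketch that assumes the unit is named `amazon-ssm-agent.service`; the name may differ on other distributions or with snap installs, so check with `systemctl status` first. `OOMScoreAdjust=` is a standard systemd `[Service]` directive.

```ini
# /etc/systemd/system/amazon-ssm-agent.service.d/override.conf
# (path assumes the unit name amazon-ssm-agent.service)
[Service]
# -1000 exempts the service's processes from the OOM killer
OOMScoreAdjust=-1000
```

The file can be created with `sudo systemctl edit amazon-ssm-agent`; after that, run `sudo systemctl daemon-reload` and `sudo systemctl restart amazon-ssm-agent`. One caveat: the value is inherited by child processes unless they reset it, so session shells started by the agent may end up protected as well. That is the concern raised in the Red Hat bug linked above.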