Open erpel opened 3 months ago
I have the same problem: when the machine runs out of RAM, the SSM agent is killed and the only fix is to restart it. Is there any news?
What OS is this on? The amazon-ssm-agent process should always restart if it crashes or is killed.
We've seen this on up-to-date versions of Amazon Linux 2023. Under high load, the restart does not seem to work reliably or takes a very long time, which makes instances hard to reach.
Manually adjusting the OOM killer score in the systemd unit file (using an override, for example) does help, so I think having ssm-agent set this automatically on its important process(es) would be a good solution.
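For anyone who wants to try the override mentioned above, here is a minimal sketch of what such a drop-in could contain. `OOMScoreAdjust=` is a standard systemd setting; the unit name `amazon-ssm-agent.service`, the drop-in file name `oom.conf`, and the helper `render_oom_dropin` are assumptions made for illustration, so check the actual unit name on your instance first.

```python
# Assumed drop-in location; the unit name and file name are illustrative.
DROPIN_PATH = "/etc/systemd/system/amazon-ssm-agent.service.d/oom.conf"


def render_oom_dropin(score=-1000):
    """Return the text of a systemd drop-in that sets OOMScoreAdjust.

    Hypothetical helper: -1000 exempts the service from the OOM killer,
    which is the value sshd is described as using in this issue.
    """
    if not -1000 <= score <= 1000:
        raise ValueError("OOMScoreAdjust must be between -1000 and 1000")
    return f"[Service]\nOOMScoreAdjust={score}\n"


if __name__ == "__main__":
    # Print the file so it can be reviewed before being installed by hand.
    print(f"# {DROPIN_PATH}")
    print(render_oom_dropin(), end="")
```

After placing the file, `systemctl daemon-reload` followed by a restart of the unit should apply it. Keep in mind that child processes inherit the adjustment unless it is reset, so session shells started by the agent would likely be protected as well.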
I initially opened this against amazonlinux, but it makes more sense in this project:
When the system is under memory pressure, I've often seen ssm-agent get killed by the OOM killer. Once the agent is gone, we can no longer log in to observe what is happening, which makes the situation hard to debug.
I'd like ssm-agent to run with the same OOM killer protection that sshd applies to its own process (an OOM score adjustment of -1000).
The alternative would be to stop using SSM for login and switch to SSH, but that adds overhead for us: administering user accounts and SSH keys. SSM Session Manager is a useful feature that would really benefit from added effort to increase its stability.
This old bug https://bugzilla.redhat.com/show_bug.cgi?id=1010429#c0 contains some details about how it used to work with sshd - especially making sure that user processes spawned by the "protected" server don't inherit the strict protection of oom_score_adj -1000.
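To illustrate the behaviour described in that bug, here is a rough sketch of the pattern: the protected parent keeps its own adjustment, and each child resets the value between fork and exec. It is written in Python only for brevity and is not how ssm-agent or sshd are implemented. The function names are hypothetical; `/proc/self/oom_score_adj` is the standard Linux procfs file. Raising the score, as the child does here, needs no special privileges, while lowering it generally requires `CAP_SYS_RESOURCE`.

```python
import subprocess
from pathlib import Path

# Linux-only procfs file holding the calling process's OOM adjustment.
OOM_SCORE_ADJ_PATH = Path("/proc/self/oom_score_adj")


def read_oom_score_adj(path=OOM_SCORE_ADJ_PATH):
    """Return the adjustment stored at `path` as an integer."""
    return int(Path(path).read_text().strip())


def write_oom_score_adj(value, path=OOM_SCORE_ADJ_PATH):
    """Write `value` to `path` after checking the kernel's valid range."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    Path(path).write_text(f"{value}\n")


def spawn_unprotected(argv, child_score=0):
    """Start `argv` with oom_score_adj reset to `child_score`.

    The reset runs in the child after fork and before exec, so the
    parent's own (possibly -1000) adjustment is left untouched.
    """
    def reset_score():
        write_oom_score_adj(child_score)

    return subprocess.Popen(argv, preexec_fn=reset_score)


if __name__ == "__main__":
    print("parent:", read_oom_score_adj())
    # The child prints its own adjustment, which should be the reset value.
    spawn_unprotected(["cat", "/proc/self/oom_score_adj"]).wait()
```

With something like this in place, only the long-running agent process would be exempt from the OOM killer, while shells and commands started for a session remain killable.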