aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

ssm-document-worker timedout : ERROR watcher too many open files #330

Closed: copolycube closed this issue 3 years ago

copolycube commented 3 years ago

Hello,

I'm using the ssm agent to apply the AWS-RunPatchBaseline document, but I faced the following errors during last weekend's maintenance window execution. This only concerns Ubuntu 14.04 servers, and the affected instances did not even write outputs to the configured S3 bucket.

I'm not 100% sure of the root cause of this issue, and any insights would be helpful.

I applied the same document to Ubuntu 16 servers at the same time, but only the Ubuntu 14 servers raised this error (the same error on each server).

2020-11-28 05:01:39 ERROR [NewFileWatcherChannel @ filechannel.go.78] [ssm-document-worker] [7217491f-7cd4-4b9d-8506-c1dd4eae7e27] filewatcher listener encountered error when start watcher: too many open files
2020-11-28 05:01:39 ERROR [main @ main.go.72] [ssm-document-worker] [7217491f-7cd4-4b9d-8506-c1dd4eae7e27] failed to create channel: too many open files
2020-11-28 05:01:42 ERROR [Messaging @ messaging.go.92] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [7217491f-7cd4-4b9d-8506-c1dd4eae7e27] received timedout signal!
2020-11-28 05:01:42 ERROR [messaging @ master.go.114] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [7217491f-7cd4-4b9d-8506-c1dd4eae7e27] messaging worker encountered error: document worker timed out
2020-11-28 05:30:50 ERROR [NewFileWatcherChannel @ filechannel.go.78] [ssm-document-worker] [8494e239-21ea-479c-8973-87b3da3d4c2c.2020-11-28T04-30-50.029Z] filewatcher listener encountered error when start watcher: too many open files
2020-11-28 05:30:50 ERROR [main @ main.go.72] [ssm-document-worker] [8494e239-21ea-479c-8973-87b3da3d4c2c.2020-11-28T04-30-50.029Z] failed to create channel: too many open files
2020-11-28 05:30:53 ERROR [Messaging @ messaging.go.92] [MessagingDeliveryService] [Association] [EngineProcessor] [OutOfProcExecuter] [8494e239-21ea-479c-8973-87b3da3d4c2c.2020-11-28T04-30-50.029Z] received timedout signal!
2020-11-28 05:30:53 ERROR [messaging @ master.go.114] [MessagingDeliveryService] [Association] [EngineProcessor] [OutOfProcExecuter] [8494e239-21ea-479c-8973-87b3da3d4c2c.2020-11-28T04-30-50.029Z] messaging worker encountered error: document worker timed out

But there don't seem to be any limits set on the affected servers; I get "unlimited" from both ulimit -S and ulimit -H:

$ ulimit -S
unlimited
$ ulimit -H
unlimited
$ sudo status amazon-ssm-agent
amazon-ssm-agent start/running, process 22263
$ cat /proc/22263/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             62104                62104                processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       62104                62104                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
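
For reference, a quick way to see how many descriptors the agent process itself currently holds (assuming the same PID 22263 reported above) is to count the entries under /proc/<pid>/fd:

$ sudo ls /proc/22263/fd | wc -l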

I tried to identify similar existing issues and found the following, but the logs are not exactly the same, and there doesn't seem to be a definitive answer...

https://github.com/aws/amazon-ssm-agent/issues/265

From my understanding this seemed linked to the ssm worker, so I thought this was the proper place to ask for help.

Thanks in advance for any help in debugging this.

Thor-Bjorgvinsson commented 3 years ago

Hey copolycube, thanks for reaching out to us. I have seen a similar issue before with Ubuntu 14; the issue then was that a pre-installed SQL server was taking up a large share of the system limit for open files.

Could you take a look at the lsof command (man lsof) and see if you can query how many files are open on your Ubuntu 14.04 system, which process has the most files open, and how many the agent process actually has open?
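
For example, something along these lines should give the overall count and the heaviest consumers (a rough sketch; lsof prints one row per process/file pair, so treat the numbers as approximate):

sudo lsof | wc -l
sudo lsof | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn | head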

copolycube commented 3 years ago

Thanks Thor, I'm going to have a look at the usage of mysql or pgsql on those instances!

I also found out that I actually had issues on those instances with the automated update of the ssm agent, and I had forgotten to publish logs for those automations, hence not trivial to debug. Edit: that was actually not a good lead; even the ssm agent update fails with the same error.

$ lsof | grep mysql | wc -l
156
$ lsof | wc -l
2712
$  ulimit -Sa
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62104
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 62104
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Thor-Bjorgvinsson commented 3 years ago

You can use the commands below to print the total number of file handles open for a process; these examples do it for the ssm agent and mysqld.

sudo lsof | awk '/^mysqld/{print $NF}' | sort | uniq -c | awk '{ sum += $1; } END {print sum; }'
sudo lsof | awk '/^ssm-agent/{print $NF}' | sort | uniq -c | awk '{ sum += $1; } END {print sum; }'

If the mysqld count is very high and you are not using it, see if turning it off resolves your issue. I would be interested to know what the usage of both of these is on your system.

copolycube commented 3 years ago

Thanks a lot for your commands Thor!

$ sudo lsof | awk '/^amazon-ss/{print $NF}' | sort | uniq -c | awk '{ sum += $1; } END {print sum; }'
240
$ sudo lsof | awk '/^mysqld/{print $NF}' | sort | uniq -c | awk '{ sum += $1; } END {print sum; }'
16236

But this kind of makes sense, since I've got a web application using mysql running on this server and the ssm-agent is inactive at the moment. (Note: ^ssm-agent was not returning anything, hence the modification to ^amazon-ss, which matches what I observed with lsof.)

Thor-Bjorgvinsson commented 3 years ago

Thanks for posting these findings; the number of files the agent is using is under the per-process system limit. Could you attempt to increase the hard/soft limits on the system and see if that resolves your problem?
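
For example (a sketch only, assuming the agent's upstart job lives at /etc/init/amazon-ssm-agent.conf on Ubuntu 14.04), adding a limit stanza to the job and restarting should raise the per-process limit:

# in /etc/init/amazon-ssm-agent.conf
limit nofile 65536 65536

sudo service amazon-ssm-agent restart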

Closing the issue as this is an OS issue; feel free to reopen if you disagree.

torson commented 1 year ago

I've had the same issue with the latest Snap package, amazon-ssm-agent 3.1.1732.0, on Ubuntu 20.04. A service restart didn't help.

The solution is to increase the file descriptor limit for the amazon-ssm-agent service. Create a file /etc/systemd/system/snap.amazon-ssm-agent.amazon-ssm-agent.service.d/filelimit.conf (the filename can be anything):

sudo mkdir -p /etc/systemd/system/snap.amazon-ssm-agent.amazon-ssm-agent.service.d/
sudo tee /etc/systemd/system/snap.amazon-ssm-agent.amazon-ssm-agent.service.d/filelimit.conf <<EOF
[Service]
LimitNOFILE=65536
EOF

Then reload systemd (so it picks up the drop-in) and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart snap.amazon-ssm-agent.amazon-ssm-agent.service

This file will persist across Snap package upgrades.
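
To verify the override is active (a quick check, assuming the same unit name as above):

systemctl show snap.amazon-ssm-agent.amazon-ssm-agent.service -p LimitNOFILE
grep "Max open files" /proc/$(pidof amazon-ssm-agent)/limits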

torson commented 1 year ago

It turned out another configuration tweak was needed:

echo fs.inotify.max_user_watches=524288 >> /etc/sysctl.conf
echo fs.inotify.max_user_instances=256 >> /etc/sysctl.conf
sysctl -p

This then ultimately solved the issue.
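
To confirm the new kernel settings are in effect (a quick check):

sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances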