aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0
1.04k stars 323 forks source link

AWS-RunPatchBaseline - mitigating "[OutOfProcExecuter] IPC file not readable: tmp" #346

Closed copolycube closed 3 years ago

copolycube commented 3 years ago

Hello,

I am facing some issue with the SSM maintenance windows I configured and am progressively rolling out over a fleet of several hundreds of servers. I tried to segregate the issues as I encounter them and will hence focus on a single one I observed and I'm not able to find informations about it online so far.

All my maintenance windows are configured so that /tmp be fully writable, and I don't have space issue :

$ ll / | grep tmp
drwxrwxrwt   7 root root       4096 Jan 19 22:12 tmp/
$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1   12G  8.7G  2.5G  78% /

But I observe recurrently the following message (extended extract bellow) in the logs generated locally in /var/log/amazon/ssm/amazon-ssm-agent.log :

 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] inter process communication started
 ERROR [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] received timedout signal!
 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter]  channel /var/lib/amazon/ssm/i-<ec2id_removed>/channels/<cmdId> requested close

Would anyone have hints on how I should I interpret this, and what resolution could I apply ? What mitigation would you suggest ?

Thanks in advance, any help or suggestion would be greatly appreciated !

#sudo grep -RH 39799423-c166-4532-953f-ae9db314dbac /var/log/amazon/ssm

/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:32 INFO [ssm-agent-worker] [MessagingDeliveryService] [messageID=aws.ssm.39799423-c166-4532-953f-ae9db314dbac.i-<ec2id_removed>] Validating SSM parameters
/var/log/amazon/ssm/amazon-ssm-agent.log.32:  MessageId: "aws.ssm.39799423-c166-4532-953f-ae9db314dbac.i-<ec2id_removed>",
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 INFO parsing args: [/usr/bin/ssm-document-worker 39799423-c166-4532-953f-ae9db314dbac i-<ec2id_removed>]
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 INFO using channelName 39799423-c166-4532-953f-ae9db314dbac, instanceID: i-<ec2id_removed>
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 INFO [ssm-document-worker] [39799423-c166-4532-953f-ae9db314dbac] document: 39799423-c166-4532-953f-ae9db314dbac worker started
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 INFO [ssm-document-worker] [39799423-c166-4532-953f-ae9db314dbac] channel: 39799423-c166-4532-953f-ae9db314dbac found
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 ERROR [ssm-document-worker] [39799423-c166-4532-953f-ae9db314dbac] filewatcher listener encountered error when start watcher: too many open files
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 ERROR [ssm-document-worker] [39799423-c166-4532-953f-ae9db314dbac] failed to create channel: too many open files
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] channel: 39799423-c166-4532-953f-ae9db314dbac not found, creating a new file channel...
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 WARN [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] IPC file not readable: tmp
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:33 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] inter process communication started
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 ERROR [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] received timedout signal!
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] channel /var/lib/amazon/ssm/i-<ec2id_removed>/channels/39799423-c166-4532-953f-ae9db314dbac requested close
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] channel /var/lib/amazon/ssm/i-<ec2id_removed>/channels/39799423-c166-4532-953f-ae9db314dbac closed
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] ipc channel closed, stop messaging worker
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 ERROR [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] messaging worker encountered error: document worker timed out
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] document failed half way, sending fail message...
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] sending document: 39799423-c166-4532-953f-ae9db314dbac complete response
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] command: aws.ssm.39799423-c166-4532-953f-ae9db314dbac.i-<ec2id_removed> complete
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] Executer closed
/var/log/amazon/ssm/amazon-ssm-agent.log.32:2021-01-16 05:00:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] execution of aws.ssm.39799423-c166-4532-953f-ae9db314dbac.i-<ec2id_removed> is over. Removing interimState from current folder
/var/log/amazon/ssm/amazon-ssm-agent.log.32:  MessageId: "aws.ssm.39799423-c166-4532-953f-ae9db314dbac.i-<ec2id_removed>",
/var/log/amazon/ssm/errors.log:2021-01-16 05:00:33 ERROR [NewFileWatcherChannel @ filewatcherchannel.go.88] [ssm-document-worker] [39799423-c166-4532-953f-ae9db314dbac] filewatcher listener encountered error when start watcher: too many open files
/var/log/amazon/ssm/errors.log:2021-01-16 05:00:33 ERROR [main @ main.go.75] [ssm-document-worker] [39799423-c166-4532-953f-ae9db314dbac] failed to create channel: too many open files
/var/log/amazon/ssm/errors.log:2021-01-16 05:00:36 ERROR [New @ errors.go.59] [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] received timedout signal!
/var/log/amazon/ssm/errors.log:2021-01-16 05:00:36 ERROR [messaging @ master.go.118] [ssm-agent-worker] [MessagingDeliveryService] [EngineProcessor] [OutOfProcExecuter] [39799423-c166-4532-953f-ae9db314dbac] messaging worker encountered error: document worker timed out
Thor-Bjorgvinsson commented 3 years ago

IPC file not readable is a confusing error message and we have reduced the severity of the log line to debug: https://github.com/aws/amazon-ssm-agent/commit/bf3772897eb3d74d277c41c88f7a9593d99934c6

Too many files open is a separate issue where the agent reaches a OS limit of number of file handles allowed. We recently made fixes for where the logger was not cleaning up file handles correctly and was released in v3.0.529.0 https://github.com/aws/amazon-ssm-agent/commit/7d282d23e14cb652d0c80b5c989a7af227b4be48

If you still encounter issues with 'too many open files' error, please take a look at this previous ticket https://github.com/aws/amazon-ssm-agent/issues/330. Feel free to open another ticket if you still encounter issues with your setup.

copolycube commented 3 years ago

Hello @Thor-Bjorgvinsson and thanks a lot for your answer. I will make sure that all the servers ssm agents are upgraded (I also saw this ipc error in the auto-upgrade tasks so I fear that it's not upgraded everywhere consistently yet).