ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

No meaningful error message when fs.inotify.max_user_watches is depleted. #14334

Open fkuep opened 11 months ago

fkuep commented 11 months ago

Please confirm the following

Bug Summary

When inotify file watches are depleted, jobs fail without any STDOUT or STDERR output, and I am left with nothing but an error level. Users regularly stumble upon the solution only after hours of investigation, when they tail the logs of the AWX containers, e.g.: https://github.com/ansible/awx/issues/10366#issuecomment-904627536 I need better feedback on AWX's health in these situations. It would be fantastic to be alerted to the situation, since most modern log shippers (e.g. Elastic) will probably be failing at the same time.

AWX version

22.5.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

awx ee latest

Operating system

linux

Web browser

No response

Steps to reproduce

echo 500 > /proc/sys/fs/inotify/max_user_watches
run a job template

Expected results

Some notification about a problem watching files and directories, or about the depletion of inotify watches.

Actual results

Jobs fail with an error level and not much else.

Additional information

No response

fosterseth commented 10 months ago

@chadmf how feasible would it be to integrate k8s node level error messaging into awx? seems this kind of problem might need to be solved outside of awx

@fkuep this might be a good question for the mailing list and see if others have come up with some monitoring solution for detecting max_user_watches depletion

https://groups.google.com/u/1/g/awx-project

fkuep commented 10 months ago

@fosterseth @chadmf Are you aware that we have indications that AWX is losing stderr in a scenario that is common during evaluation?

fkuep commented 10 months ago

@fosterseth on "monitoring solution for detecting max_user_watches depletion": I don't want to count the watches and compare the percentage in real time, since that might be costly in itself, and furthermore I think it is out of scope.
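For readers who do want to try the counting approach, here is a rough Python sketch of what it could look like. It is not part of AWX. It assumes a Linux /proc layout in which each inotify watch appears as a line starting with "inotify wd:" in /proc/<pid>/fdinfo/<fd>, and it attributes watches to the owner of the /proc/<pid> directory. The function names are made up for illustration. The result is only an approximation: an inotify instance inherited across a fork is counted once per process, and processes that cannot be read are skipped.

```python
from pathlib import Path


def count_watches_in_fdinfo(text: str) -> int:
    """Count inotify watches listed in the text of one fdinfo file."""
    return sum(1 for line in text.splitlines() if line.startswith("inotify wd:"))


def read_limit(proc_root: Path = Path("/proc")) -> int:
    """Read the per-user watch limit (fs.inotify.max_user_watches)."""
    path = proc_root / "sys" / "fs" / "inotify" / "max_user_watches"
    return int(path.read_text().strip())


def watches_per_uid(proc_root: Path = Path("/proc")) -> dict:
    """Approximate the number of inotify watches held per user id.

    Walks every numeric directory under proc_root, which is the costly
    part on nodes running many processes.
    """
    totals = {}
    for pid_dir in proc_root.iterdir():
        if not pid_dir.name.isdigit():
            continue  # not a process directory
        try:
            uid = pid_dir.stat().st_uid
            for fd_file in (pid_dir / "fdinfo").iterdir():
                count = count_watches_in_fdinfo(fd_file.read_text())
                if count:
                    totals[uid] = totals.get(uid, 0) + count
        except OSError:
            # The process exited or we lack permission; skip it.
            continue
    return totals
```

Dividing each value returned by watches_per_uid() by read_limit() gives a usage ratio per user. The full walk over /proc on every check is exactly the cost concern raised above.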

What I want to do now for myself is set up an alert on syslog messages like the following (this one was from /var/log/daemon.log, but it could be in syslog):

Aug  9 10:52:45 kubelet-worker-02 containerd[816]: time="2023-08-09T10:52:45.331075984+02:00" level=warning msg="error from *cgroupsv2.Manager.EventChan" error="failed to create inotify fd"

If this is going to be the solution, I think it would greatly help to inform users in the installation documentation that AWX makes heavy use of this resource, and to suggest watching for this type of error, because AWX cannot detect it itself.
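The log-based alert described in this comment could be sketched in Python roughly as follows. The first pattern is taken from the containerd warning quoted above. The second pattern is an assumption about how other daemons might report inotify exhaustion and would need to be checked against real logs. The name find_inotify_errors is hypothetical, and hooking the matches up to an actual alerting system is left out.

```python
import re
from typing import Iterable, Iterator

INOTIFY_ERROR_PATTERNS = [
    # Wording from the containerd warning quoted in this thread.
    re.compile(r"failed to create inotify fd"),
    # Assumption: other daemons may mention inotify together with the
    # usual errno texts; verify against your own logs before relying on it.
    re.compile(r"inotify.*(no space left on device|too many open files)",
               re.IGNORECASE),
]


def find_inotify_errors(lines: Iterable[str]) -> Iterator[str]:
    """Yield every log line that looks like an inotify exhaustion error."""
    for line in lines:
        if any(pattern.search(line) for pattern in INOTIFY_ERROR_PATTERNS):
            yield line.rstrip("\n")
```

It accepts any iterable of lines, so it can be fed an open log file such as /var/log/daemon.log or the output of a log shipper, and each yielded line can then be turned into an alert.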