Open fkuep opened 1 year ago
@chadmf how feasible would it be to integrate k8s node level error messaging into awx? seems this kind of problem might need to be solved outside of awx
@fkuep this might be a good question for the mailing list and see if others have come up with some monitoring solution for detecting max_user_watches depletion
@fosterseth @chadmf Are you aware that we have indications that awx is losing stderr in a scenario that is common during evaluation?
@fosterseth on "monitoring solution for detecting max_user_watches depletion": I don't want to count the watches and compare the percentage in realtime, since that might be costly in itself, and furthermore I think it is out of scope.
What I want to do now for myself is set up an alert on syslog messages like this one (this was from /var/log/damon.log, could be in syslog):
Aug 9 10:52:45 kubelet-worker-02 containerd[816]: time="2023-08-09T10:52:45.331075984+02:00" level=warning msg="error from *cgroupsv2.Manager.EventChan" error="failed to create inotify fd"
If this is going to be the solution, I think it would greatly help to tell users in the installation documentation that awx makes heavy use of this resource, and to suggest watching for this type of error, because awx cannot detect it itself.
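A minimal sketch of such an alert filter, assuming it is enough to match the text of the containerd warning quoted above. The function name `find_inotify_failures` is hypothetical, and other components may word the same failure differently:

```python
import re
from typing import Iterable, Iterator

# Pattern copied from the containerd warning quoted in this comment.
# Assumption: matching this substring is sufficient; other daemons may
# report inotify exhaustion with different wording.
INOTIFY_ERROR = re.compile(r"failed to create inotify fd")


def find_inotify_failures(lines: Iterable[str]) -> Iterator[str]:
    """Yield every log line that reports a failed inotify fd creation."""
    for line in lines:
        if INOTIFY_ERROR.search(line):
            yield line.rstrip("\n")
```

It could be fed an open log file, for example `for hit in find_inotify_failures(open(path)): ...`, and the hits forwarded to whatever alerting channel is already in place. This only reacts after the limit has been hit; it does not count watches in advance.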
Bug Summary
When inotify file watches are depleted, jobs fail without any stdout or stderr output and I am left with only an error level. Users regularly stumble upon the solution only after hours of investigation, when they try to tail the logs of the awx containers, e.g.: https://github.com/ansible/awx/issues/10366#issuecomment-904627536 I would need better feedback on awx's health in these situations. It would be fantastic to be alerted to the situation, since most modern log pullers (e.g. elastic) will probably be failing at the same time.
AWX version
22.5.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
awx ee latest
Operating system
linux
Web browser
No response
Steps to reproduce
echo 500 > /proc/sys/fs/inotify/max_user_watches
run a job template
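To confirm which limit is actually in effect on a node before and after the step above, the value can be read back from the same procfs file. This is a rough sketch: the helper names and the `minimum` threshold of 8192 are illustrative assumptions, not values recommended by awx.

```python
from pathlib import Path

# Same procfs file that the reproduction step writes to.
MAX_USER_WATCHES = Path("/proc/sys/fs/inotify/max_user_watches")


def read_watch_limit(path: Path = MAX_USER_WATCHES) -> int:
    """Return the per-user inotify watch limit stored in the given file."""
    return int(path.read_text().strip())


def limit_is_low(limit: int, minimum: int = 8192) -> bool:
    """Report whether the limit is below a chosen minimum.

    The default of 8192 is an arbitrary example threshold (assumption).
    """
    return limit < minimum
```

On a node prepared as in the reproduction step, `limit_is_low(read_watch_limit())` would flag the reduced limit of 500.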
Expected results
Some notification about a problem watching files and directories, or about the depletion of inotify watches.
Actual results
Jobs fail with an error level and not much else.
Additional information
No response