Open anxstj opened 1 year ago
@anxstj thanks for opening the ticket! My hunch is that AWX is trying to release or cancel old receptor work units somewhere (i.e. in the reaper code). This needs some investigation.
I just found out that old podman instances are not cleaned up successfully. They stay as zombies on the system:
ps faux
...
1000 7586 0.9 0.3 807216 57784 ? Ssl Sep27 270:41 \_ receptor --config /etc/receptor/receptor.conf
1000 8004 0.0 0.0 0 0 ? Z Sep27 0:00 \_ [podman] <defunct>
1000 8009 0.0 0.0 1088 0 ? S Sep27 0:00 \_ catatonit -P
1000 8669 0.0 0.0 0 0 ? Z Sep27 0:00 \_ [slirp4netns] <defunct>
1000 8691 0.0 0.0 0 0 ? Zs Sep27 0:09 \_ [fuse-overlayfs] <defunct>
1000 8699 0.0 0.0 0 0 ? Zs Sep27 0:00 \_ [conmon] <defunct>
In the long run, this will cause trouble, e.g. the systemd TasksMax limit will be reached:
cgroup: fork rejected by pids controller in /system.slice/...
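For anyone who wants to watch how many defunct children pile up before the pids limit is hit, here is a minimal sketch. It assumes a Linux host with procps `ps` and parses the output of `ps -eo pid,ppid,stat,comm`. The helper names (`parse_zombies`, `zombies_per_parent`, `current_zombies`) are made up for this example and are not part of receptor or AWX.

```python
import subprocess
from collections import Counter


def parse_zombies(ps_output):
    """Return (pid, ppid, command) for every process whose STAT starts with 'Z'.

    Expects text in the format produced by `ps -eo pid,ppid,stat,comm`.
    """
    zombies = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        # Skip the header line and anything that is not a full process row.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        pid, ppid, stat, comm = parts
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def zombies_per_parent(zombies):
    """Count zombies grouped by parent PID."""
    return Counter(ppid for _, ppid, _ in zombies)


def current_zombies():
    """Run ps on this host and return its zombies (assumes procps `ps`)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_zombies(result.stdout)
```

Calling `zombies_per_parent(current_zombies())` on an execution node gives a per-parent count, which makes it easier to see whether the defunct processes hang off the receptor process or off a container init such as `catatonit`.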
Could this be related to https://github.com/ansible/receptor/issues/439 ? (Just an uneducated guess)
FTR: the receptor container had a wrong entrypoint that prevented the container from being cleaned up.
I am running awx-operator:2.6.0 and am facing the same issue while setting up executors on a VM. Is there any workaround for it? @all, please help.
Any update on this?
I was able to fix this. My executor was running behind a firewall, and podman was not able to fetch the image from the quay.io registry. Either launch your container from an image that is available in your environment, or make sure your executor can reach the quay.io registry. This issue can be closed.
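A quick way to rule the firewall in or out from the executor is a plain TCP connect test. This is only a sketch: `can_reach` is a hypothetical helper, port 443 is an assumption about how the registry is reached, and a successful connect does not prove that a proxy, TLS inspection, or registry authentication will let the image pull through.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Running `can_reach("quay.io")` on the execution node and getting `False` would point at the network path rather than at receptor itself.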
Any update on this?
Yes, this is a feature of AWX. This issue can be closed.
A response has been added to the conversation, and the issue can be closed.
Please confirm the following
Bug Summary
My receptor services on my execution nodes show the following errors:
It seems that it shows up whenever a job finishes. The jobs are working, though. And AWX doesn't show any additional error messages.
What could cause this? And how can I debug it?
I'm running AWX 21.5.0 and receptor 1.2.0+g72a97e5
Receptor is installed with the AWX image:
Dockerfile:
Makefile:
AWX version
21.5.0
Select the relevant components
Installation method
docker development environment
Modifications
no
Ansible version
2.12.2
Operating system
Debian 11
Web browser
Firefox
Steps to reproduce
Create a setup with two controller nodes and two execution nodes. Then execute a job on one of the execution nodes. The job should succeed, but receptor will log an error message similar to the one mentioned above when the job ends.
Expected results
No error message.
Actual results
Additional information
No response