ansible / receptor

Project Receptor is a flexible multi-service relayer with remote execution and orchestration capabilities linking controllers with executors across a mesh of nodes.

receptor: Error locating unit #758

Open anxstj opened 1 year ago

anxstj commented 1 year ago

Bug Summary

My receptor services on my execution nodes show the following errors:

ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ
ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ

The errors seem to appear whenever a job finishes. The jobs themselves succeed, though, and AWX doesn't show any additional error messages.

What could cause this? And how can I debug it?
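As a starting point for debugging, the two log lines above can be grouped by work unit ID so that each ID can be compared against what the node still tracks (for example with `receptorctl work list`, assuming that command is available in your install). The sketch below is only an illustration: the regular expression is derived from the two messages quoted in this issue, and the `missing_units` helper is a hypothetical name, not part of receptor.

```python
import re
from collections import Counter

# Assumption: log lines look exactly like the two messages quoted in this issue.
UNIT_ERROR = re.compile(
    r"^ERROR (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:Error locating unit: |unknown work unit )(?P<unit>\S+)$"
)


def missing_units(lines):
    """Count how often each work unit ID appears in 'unit not found' errors."""
    counts = Counter()
    for line in lines:
        match = UNIT_ERROR.match(line.strip())
        if match:
            counts[match.group("unit")] += 1
    return counts


sample = [
    "ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ",
    "ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ",
]
print(missing_units(sample))
```

If the same ID shows up once per finished job, that would be consistent with something asking receptor about a unit that has already been released.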

I'm running AWX 21.5.0 and receptor 1.2.0+g72a97e5

Receptor is installed with the AWX image:

Dockerfile:

COPY --from={{ receptor_image }} /usr/bin/receptor /usr/bin/receptor

Makefile:

RECEPTOR_IMAGE ?= quay.io/ansible/receptor:devel

AWX version

21.5.0

Installation method

docker development environment

Modifications

no

Ansible version

2.12.2

Operating system

Debian 11

Web browser

Firefox

Steps to reproduce

Create a setup with two controller nodes and two execution nodes, then run a job on one of the execution nodes. The job should succeed, but receptor will log an error like the one shown above when the job ends.

Expected results

No error message.

Actual results

ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ
ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ

Additional information

No response

fosterseth commented 1 year ago

@anxstj thanks for opening the ticket! My hunch is that AWX is trying to release or cancel old receptor work units somewhere (i.e. the reaper code). This needs some investigation.

anxstj commented 1 year ago

I just found out that old podman instances are not cleaned up successfully. They stay as zombies on the system:

ps faux
...
1000        7586  0.9  0.3 807216 57784 ?        Ssl  Sep27 270:41  \_ receptor --config /etc/receptor/receptor.conf
1000        8004  0.0  0.0      0     0 ?        Z    Sep27   0:00      \_ [podman] <defunct>
1000        8009  0.0  0.0   1088     0 ?        S    Sep27   0:00      \_ catatonit -P
1000        8669  0.0  0.0      0     0 ?        Z    Sep27   0:00      \_ [slirp4netns] <defunct>
1000        8691  0.0  0.0      0     0 ?        Zs   Sep27   0:09      \_ [fuse-overlayfs] <defunct>
1000        8699  0.0  0.0      0     0 ?        Zs   Sep27   0:00      \_ [conmon] <defunct>

In the long run, this will cause trouble, e.g. the systemd MaxTasks limit will be reached:

cgroup: fork rejected by pids controller in /system.slice/...
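To keep an eye on this before the limit is hit, the defunct children of the receptor process can be counted from `ps` output. The following is a minimal sketch, assuming the text comes from `ps -eo pid,ppid,stat,comm`; the `zombie_children` helper is a hypothetical name, and the PPID values in the sample are assumptions read off the tree layout above.

```python
def zombie_children(ps_text, parent_pid):
    """Return (pid, command) for every zombie whose parent is parent_pid.

    Expects text in the shape of `ps -eo pid,ppid,stat,comm`, header included.
    """
    zombies = []
    for line in ps_text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        # A STAT value starting with "Z" marks a defunct (zombie) process.
        if int(ppid) == parent_pid and stat.startswith("Z"):
            zombies.append((int(pid), comm))
    return zombies


# Illustrative sample; the PPID column is assumed, not taken from the report.
sample_ps = """
  PID  PPID STAT COMMAND
 7586     1 Ssl  receptor
 8004  7586 Z    podman <defunct>
 8009  7586 S    catatonit
 8669  7586 Z    slirp4netns <defunct>
"""
print(zombie_children(sample_ps, 7586))
```

A steadily growing list across finished jobs would point at children that are never reaped.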

Could this be related to https://github.com/ansible/receptor/issues/439 ? (Just an uneducated guess)

anxstj commented 1 year ago

Could this be related to #439 ? (Just an uneducated guess)

FTR: the receptor container had a wrong entrypoint, which prevented the container from being cleaned up.

golakiyaalice commented 10 months ago

I am running awx-operator:2.6.0 and facing the same issue while setting up executors on a VM. Is there any workaround for it? @all, please help.

vvarga007 commented 5 months ago

Any update on this?

golakiyaalice commented 5 months ago

I was able to fix this. My executor was running behind a firewall, and podman was not able to fetch the image from the quay.io registry. Either launch your container from an image that is available in your environment, or make sure your executor can reach the quay.io repositories. This issue can be closed.
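A quick way to test the firewall theory from the execution node is a plain TCP connection attempt to the registry. This is only a rough sketch: `can_reach` is a hypothetical helper, port 443 is an assumption, and a successful connect says nothing about proxies, TLS interception or registry authentication.

```python
import socket


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Running `can_reach("quay.io")` on the executor and getting `False` would suggest that podman cannot pull the image either.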

golakiyaalice commented 5 months ago

Any update on this?

Yes, this is a feature of AWX. This issue can be closed.

golakiyaalice commented 5 months ago

I added a response in the conversation; this issue can be closed.
