ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

AWX can't delete worker pods that have already finished running #14107

Closed elibogomolnyi closed 1 year ago

elibogomolnyi commented 1 year ago

Please confirm the following

Bug Summary

We recently upgraded our AWX version from 21.10.2 to 22.3.0. We are running it on an EKS 1.24 cluster.

After the upgrade, some AWX worker pods are stuck in the "NotReady" state. The number of these pods grows gradually until the cluster fills up with pods that are never automatically deleted. In AWX's UI everything looks fine; I can't see any stuck jobs there. It happens only sometimes; usually the pods are deleted successfully after the job finishes running.

This is what we see in the logs of the Control Plane EE (in this case, the worker pod named automation-job-6991781-64z2g is a pod that cannot be deleted for some reason):

ERROR 2023/06/11 08:46:56 [p8SNzi65] Error reading from pod awx-workers/automation-job-6991781-64z2g: context canceled
ERROR 2023/06/11 08:46:56 Error deleting pod automation-job-6991781-64z2g: client rate limiter Wait returned an error: context canceled

The automation-job-6991781-64z2g pod contains two containers: worker and authenticator.

When the pod is stuck, we can see that the worker container has terminated while the authenticator container keeps running. This is what we can see in the worker container (the end of the logs):

{"status": "starting", "runner_ident": "6992145", "command": ["ansible-inventory", "--list", "--export", "-i", "/runner/inventory/aws_ec2.yml", "--output", "/runner/artifacts/6992145/output.json", "--playbook-dir", "/runner/inventory", "-vvv"], SOME CONFIDENTIAL INFO HIDDEN
{"status": "running", "runner_ident": "6992145"}
{"event": "verbose", "uuid": "4a9ce396-2648-402d-941e-82fdssdc2fc", "counter": 1, "stdout": "\u001b[0;34mansible-inventory [core 2.12.10]\u001b[0m", "start_line": 0, "end_line": 1, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631203"}
{"event": "verbose", "uuid": "ea8a1563-aac5-49c8-a4a9-14sdfs372f2", "counter": 2, "stdout": "\u001b[0;34m  config file = /home/runner/ansible.cfg\u001b[0m", "start_line": 1, "end_line": 2, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631394"}
{"event": "verbose", "uuid": "ae63bef6-abfe-4825-953b-6sdsdsc18eda8", "counter": 3, "stdout": "\u001b[0;34m  configured module search path = ['/home/runner/cyberark-ansible-modules/lib/ansible/modules', '/runner/project']\u001b[0m", "start_line": 2, "end_line": 3, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631483"}
{"event": "verbose", "uuid": "7b44bba2-9d7a-4349-ab4f-95csdf69bc5", "counter": 4, "stdout": "\u001b[0;34m  ansible python module location = /usr/local/lib/python3.8/site-packages/ansible\u001b[0m", "start_line": 3, "end_line": 4, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631570"}
{"event": "verbose", "uuid": "f950ff27-add9-4cbd-bf0b-43esdcs5aa39", "counter": 5, "stdout": "\u001b[0;34m  ansible collection location = /home/runner/.ansible/collections:/usr/share/ansible/collections:/usr/share/automation-controller/collections\u001b[0m", "start_line": 4, "end_line": 5, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631631"}
{"event": "verbose", "uuid": "45ad3fd5-772a-4062-be61-8c4fds163d4b", "counter": 6, "stdout": "\u001b[0;34m  executable location = /usr/local/bin/ansible-inventory\u001b[0m", "start_line": 5, "end_line": 6, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631701"}
{"event": "verbose", "uuid": "9955b23f-d927-4309-8827-76dcs21e5", "counter": 7, "stdout": "\u001b[0;34m  python version = 3.8.13 (default, Jun 24 2022, 15:27:57) [GCC 8.5.0 20210514 (Red Hat 8.5.0-13)]\u001b[0m", "start_line": 6, "end_line": 7, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631767"}
{"event": "verbose", "uuid": "56b033be-a9b9-496d-951f-b8f6sdcb13", "counter": 8, "stdout": "\u001b[0;34m  jinja version = 2.11.2\u001b[0m", "start_line": 7, "end_line": 8, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631837"}
{"event": "verbose", "uuid": "f53300f5-7e8e-4203-9dd8-d7sdca33fa", "counter": 9, "stdout": "\u001b[0;34m  libyaml = True\u001b[0m", "start_line": 8, "end_line": 9, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.631929"}
{"event": "verbose", "uuid": "086ebe47-5cfe-4196-bfad-672sdc0f32d", "counter": 10, "stdout": "\u001b[0;34mUsing /home/runner/ansible.cfg as config file\u001b[0m", "start_line": 9, "end_line": 10, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.632006"}
{"event": "verbose", "uuid": "420bf536-7f1f-4afc-9afc-4d3sdcbe9a", "counter": 11, "stdout": "\u001b[0;34mredirecting (type: inventory) ansible.builtin.aws_ec2 to amazon.aws.aws_ec2\u001b[0m", "start_line": 10, "end_line": 11, "runner_ident": "6992145", "created": "2023-06-11T09:15:19.642603"}
{"event": "verbose", "uuid": "28a48fb3-92fe-4fd3-883a-a5e8sdc3ff", "counter": 12, "stdout": "\u001b[0;34mParsed /runner/inventory/aws_ec2.yml inventory source with auto plugin\u001b[0m", "start_line": 11, "end_line": 12, "runner_ident": "6992145", "created": "2023-06-11T09:15:20.380571"}
{"status": "successful", "runner_ident": "6992145"}
{"zipfile": 2945}
UEsDBBQAAAAIAOlJy1ZOaLONG_STRING_HERE{"eof": true}

The authenticator container logs:

INFO:  2023/06/11 09:15:17.838185 main.go:19: CAKC048 Kubernetes Authenticator Client v0.25.0-1340de821a8 starting up...
INFO:  2023/06/11 09:15:17.838219 configuration_factory.go:80: CAKC070 Chosen "authn-k8s" configuration
INFO:  2023/06/11 09:15:17.838241 authenticator_factory.go:31: CAKC075 Chosen "authn-k8s" flow
INFO:  2023/06/11 09:15:17.864102 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO:  2023/06/11 09:15:18.693877 authenticator.go:116: CAKC035 Successfully authenticated
INFO:  2023/06/11 09:15:18.693900 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO:  2023/06/11 09:21:18.792953 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO:  2023/06/11 09:21:19.269918 authenticator.go:116: CAKC035 Successfully authenticated
INFO:  2023/06/11 09:21:19.269940 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO:  2023/06/11 09:27:19.296108 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO:  2023/06/11 09:27:19.791494 authenticator.go:116: CAKC035 Successfully authenticated
INFO:  2023/06/11 09:27:19.791517 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO:  2023/06/11 09:33:19.839317 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO:  2023/06/11 09:33:20.329509 authenticator.go:116: CAKC035 Successfully authenticated
INFO:  2023/06/11 09:33:20.329529 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO:  2023/06/11 09:39:20.329661 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO:  2023/06/11 09:39:20.893922 authenticator.go:116: CAKC035 Successfully authenticated
INFO:  2023/06/11 09:39:20.893944 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO:  2023/06/11 09:45:20.894067 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO:  2023/06/11 09:45:21.393847 authenticator.go:116: CAKC035 Successfully authenticated
INFO:  2023/06/11 09:45:21.393874 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate

We could suspect an authenticator issue, but the same workflows work correctly in the production environment, which is still running the older AWX version (21.10.2). If I delete these stuck worker pods manually, they are deleted successfully. All upgraded environments have this bug, so we had to add a cleanup workaround that deletes all NotReady pods as a temporary fix; a minimal sketch of this kind of cleanup is shown below.
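A minimal sketch of this kind of cleanup, using the official kubernetes Python client. The awx-workers namespace, the automation-job- pod name prefix, and the worker container name are taken from the logs above; the selection logic itself is only an assumption, not necessarily the exact workaround in use, and should be adapted to your own deployment:

# Minimal cleanup sketch (assumption: adjust namespace, pod name prefix, and
# container names to your own deployment before using anything like this).
from kubernetes import client, config

config.load_incluster_config()  # use config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "awx-workers"       # namespace seen in the control plane EE logs above
POD_PREFIX = "automation-job-"  # AWX job pods are named with this prefix

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if not pod.metadata.name.startswith(POD_PREFIX):
        continue
    statuses = {s.name: s for s in (pod.status.container_statuses or [])}
    worker = statuses.get("worker")
    # The stuck pattern from this issue: the worker container has terminated,
    # but the pod lingers because the authenticator container keeps running.
    if worker is not None and worker.state.terminated is not None:
        print(f"Deleting stuck pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)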

For now, this bug blocks us from upgrading the production environment. We will be glad to provide clarifications if needed.

UPDATE:

AWX version

22.3.0

Select the relevant components

Installation method

kubernetes operator

Modifications

no

Steps to reproduce

Run many AWX jobs based on a pod that contains the worker and authenticator containers.

Expected results

AWX deletes all the pods that finished running.

Actual results

Some pods get stuck in the NotReady state and are never deleted.

TheRealHaoLiu commented 1 year ago

This is a Receptor bug; we reverted a PR, and hopefully it's fixed.

TheRealHaoLiu commented 1 year ago

https://github.com/ansible/receptor/pull/787

elibogomolnyi commented 1 year ago

@TheRealHaoLiu, we used the newer awx-ee image and validated that the bug no longer occurs.

Thanks! Closing the issue.
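For anyone landing here on a pre-fix release, a hedged sketch of pointing the control plane at a newer awx-ee image by patching the AWX custom resource through the kubernetes Python client. The resource name and namespace (awx/awx) and the image tag are assumptions for illustration; adjust them to your deployment, and pick whichever awx-ee tag contains the Receptor fix:

# Hedged sketch: patch the AWX custom resource so the control plane uses a
# newer awx-ee image (control_plane_ee_image is the awx-operator spec field).
# Assumptions: the AWX resource is named "awx" in namespace "awx".
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

api.patch_namespaced_custom_object(
    group="awx.ansible.com",
    version="v1beta1",
    namespace="awx",   # assumption: your AWX namespace
    plural="awxs",
    name="awx",        # assumption: your AWX resource name
    body={"spec": {"control_plane_ee_image": "quay.io/ansible/awx-ee:latest"}},
)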