AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
AWX can't delete worker pods that have already finished running #14107
Please confirm the following
[X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
[X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)
Bug Summary
We recently upgraded our AWX version from 21.10.2 to 22.3.0. We are running it on an EKS 1.24 cluster.
After the upgrade, some AWX worker pods get stuck in the "NotReady" state. The number of these pods grows gradually until our cluster fills up with pods that are never automatically deleted. In AWX's UI, everything looks fine; I can't see any stuck jobs there. It only happens sometimes; usually the pods are deleted successfully after the job finishes running.
This is what we see in the logs of the Control Plane EE (in this case, the worker pod named automation-job-6991781-64z2g is one that cannot be deleted for some reason):
ERROR 2023/06/11 08:46:56 [p8SNzi65] Error reading from pod awx-workers/automation-job-6991781-64z2g: context canceled
ERROR 2023/06/11 08:46:56 Error deleting pod automation-job-6991781-64z2g: client rate limiter Wait returned an error: context canceled
The automation-job-6991781-64z2g pod contains two containers: worker and authenticator.
When the pod is stuck, we can see that the worker container is terminated while the authenticator container keeps running. These are the authenticator container logs (the ending); a quick way to confirm this state is sketched after the logs:
INFO: 2023/06/11 09:15:17.838185 main.go:19: CAKC048 Kubernetes Authenticator Client v0.25.0-1340de821a8 starting up...
INFO: 2023/06/11 09:15:17.838219 configuration_factory.go:80: CAKC070 Chosen "authn-k8s" configuration
INFO: 2023/06/11 09:15:17.838241 authenticator_factory.go:31: CAKC075 Chosen "authn-k8s" flow
INFO: 2023/06/11 09:15:17.864102 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:15:18.693877 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:15:18.693900 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:21:18.792953 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:21:19.269918 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:21:19.269940 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:27:19.296108 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:27:19.791494 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:27:19.791517 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:33:19.839317 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:33:20.329509 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:33:20.329529 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:39:20.329661 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:39:20.893922 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:39:20.893944 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:45:20.894067 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:45:21.393847 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:45:21.393874 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
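For anyone hitting the same symptom, here is a minimal sketch (using the official kubernetes Python client; the pod and namespace names are the ones from this report) that prints the per-container state of a stuck pod, so the terminated worker next to the still-running authenticator is visible:

```python
# Minimal sketch: print each container's state for one stuck pod.
# Pod/namespace names are from this report; adjust them for your cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod("automation-job-6991781-64z2g", "awx-workers")
for s in pod.status.container_statuses:
    if s.state.terminated:
        print(f"{s.name}: terminated (exit code {s.state.terminated.exit_code})")
    elif s.state.running:
        print(f"{s.name}: running since {s.state.running.started_at}")
    else:
        print(f"{s.name}: waiting")
```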
We could suspect an authenticator issue, but the same workflows work correctly in the production environment, which still runs the older AWX version (21.10.2). If I delete these stuck worker pods manually, they are deleted successfully. All upgraded environments have this bug, and as a temporary workaround we have to run a cleanup that deletes all NotReady pods.
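For reference, a cleanup along these lines could look like the following sketch (not our exact implementation; the one-hour grace period and the name-prefix filter are illustrative assumptions):

```python
# Sketch of a cleanup that deletes automation-job pods stuck in NotReady.
# The grace period and the name prefix are assumptions for illustration.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

GRACE = timedelta(hours=1)  # assumed: how long NotReady counts as "stuck"

config.load_incluster_config()
v1 = client.CoreV1Api()
now = datetime.now(timezone.utc)

for pod in v1.list_namespaced_pod("awx-workers").items:
    if not pod.metadata.name.startswith("automation-job-"):
        continue
    ready = next((c for c in (pod.status.conditions or [])
                  if c.type == "Ready"), None)
    if ready and ready.status == "False" and now - ready.last_transition_time > GRACE:
        print(f"deleting stuck pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, "awx-workers")
```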
For now, this bug blocks us from upgrading the production environment. We will be glad to provide clarifications if needed.
UPDATE:
After some research, we can see that some pods stuck in the NotReady state with the same error in the control-plane log do eventually get deleted. But the pods that never get deleted are all running ["ansible-inventory", "--list", "--export", "-i", "/runner/inventory/aws_ec2.yml"] (see the logs above; every pod that never gets deleted shows the same output, differing only in the IDs, which seems to be an interesting clue).
It is also interesting that we can't find these problematic jobs in the AWX UI by the job_id parameter, so they seem to be dummy jobs that don't exist in the UI.
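That check is easy to script against the AWX REST API. A hedged sketch follows; the URL and token are placeholders, and depending on the job type the ID might live under a different endpoint (for example /api/v2/inventory_updates/ rather than /api/v2/jobs/):

```python
# Sketch: parse the job ID out of a stuck pod name and ask the AWX REST API
# whether a job with that ID exists. URL and token below are placeholders.
import requests

AWX_URL = "https://awx.example.com"  # placeholder
TOKEN = "REDACTED"                   # placeholder OAuth2 token

pod_name = "automation-job-6991781-64z2g"
job_id = pod_name.split("-")[2]      # "automation-job-<ID>-<suffix>" -> "6991781"

resp = requests.get(
    f"{AWX_URL}/api/v2/jobs/{job_id}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(job_id, "found" if resp.status_code == 200
      else f"not found (HTTP {resp.status_code})")
```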
AWX version
22.3.0
Select the relevant components
[ ] UI
[ ] UI (tech preview)
[X] API
[ ] Docs
[ ] Collection
[ ] CLI
[ ] Other
Installation method
kubernetes operator
Modifications
no
Steps to reproduce
Run many AWX jobs based on a pod that contains the worker and authenticator images.
Expected results
AWX deletes all the pods that finished running.
Actual results
Some pods get stuck in the NotReady state and are never deleted.