AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
AWX can't delete worker pods that have already finished running #14107
Please confirm the following
[X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
[X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)
Bug Summary
We recently upgraded our AWX version from 21.10.2 to 22.3.0. We are running it on an EKS 1.24 cluster.
After the upgrade, some AWX worker pods get stuck in the "NotReady" state. The number of these pods grows gradually until our cluster fills up with pods that are never automatically deleted. In AWX's UI, everything looks fine; I can't see any stuck jobs there. It only happens sometimes; usually the pods are deleted successfully after the job finishes running.
This is what we see in the logs of the Control Plane EE (in this case, the worker pod named automation-job-6991781-64z2g is one that cannot be deleted for some reason):
ERROR 2023/06/11 08:46:56 [p8SNzi65] Error reading from pod awx-workers/automation-job-6991781-64z2g: context canceled
ERROR 2023/06/11 08:46:56 Error deleting pod automation-job-6991781-64z2g: client rate limiter Wait returned an error: context canceled
The automation-job-6991781-64z2g pod contains two containers: worker and authenticator.
When the pod is stuck, we can see that the worker container is terminated while the authenticator container keeps running. These are the authenticator container logs (the ending); a quick way to confirm this state is sketched after the logs:
INFO: 2023/06/11 09:15:17.838185 main.go:19: CAKC048 Kubernetes Authenticator Client v0.25.0-1340de821a8 starting up...
INFO: 2023/06/11 09:15:17.838219 configuration_factory.go:80: CAKC070 Chosen "authn-k8s" configuration
INFO: 2023/06/11 09:15:17.838241 authenticator_factory.go:31: CAKC075 Chosen "authn-k8s" flow
INFO: 2023/06/11 09:15:17.864102 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:15:18.693877 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:15:18.693900 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:21:18.792953 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:21:19.269918 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:21:19.269940 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:27:19.296108 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:27:19.791494 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:27:19.791517 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:33:19.839317 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:33:20.329509 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:33:20.329529 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:39:20.329661 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:39:20.893922 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:39:20.893944 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
INFO: 2023/06/11 09:45:20.894067 authenticator.go:84: CAKC040 Authenticating as user 'host/conjur/authn-k8s'
INFO: 2023/06/11 09:45:21.393847 authenticator.go:116: CAKC035 Successfully authenticated
INFO: 2023/06/11 09:45:21.393874 main.go:56: CAKC047 Waiting for 6m0s to re-authenticate
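For anyone hitting the same symptom, here is a minimal sketch (using the official kubernetes Python client; the pod and namespace names are the ones from this report) that prints the per-container state of a stuck pod, so the terminated worker next to the still-running authenticator is visible:

```python
# Minimal sketch: print each container's state for one stuck pod.
# Pod/namespace names are from this report; adjust them for your cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod("automation-job-6991781-64z2g", "awx-workers")
for s in pod.status.container_statuses:
    if s.state.terminated:
        print(f"{s.name}: terminated (exit code {s.state.terminated.exit_code})")
    elif s.state.running:
        print(f"{s.name}: running since {s.state.running.started_at}")
    else:
        print(f"{s.name}: waiting")
```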
We could suspect an authenticator issue, but the same workflows work correctly in the production environment, which still runs the older AWX version (21.10.2). If I delete these stuck worker pods manually, they are deleted successfully. All upgraded environments have this bug, and as a temporary workaround we have to run a cleanup that deletes all NotReady pods.
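For reference, a cleanup along these lines could look like the following sketch (not our exact implementation; the one-hour grace period and the name-prefix filter are illustrative assumptions):

```python
# Sketch of a cleanup that deletes automation-job pods stuck in NotReady.
# The grace period and the name prefix are assumptions for illustration.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

GRACE = timedelta(hours=1)  # assumed: how long NotReady counts as "stuck"

config.load_incluster_config()
v1 = client.CoreV1Api()
now = datetime.now(timezone.utc)

for pod in v1.list_namespaced_pod("awx-workers").items:
    if not pod.metadata.name.startswith("automation-job-"):
        continue
    ready = next((c for c in (pod.status.conditions or [])
                  if c.type == "Ready"), None)
    if ready and ready.status == "False" and now - ready.last_transition_time > GRACE:
        print(f"deleting stuck pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, "awx-workers")
```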
For now, this bug blocks us from upgrading the production environment. We will be glad to provide clarifications if needed.
UPDATE:
After some research, we can see that some pods stuck in the NotReady state with the same error in the control-plane log do eventually get deleted. But the pods that never get deleted are all running ["ansible-inventory", "--list", "--export", "-i", "/runner/inventory/aws_ec2.yml"] (see the logs above; every pod that never gets deleted shows the same output, differing only in the IDs, which seems to be an interesting clue).
It is also interesting that we can't find these problematic jobs in the AWX UI by the job_id parameter, so they seem to be dummy jobs that don't exist in the UI.
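That check is easy to script against the AWX REST API. A hedged sketch follows; the URL and token are placeholders, and depending on the job type the ID might live under a different endpoint (for example /api/v2/inventory_updates/ rather than /api/v2/jobs/):

```python
# Sketch: parse the job ID out of a stuck pod name and ask the AWX REST API
# whether a job with that ID exists. URL and token below are placeholders.
import requests

AWX_URL = "https://awx.example.com"  # placeholder
TOKEN = "REDACTED"                   # placeholder OAuth2 token

pod_name = "automation-job-6991781-64z2g"
job_id = pod_name.split("-")[2]      # "automation-job-<ID>-<suffix>" -> "6991781"

resp = requests.get(
    f"{AWX_URL}/api/v2/jobs/{job_id}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(job_id, "found" if resp.status_code == 200
      else f"not found (HTTP {resp.status_code})")
```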
AWX version
22.3.0
Select the relevant components
[ ] UI
[ ] UI (tech preview)
[X] API
[ ] Docs
[ ] Collection
[ ] CLI
[ ] Other
Installation method
kubernetes operator
Modifications
no
Steps to reproduce
Run many AWX jobs based on a pod that contains the worker and authenticator images.
Expected results
AWX deletes all the pods that finished running.
Actual results
Some pods get stuck in the NotReady state and are never deleted.