ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

AWX not able to delete worker pods after they finish running #15247

Open chinna44 opened 5 months ago

chinna44 commented 5 months ago

Please confirm the following

Bug Summary

We recently upgraded AWX from 22.5.0 to 23.9.0, deployed on EKS 1.28.

After the AWX upgrade, we observed that for a few jobs (not all), specifically inventory sync jobs running on worker pods, the pods are not getting deleted even after the job workflow is completed. The pods stay behind for hours or days until we delete them manually. I don't see any other errors.

The worker pod status is shown below:

NAME                          READY   STATUS     RESTARTS   AGE
automation-job-462026-6zf7c   1/2     NotReady   0          3m23s

The errors captured from the AWX control plane EE logs for the worker pods that are not getting deleted:

Error deleting pod automation-job-462026-6zf7c: client rate limiter Wait returned an error: context canceled
Context was canceled while reading logs for pod awx-workers/automation-job-462026-6zf7c. Assuming pod has finished

The pod status description shows the following (confidential data omitted):

Containers:
  worker:
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
    Ready:          False
    Restart Count:  0
  authenticator:
    State:          Running
    Ready:          True
    Restart Count:  0

The automation-job-462026-6zf7c pod contains two containers: worker and authenticator.

When the pod is stuck, we can see that the worker container has terminated while the authenticator container keeps running. The attached logs show what we see in each container: worker-container.txt authenticator-container.txt
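A minimal sketch of how these orphaned pods can be spotted from outside AWX, using the Kubernetes Python client. The awx-workers namespace and the worker container name are taken from this report (the 1/2 READY count simply reflects that only the authenticator container is still running); this is only an illustration of the manual check, not AWX code, and assumes kubeconfig access to the cluster:

# Sketch: list automation-job pods whose worker container has terminated
# but which are still present in the cluster (the stuck state described above).
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("awx-workers").items:
    name = pod.metadata.name
    if not name.startswith("automation-job-"):
        continue
    for cs in pod.status.container_statuses or []:
        if cs.name == "worker" and cs.state.terminated is not None:
            print(f"{name}: worker exited with code "
                  f"{cs.state.terminated.exit_code}, pod still present")
            # Manual cleanup, as described above, would be:
            # v1.delete_namespaced_pod(name, "awx-workers")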

For now we are testing this in a non-production environment, but it is currently a blocker for upgrading production. Please have a look and provide a fix, or suggest the best AWX version if this is a known issue.

AWX version

23.9.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Run many AWX jobs whose pods contain the worker and authenticator images (we observed this mainly on inventory sync jobs).

Expected results

AWX deletes all the pods that finished running.

Actual results

AWX worker pods remain stuck and are not deleted.

Additional information

No response

chronicc commented 5 months ago

I observe the same issues on Kubernetes 1.27 with AWX 23.0.0.

The pods that are not deleted are pods where the AWX jobs were deleted immediately after the pod failed. It looks like AWX only knows about existing pods through the jobs inside AWX.

If this is the case, the pod should be actively removed from Kubernetes when the job is deleted, OR the API output of the job should give a hint on whether the pod has already been deleted inside Kubernetes.
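As a rough illustration of that cross-check (a sketch only, not AWX's own logic): the job id can be parsed from the pod name (automation-job-<id>-<hash>, as seen above) and looked up in the AWX API; a 404 would mean the job record is gone and the pod is orphaned. The AWX URL and token below are placeholders:

# Sketch: flag automation-job pods whose corresponding AWX job record no longer exists.
import requests
from kubernetes import client, config

AWX_URL = "https://awx.example.com"   # placeholder controller URL
TOKEN = "REDACTED"                    # placeholder AWX OAuth2 token

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("awx-workers").items:
    name = pod.metadata.name
    if not name.startswith("automation-job-"):
        continue
    job_id = name.split("-")[2]       # automation-job-<id>-<hash>
    resp = requests.get(f"{AWX_URL}/api/v2/jobs/{job_id}/",
                        headers={"Authorization": f"Bearer {TOKEN}"},
                        timeout=30)
    if resp.status_code == 404:
        print(f"job {job_id} not found in AWX; pod {name} looks orphaned")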

TheRealHaoLiu commented 5 months ago

Can you give us the output of /api/v2/jobs/462026?

chinna44 commented 5 months ago

@TheRealHaoLiu below is the output. I want to highlight again that the pods fail to delete only for a few inventory sync jobs, all of which completed successfully.

ansible-inventory [core 2.15.5]
  config file = /ansible.cfg
  configured module search path = ['/cyberark-ansible-modules/lib/ansible/modules', '/runner/project']
  ansible python module location = /usr/local/lib/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections:/usr/share/automation-controller/collections
  executable location = /usr/local/bin/ansible-inventory
  python version = 3.9.18 (main, Jan 24 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/usr/bin/python3)
  jinja version = 3.0.0
  libyaml = False
Using /ansible.cfg as config file
[DEPRECATION WARNING]: DEFAULT_GATHER_TIMEOUT option, the module_defaults keyword is a more generic version and can apply to all calls to the M(ansible.builtin.gather_facts) or M(ansible.builtin.setup) actions, use module_defaults instead. This feature will be removed from ansible-core in version 2.18. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
redirecting (type: inventory) ansible.builtin.aws_ec2 to amazon.aws.aws_ec2
Using inventory plugin 'ansible_collections.amazon.aws.plugins.inventory.aws_ec2' to process inventory source '/runner/inventory/aws_ec2.yml'
Parsed /runner/inventory/aws_ec2.yml inventory source with auto plugin
8.867 INFO Processing JSON output...
8.868 INFO Loaded 1 groups, 0 hosts
8.898 INFO Inventory import completed for AWS-sandbox-Windows in 0.0s

TheRealHaoLiu commented 5 months ago

@chinna44 that does not look like the output from the api endpoint... that looks like the stdout of the job

chinna44 commented 5 months ago

@TheRealHaoLiu yes.. you are correct, I'm sorry for that

Below is the output for the endpoint /api/v2/jobs/462026, but I could not see job details for this or any other inventory sync job of this kind. Please let me know if you need the details in any other way.

HTTP 404 Not Found
Allow: GET, DELETE, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
X-API-Node: awx-web-c8bc64f45-h7xwt
X-API-Product-Name: AWX
X-API-Product-Version: 23.9.0
X-API-Time: 0.057s

{ "detail": "Not found." }

chinna44 commented 5 months ago

@TheRealHaoLiu please let me know if you need any other details

BartOpitz commented 2 months ago

Hi. We face the same issue, but in our case regular job pods are also sometimes not removed in k8s. Mostly it is job pods that end with an error rather than OK, but some successful pods are left hanging as well.

We started observing this type of problem after updating AWX from 24.5.0 to 24.6.1 and upgrading k8s to 1.30.3, which sadly both took place on the same day. Before the upgrade we did not observe this kind of problem, and we were already running at least k8s 1.28 (cannot confirm the precise version currently).

Update: OK, I take everything above back. It turned out that during troubleshooting one of our admins had set

RECEPTOR_RELEASE_WORK = False        # Default True
RECEPTOR_KEEP_WORK_ON_ERROR = True   # Default False

This explains the behaviour we had. After reverting those back to their defaults, all leftover pods were removed immediately by AWX and no new pods are left behind. So everything works as expected, at least within the mentioned versions.
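For anyone hitting the same thing, a minimal sketch for confirming the effective values of those two settings. It assumes shell access to the awx-task container (for example via awx-manage shell), and relies only on AWX being a Django application:

# Sketch: print the effective Receptor work-unit settings mentioned above.
from django.conf import settings

print(settings.RECEPTOR_RELEASE_WORK)        # default: True
print(settings.RECEPTOR_KEEP_WORK_ON_ERROR)  # default: False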