Open chinna44 opened 5 months ago
I observe the same issues on Kubernetes 1.27 with AWX 23.0.0.
The pods that are not deleted are pods where the awx jobs have been deleted immediately after the pod failed. It looks like awx only knows about existing pods through the jobs inside of awx.
If this is the case, the pod should actively be removed from Kubernetes when the job is deleted, OR the API output of the job should give a hint as to whether the pod has already been deleted in Kubernetes.
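As a stopgap until that is the case, leftover automation-job pods can be cleaned up by hand. A minimal sketch, assuming the awx-workers namespace used later in this report and the default automation-job-<id>-<hash> pod naming:

# list automation-job pods whose containers are no longer all ready (cleanup candidates)
kubectl -n awx-workers get pods --no-headers | awk '/^automation-job-/ && $3 == "NotReady"'

# delete one stuck pod once its AWX job record is confirmed gone
kubectl -n awx-workers delete pod automation-job-462026-6zf7c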
Can you give us the output of /api/v2/jobs/462026?
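For reference, a sketch of pulling that endpoint from the command line; the hostname and token below are placeholders:

# fetch the job record the controller still has for this ID (replace host and token with your own)
curl -sk -H "Authorization: Bearer $AWX_TOKEN" \
  https://awx.example.com/api/v2/jobs/462026/ | python3 -m json.tool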
ansible-inventory [core 2.15.5]
  config file = /ansible.cfg
  configured module search path = ['/cyberark-ansible-modules/lib/ansible/modules', '/runner/project']
  ansible python module location = /usr/local/lib/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections:/usr/share/automation-controller/collections
  executable location = /usr/local/bin/ansible-inventory
  python version = 3.9.18 (main, Jan 24 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/usr/bin/python3)
  jinja version = 3.0.0
  libyaml = False
Using /ansible.cfg as config file
[DEPRECATION WARNING]: DEFAULT_GATHER_TIMEOUT option, the module_defaults keyword is a more generic version and can apply to all calls to the M(ansible.builtin.gather_facts) or M(ansible.builtin.setup) actions, use module_defaults instead. This feature will be removed from ansible-core in version 2.18. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
redirecting (type: inventory) ansible.builtin.aws_ec2 to amazon.aws.aws_ec2
Using inventory plugin 'ansible_collections.amazon.aws.plugins.inventory.aws_ec2' to process inventory source '/runner/inventory/aws_ec2.yml'
Parsed /runner/inventory/aws_ec2.yml inventory source with auto plugin
8.867 INFO Processing JSON output...
8.868 INFO Loaded 1 groups, 0 hosts
8.898 INFO Inventory import completed for AWS-sandbox-Windows in 0.0s
@chinna44 that does not look like the output from the api endpoint... that looks like the stdout of the job
@TheRealHaoLiu Yes, you are correct; sorry about that.
Below is the output for the endpoint /api/v2/jobs/462026, but I could not see the job details for this or any other inventory sync jobs. Please let me know if you need the details in any other form.
HTTP 404 Not Found
Allow: GET, DELETE, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
X-API-Node: awx-web-c8bc64f45-h7xwt
X-API-Product-Name: AWX
X-API-Product-Version: 23.9.0
X-API-Time: 0.057s
{ "detail": "Not found." }
@TheRealHaoLiu please let me know if you need any other details
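One possible explanation for the 404, assuming 462026 was an inventory source sync: those runs are recorded as inventory updates rather than playbook jobs, so if the record still exists it would be served from a different endpoint. A hedged check, again with placeholder host and token:

# inventory source syncs live under /api/v2/inventory_updates/, not /api/v2/jobs/
curl -sk -H "Authorization: Bearer $AWX_TOKEN" \
  https://awx.example.com/api/v2/inventory_updates/462026/ | python3 -m json.tool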
Hi. We face the same issue: job pods are sometimes not removed in k8s. It is mostly job pods that end with an error rather than OK, but some successful pods are left hanging as well.
We first observed this type of problem after updating AWX from 24.5.0 to 24.6.1 and upgrading k8s to 1.30.3, which sadly both took place on the same day. Before the upgrade we did not see this kind of problem, and we were already running at least k8s 1.28 (cannot confirm the precise version currently).
Update: OK, I take everything above back. It turned out that during troubleshooting one of our admins had set
RECEPTOR_RELEASE_WORK = False # Default True
RECEPTOR_KEEP_WORK_ON_ERROR = True # Default False
This explains the behaviour we had. After reverting those back to the defaults, all leftover pods were removed immediately by AWX and no new pods are left behind. So everything works as expected, at least with the versions mentioned.
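For anyone hitting the same thing, the effective values of those two flags can be checked before digging further. A sketch from inside the control plane, assuming the operator defaults where the instance is named awx, so the task deployment and container are both called awx-task and run in the awx namespace:

# print the effective receptor work-release flags from the task container
kubectl -n awx exec deploy/awx-task -c awx-task -- \
  awx-manage shell -c "from django.conf import settings; print(getattr(settings, 'RECEPTOR_RELEASE_WORK', None), getattr(settings, 'RECEPTOR_KEEP_WORK_ON_ERROR', None))"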
Bug Summary
We recently upgraded AWX from 22.5.0 to 23.9.0, deployed on EKS 1.28.
After the AWX upgrade, we observed that for a few jobs (not all), mainly inventory syncs, the worker pods are not getting deleted even after the job workflow is completed. The pods stay around for hours or days until we delete them manually. I don't see any other errors.
The worker pod status is shown below:
NAME                          READY   STATUS     RESTARTS   AGE
automation-job-462026-6zf7c   1/2     NotReady   0          3m23s
The errors captured in the AWX control plane EE logs for the worker pods that are not getting deleted:
Error deleting pod automation-job-462026-6zf7c: client rate limiter Wait returned an error: context canceled
Context was canceled while reading logs for pod awx-workers/automation-job-462026-6zf7c. Assuming pod has finished
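Those lines come from the control plane EE container. A sketch of how to pull them, assuming the operator default layout (instance named awx, so the task deployment is awx-task and the EE container is awx-ee):

# grep the control plane EE/receptor logs for the stuck pod
kubectl -n awx logs deploy/awx-task -c awx-ee --timestamps | grep automation-job-462026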
The pod status description shows (confidential data redacted):
Containers:
  worker:
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
    Ready:          False
    Restart Count:  0
  authenticator:
    State:          Running
    Ready:          True
    Restart Count:  0
The automation-job-462026-6zf7c pod contains two containers: worker and authenticator.
When the pod is stuck, we can see that the worker container has terminated while the authenticator container keeps running. The attached logs show what we see in the worker and authenticator containers: worker-container.txt authenticator-container.txt
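The same state can be captured with kubectl, using the namespace and pod name already shown above:

# container states of the stuck pod (worker Terminated/Completed, authenticator still Running)
kubectl -n awx-workers describe pod automation-job-462026-6zf7c

# per-container logs, equivalent to the attached worker-container.txt and authenticator-container.txt
kubectl -n awx-workers logs automation-job-462026-6zf7c -c worker
kubectl -n awx-workers logs automation-job-462026-6zf7c -c authenticator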
For now we are testing this in a non-production environment, but it is currently a blocker for upgrading production. Please have a look and provide a fix, or suggest the best AWX version to use if this is a known issue.
AWX version
23.9.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run many AWX jobs using a pod spec that contains the worker and authenticator containers (we observed this mainly on inventory sync jobs).
Expected results
AWX deletes all the pods that finished running.
Actual results
AWX worker pods get stuck in NotReady and are not deleted.
Additional information
No response