Closed mmacdo02-tufts closed 8 months ago
I've also attached logs from the awx-task pod: awx-lab-task-845bbc4f89-w6wkz-awx-lab-task.log
This appears to be a duplicate of https://github.com/ansible/awx/issues/14876
That issue says the problem is resolved in AWX 23.8.1 and Operator 2.12.1.
Bug Summary
Long-running Ansible jobs fail without any further information. We have AWX 23.8.0 installed on OpenShift 4.11.57 using the AWX Operator. I did check the current issues for duplicates, so I apologize if this is a duplicate bug.
I am able to replicate this problem in both my Lab and Production environments, which run on different OpenShift clusters. Both run the same version of AWX (23.8.0) with the same AWX Operator (awx-operator.v2.12.0) and the same version of Red Hat OpenShift (4.11.57). All long-running jobs fail the same way.
kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- env | grep ANSIBLE_RUNNER_KEEPALIVE_SECOND
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=30
kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- receptor --version
1.4.4+gc75b1f6
kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- ansible-runner --version
2.3.5
I’m happy to provide more information but I am pretty new to AWX. I did increase our containerLogMaxSize to 200mb for better visibility. I also set K8S Ansible Runner Keep-Alive Message Interval to 30.
Right now, for troubleshooting, I am just trying to run a simple Ansible playbook that pauses for 120 minutes. This job always fails.
AWX version
23.8.0
Select the relevant components
Installation method
openshift
Modifications
no
Ansible version
No response
Operating system
OpenShift 4.11.57
Web browser
Chrome
Steps to reproduce
Within AWX, the Task shows Failed: Task was canceled due to receiving a shutdown signal. I am just running a very similar Ansible playbook that pauses for 120 minutes to replicate the issue. I cannot figure out what is sending a shutdown to the automation
- name: Test long running job in AWX
  hosts: localhost
  connection: local
  gather_facts: no
  become: no
  tasks:
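The playbook above is cut off after `tasks:`. For anyone trying to reproduce this, a minimal sketch of what such a playbook could look like is shown here. The task itself is an assumption (the original task list is not visible in this report); it uses the `ansible.builtin.pause` module with its `minutes` parameter to hold the job open for 120 minutes, matching the description.

```yaml
# Hypothetical reproduction playbook. The play header is copied from the
# report; the task under "tasks:" is an assumption, since the original
# task list is truncated.
- name: Test long running job in AWX
  hosts: localhost
  connection: local
  gather_facts: no
  become: no
  tasks:
    # Assumed task: keep the job running for 120 minutes with no other work.
    - name: Pause for 120 minutes to simulate a long-running job
      ansible.builtin.pause:
        minutes: 120
```

Because the play does nothing but wait, any failure is more likely to come from the platform (the automation-job pod, receptor, or the log stream) than from the playbook content.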
awx-lab-task-845bbc4f89-w6wkz-awx-lab-task.log
Expected results
I expect the Ansible job to run successfully without timing out.
Actual results
Every job fails with Task was canceled due to receiving a shutdown signal.
I can see the automation-job pod terminate but I cannot figure out what is causing this pod to terminate before the Ansible job is completed.
Additional information
No response