ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

AWX Jobs Failing with "Task was canceled due to receiving a shutdown signal." #14948

Closed mmacdo02-tufts closed 8 months ago

mmacdo02-tufts commented 8 months ago

Bug Summary

Long-running Ansible jobs are failing with no additional information. We have AWX 23.8.0 installed on OpenShift 4.11.57 using the AWX Operator. I checked the existing issues for duplicates, so I apologize if this is one.

I can reproduce this problem in both my lab and production environments, which run on different OpenShift clusters. Both run the same AWX version (23.8.0), the same AWX Operator (awx-operator.v2.12.0), and the same Red Hat OpenShift version (4.11.57). All long-running jobs fail the same way.

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- env | grep ANSIBLE_RUNNER_KEEPALIVE_SECOND
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=30

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- receptor --version
1.4.4+gc75b1f6

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- ansible-runner --version
2.3.5

I’m happy to provide more information, but I am pretty new to AWX. I increased our containerLogMaxSize to 200mb for better visibility, and I set K8S Ansible Runner Keep-Alive Message Interval to 30.

Right now I am running a simple Ansible playbook that just pauses for 120 minutes, for troubleshooting and debugging. This job always fails.

AWX version

23.8.0

Installation method

openshift

Modifications

no

Ansible version

No response

Operating system

OpenShift 4.11.57

Web browser

Chrome

Steps to reproduce

Within AWX, the task shows "Failed: Task was canceled due to receiving a shutdown signal." To replicate the issue, I am running a similar Ansible playbook that pauses for 120 minutes. I cannot figure out what is sending a shutdown to the automation-job pod.

- name: Test long running job in AWX
  hosts: localhost
  connection: local
  gather_facts: no
  become: no
  tasks:

Screenshot 2024-03-04 153340

awx-lab-task-845bbc4f89-w6wkz-awx-lab-task.log

Expected results

I expect the Ansible job to run successfully without timing out.

Actual results

Every job fails with Task was canceled due to receiving a shutdown signal.

I can see the automation-job pod terminate but I cannot figure out what is causing this pod to terminate before the Ansible job is completed.

Additional information

No response

mmacdo02-tufts commented 8 months ago

I've also attached logs from the awx-task pod: awx-lab-task-845bbc4f89-w6wkz-awx-lab-task.log

mmacdo02-tufts commented 8 months ago

This appears to be a duplicate of https://github.com/ansible/awx/issues/14876

That issue says it's resolved in AWX 23.8.1 and Operator 2.12.1.
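
For anyone else hitting this, here is a rough sketch of how one might check whether an install still predates those fixed versions. The `version_lt` helper and the variable names are my own illustration, not part of AWX or the operator, and it assumes GNU `sort` with the `-V` (version sort) option is available. The installed versions are the ones from this report; substitute your own.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AWX tool): succeeds when version $1 is
# strictly older than version $2. Assumes GNU sort, which supports -V.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Installed versions from this report; fixed versions as stated in #14876.
awx_installed="23.8.0"
awx_fixed="23.8.1"
operator_installed="2.12.0"
operator_fixed="2.12.1"

if version_lt "$awx_installed" "$awx_fixed"; then
  echo "AWX $awx_installed predates the fix in $awx_fixed"
fi

if version_lt "$operator_installed" "$operator_fixed"; then
  echo "awx-operator $operator_installed predates the fix in $operator_fixed"
fi
```

With the versions from this report, both checks fire, which matches what I am seeing in both environments.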