The running ansible process received a shutdown signal.

Peter1295 commented 5 months ago

Please confirm the following

[X] I agree to follow this project's code of conduct.
[X] I have checked the current issues for duplicates.
[X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
[X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)

Bug Summary

Random crashes with message The running ansible process received a shutdown signal. After the crash, the awx-task pod that was running the job disappears from Instances, but the pod is still running in the cluster.

Attaching logs from ArgoCD: awx-task.txt

AWX version

AWX 24.4.0

Select the relevant components

[ ] UI
[ ] UI (tech preview)
[ ] API
[ ] Docs
[ ] Collection
[ ] CLI
[X] Other

Installation method

kubernetes

Modifications

no

Ansible version

v2.17.0

Operating system

k8s cluster on OL9

Web browser

Firefox, Chrome, Edge

Steps to reproduce

Created a playbook with six 5-minute pause tasks and ran the template.
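
For anyone trying to reproduce this, a playbook along these lines could be generated with the short Python sketch below. It only illustrates the shape of such a playbook and is not the reporter's actual file: the play name, `hosts: localhost`, and the task names are assumptions. It relies on the `ansible.builtin.pause` module and its `minutes` parameter.

```python
# Sketch: build a playbook with six 5-minute pause tasks (30 minutes total).
# The play name, hosts value, and task names are illustrative assumptions.

PAUSE_COUNT = 6      # number of pause tasks, as described in the report
PAUSE_MINUTES = 5    # length of each pause in minutes


def build_playbook(count=PAUSE_COUNT, minutes=PAUSE_MINUTES):
    """Return playbook YAML text containing `count` pause tasks."""
    lines = [
        "---",
        "- name: Long-running job to reproduce the shutdown signal failure",
        "  hosts: localhost",
        "  gather_facts: false",
        "  tasks:",
    ]
    for i in range(1, count + 1):
        lines += [
            f"    - name: Pause {i} of {count}",
            "      ansible.builtin.pause:",
            f"        minutes: {minutes}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the playbook so it can be redirected into a file of your choice.
    print(build_playbook(), end="")
```

Redirect the output into a file, commit it to the project used by the job template, and launch the template as usual.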

Expected results

The template finishes in 30 minutes.

Actual results

The job failed within 10 minutes with the error: The running ansible process received a shutdown signal.

Additional information

The issue behaves the same as #14948, but that should have been resolved in version 23.8.1, and I am using the newest version of AWX.

TheRealHaoLiu commented 4 months ago

Random crashes with message The running ansible process received a shutdown signal.

Where are you seeing this? Please provide some context.

Currently we do not have enough information to understand what is happening here.

Peter1295 commented 4 months ago

The AWX template fails with that message. The timing is random, mostly 7 to 12 minutes into the job run. I see it happen with jobs that make changes on multiple hosts (patching, VM customization, etc.).

Unfortunately, the awx-task logs do not show anything helpful, just a message that the job/workflow failed: Workflow job 18542 failed due to reason: No error handling path for workflow job node(s) [(26156,failed)]. Workflow job node(s) missing unified job template and error handling path [].

The cluster runs on k8s with 2 control plane nodes and 4 worker nodes. Maximum usage according to kubectl top node is around 20% CPU and 80% memory, and all nodes have at least 40% free disk space.

The AWX database runs on an external Postgres server.

Peter1295 commented 4 months ago

Attaching logs from the automation pod that failed mid-run: task3.log. There is absolutely no information about what is happening, not from awx-task, awx-web, or the automation pod. Any suggestions on what to look for?

We really depend on AWX. We use it daily for managing, deploying, patching, etc., it runs at least 50 templates a day, and I cannot stay permanently connected to check whether it is still working. We have another instance in the production environment that still runs 23.3.1 and works properly, but unfortunately downgrading no longer works because the older version cannot use the upgraded database.

Peter1295 commented 4 months ago

Another update: the issue is not version related. I was able to downgrade AWX to version 23.8.1 (which should not have such problems). The issue is not with the database either; I tried both the current Postgres and the older one from before the migration. Sometimes it fails in 5 minutes, sometimes the job runs for almost 1 hour.

Peter1295 commented 4 months ago

The issue also persists on 24.6.0. The Kubernetes logs show only the successful shutdown of the automation pod, not what is happening or why.
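
When the pod logs show only a clean shutdown, the pod's status object can sometimes say more, for example a pod-level reason or the exit code of a terminated container. The sketch below is one possible way to pull those fields out of the JSON printed by `kubectl get pod <pod-name> -o json`. It is not part of AWX, the function name `termination_details` is hypothetical, and it only helps if the JSON is captured before the automation pod is deleted.

```python
import json
import sys


def termination_details(pod):
    """Collect pod-level and container-level termination info from a Pod object."""
    status = pod.get("status", {})
    details = []

    # Pod-level reason/message are optional and only set in some cases
    # (for example when the kubelet evicts the pod).
    if status.get("reason") or status.get("message"):
        details.append({
            "scope": "pod",
            "reason": status.get("reason"),
            "message": status.get("message"),
        })

    # Each container may report a terminated state, current or previous.
    for cs in status.get("containerStatuses", []):
        for key in ("state", "lastState"):
            term = cs.get(key, {}).get("terminated")
            if term:
                details.append({
                    "scope": f"container:{cs.get('name')}",
                    "reason": term.get("reason"),
                    "exit_code": term.get("exitCode"),
                    "signal": term.get("signal"),
                })
    return details


def main():
    # Expects the output of `kubectl get pod <pod-name> -o json` on stdin.
    raw = sys.stdin.read()
    if not raw.strip():
        print("no pod JSON on stdin")
        return
    for entry in termination_details(json.loads(raw)):
        print(entry)


if __name__ == "__main__":
    main()
```

An empty result means the status carried no termination information at the time it was captured, in which case the events of the namespace (`kubectl get events`) may be the next place to look.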
