ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Job failed because the log size limit was reached, and the failed job output can't be retrieved. #13680

Open Lee-Kwang opened 1 year ago

Lee-Kwang commented 1 year ago

Bug Summary

I started a job on a container group on an OpenShift cluster, and it ended as failed when the log size limit was reached. Retrieving the failed job's output then takes ages or fails outright, and viewing the job template shows the error 'Something went wrong'. Deleting the failed job clears the error.

Behaviour observed:

I started a job on the container group in the OpenShift cluster against an inventory containing some hosts or localhost. The playbook printed many lines of debug messages until the log size limit was reached. The job got stuck for several minutes and ended as failed. I tried all options of RECEPTOR_KUBE_SUPPORT_RECONNECT; the results were the same.

Retrieving the job output took ages or ended with 'Something went wrong'. When trying to view the job template, I often got a 'Something went wrong' error. The Jobs view often showed 'Something went wrong' as well. After deleting the failed job, I could view the job template details, and the Jobs view worked again.

AWX version

21.5.1

Installation method

kubernetes

Modifications

no

Ansible version

core 2.13.8

Operating system

Red Hat Enterprise Linux release 9.1 (Plow) UBI

Web browser

Chrome

Steps to reproduce

1. Create a job template with a playbook that prints tens of thousands of debug messages.
2. Create an inventory with some remote hosts or localhost.
3. Start the job on the container group in the remote cluster against the above inventory.

Expected results

The job ends successfully

Actual results

The job failed. The job output can't be viewed. The failed job blocks access to the job template.

Additional information

No response

fosterseth commented 1 year ago

I have a feeling you may be running into https://github.com/ansible/awx/pull/12961

that PR landed in awx 21.11.0

what happened was that when jobs ended in a failed state, AWX would attempt to gather the entire output of the job pod and stick it into the job's result_traceback field. However, that output could be MASSIVE (all of the stdout) and was breaking things.

in 21.11.0+ it caps the output to the last 1000 bytes or so. I bet the job template detail page was loading the last-run job and the uwsgi process was just dying while trying to load that job.

This won't explain why those jobs failed in the first place (you mention a log rotation limit problem, which can be addressed via k8s configuration). However, on 21.11.0+ it shouldn't break the UI when these failures happen.
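
To illustrate the idea of capping a failed job's output to its tail, here is a minimal Python sketch. It is not the actual AWX code from the PR; the function name `cap_tail` and the default limit of 1000 bytes are assumptions chosen to mirror the "last 1000 bytes or so" described above.

```python
def cap_tail(output: str, limit: int = 1000, encoding: str = "utf-8") -> str:
    """Return at most the last `limit` bytes of `output` (hypothetical helper)."""
    raw = output.encode(encoding)
    if len(raw) <= limit:
        # Small outputs are stored unchanged.
        return output
    # Slicing bytes may cut a multi-byte character in half at the start of
    # the tail; errors="ignore" drops that partial character.
    return raw[-limit:].decode(encoding, errors="ignore")


if __name__ == "__main__":
    # Simulate a playbook that printed a very large amount of debug output.
    huge_output = "debug line\n" * 50_000
    capped = cap_tail(huge_output)
    print(len(huge_output), len(capped))
```

Storing only the tail keeps the most recent lines, which usually contain the failure, while keeping the stored field small enough to load quickly.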