Open DaDenniX opened 1 year ago
Issue still present in 21.5.0
Same with AWX 19.4.0
I'm experiencing the same issue.
This could be https://github.com/ansible/ansible-runner/issues/998
@AdityaVishwekar @DaDenniX @CWollinger
The log rotation issue is not about the logs from the awx-task container, but rather about the automation-job* pod that the job itself runs in.
You'll want to make sure your Docker max container log size is greater than the amount of output this playbook produces.
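On a node where containers run directly under Docker with the default `json-file` logging driver, that limit is set via the driver's rotation options. A sketch of `/etc/docker/daemon.json` — the sizes below are examples, not recommendations:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
```

Docker has to be restarted (e.g. `systemctl restart docker`) for the change to take effect, and it only applies to containers created afterwards.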
when this problem occurs, do you see the automation-job pod hanging out, or does it get cleaned up?
AWX Team
The automation-job pod gets cleaned up. Where is the configuration to set the Docker container log size for this pod?
This sounds like either the timeout issue or log rotation issue we are hoping to address with this PR https://github.com/ansible/receptor/pull/683
@AdityaVishwekar you can change the log rotation size with the docker config, see my comment here on how I did it with minikube, but other k8s clusters might be slightly different
https://github.com/ansible/awx/issues/12644#issuecomment-1256843038
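As a rough sketch of the minikube approach from that comment (the flag values here are hypothetical examples — verify the exact kubelet option names against your minikube/kubelet version):

```shell
# Recreate the minikube cluster with larger per-container log limits,
# passed through to kubelet's container log rotation settings.
minikube start \
  --extra-config=kubelet.container-log-max-size=100Mi \
  --extra-config=kubelet.container-log-max-files=5
```

On other Kubernetes distributions the equivalent is usually `containerLogMaxSize` / `containerLogMaxFiles` in the kubelet configuration file on each node.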
We are also facing this issue with v21.1.0.
Is there any workaround to fix this?
Hello, it seems I have the same problem with AWX-EE 22.4.0. The web version is older: AWX 20.0.1. When processing tasks with larger output, the job output in the UI freezes, but the task itself keeps running. After the task completes, the job status changes to ERROR.
But the automation-job pod itself completes:
kubectl -n awx logs -f automation-job-1779-pzf2w
=0 unreachable=0 \u001b[0;31mfailed=1 \u001b[0m skipped=0 rescued=0 ignored=0 \r\n\u001b[0;31mJAYNET01B\u001b[0m : \u001b[0;32mok=4 \u001b[0m changed=0 unreachable=0 \u001b[0;31mfailed=1 \u001b[0m skipped=0 rescued=0 ignored=0 \r\n\u001b[0;33mTEONET01A\u001b[0m : \u001b[0;32mok=12 \u001b[0m \u001b[0;33mchanged=1 \u001b[0m unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 \r\n\u001b[0;32mTEONET01B\u001b[0m : \u001b[0;32mok=5 \u001b[0m changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 ", "start_line": 3723, "end_line": 3730, "runner_ident": "1779", "event": "playbook_on_stats", "job_id": 1779, "pid": 17, "created": "2023-08-16T12:43:42.041638", "parent_uuid": "e246bf20-0ba0-4c5e-a0a3-c1da066a07c8", "event_data": {"playbook": "network_netbox.yml", "playbook_uuid": "e246bf20-0ba0-4c5e-a0a3-c1da066a07c8", "changed": {"TEONET01A": 1}, "dark": {}, "failures": {"JAYNET01A": 1, "JAYNET01B": 1}, "ignored": {}, "ok": {"TEONET01A": 12, "TEONET01B": 5, "JAYNET01A": 4, "JAYNET01B": 4}, "processed": {"TEONET01A": 1, "TEONET01B": 1, "JAYNET01A": 1, "JAYNET01B": 1}, "rescued": {}, "skipped": {}, "artifact_data": {}, "uuid": "67ac4489-acc8-4dea-aa37-069c95c097e6"}} {"status": "failed", "runner_ident": "1779"} {"zipfile": 1333} 
UEsDBBQAAAAIAKBjEFdYOKgdsAMAAJIKAAAHAAAAY29tbWFuZKVWW5OaSBT+KymfVy5eZtSqrVqEdiQiTTWtibW11dUgRjIILuBcksr+9u3mNoA6qRgfkO5z/853mv7ecaPDgYbbzuTD3x0aJr4TeN1jQF+dKHrs/PGh0+26e8/NX0/8WSjlMpo8do80SfLV1t/t8rddFD8mf/ayhc+fYnwKQy8W/fDJC9MofhX3UZLmdpmvv0oNL3wSvZc0pk80zuShlz4zd4T9O9GL8HoIOv+wbfeZJ105PsbRV89NuQHzwCTfO4vVFCATYGATG6C1rgJiQYTJHGPL5raDQZ/rX9GraSifPhNoAaRgiIgKTYygYQBEloqpPPB/gJGuvpmX/3NoZ25kSRj0BflO6MkD7o/vm8oScBk9pdGBpn4Udr9GTle+vx93j992vecycCP3kUSwanG71D1ORLHwLI2FXv9uMpK4kfVJqyFzS/55KFY8D1aBMboRjZY3BDEsCmiX2IY/r2cOc6RqBdValnkvnLdxkQR5Ukv6ApJE0TRUa1EO5CWD9zG/BZMzjxk/JqOfZNwE51ZqNifh1taeNazwY8+NtcE3pHe6dUaFa3rNHrGmXuNNa96azbyZsr/QpzN03xp14ZRpplvU1QLhHUpfhbV5eFkKnmfjc0piMYhcGoiJ44eT2rpavgmyl3zJHr8/9q05q47CqyxvccNQiYo3VnYQqMIKz7qjzNy09akByAwi5kCFBsyi4Pjk1cUcabIAG6LOgbrQzQeuNKNB0tDSzTUwWXUbsjItBdlAIzNFN4B2ySVTUJb6AhIEVIi0KoTddM3Ks5C+VjAgmoIVoulZgmJ6OIr0+YXw857s+70g2gYut/gIp0TPInIR33lLK98fZl1F8CNQMYu+1m0dmlyw242G4/FIdvoulWRneN8butLWdeSt1JfGnuMMZW/sSr16GYh1bUNmrEqbAFOZFtVWBSyVzwTw+EwzK+1e4r+ytpLC+zTlHGWXA1aUQLfsAnAQvkWhN+nLd4NGRNuel/QhnJsVJMWXvPgsFNoqJ5mKWYV2pm3XVWPv35Mfewd2pUiIGwUBuwGwD2ky+U8UinuKWN/OOb6nsSdeEDdggRyQanQuxYujwGtEyjcuxCgEXuo2txp1KoYxVdQFsYzVg27arYkNfEc8vqb7KOwLYzHxU3ZLo+4j/eIlpU9SJLn1E36BI8wscJiOWL40moA1uMJV1OwiwthY2Jbd1W1oMOZqTeoWcWic+jvqpolYEhWtTJMdBHCp45wzzVkoxaaxKebqTOvHj/8BUEsDBBQAAAAIAHVlEFd3o4ueCAAAAAYAAAAGAAAAc3RhdHVzS0vMzElNAQBQSwMEFAAAAAgAdWUQVw2+1RoDAAAAAQAAAAIAAAByYzMCAFBLAwQUAAAAAAB1ZRBXAAAAAAAAAAAAAAAACwAAAGpvYl9ldmVudHMvUEsBAhQDFAAAAAgAoGMQV1g4qB2wAwAAkgoAAAcAAAAAAAAAAAAAAICBAAAAAGNvbW1hbmRQSwECFAMUAAAACAB1ZRBXd6OLnggAAAAGAAAABgAAAAAAAAAAAAAAgIHVAwAAc3RhdHVzUEsBAhQDFAAAAAgAdWUQVw2+1RoDAAAAAQAAAAIAAAAAAAAAAAAAAICBAQQAAHJjUEsBAhQDFAAAAAAAdWUQVwAAAAAAAAAAAAAAAAsAAAAAAAAAAAAQAMBBJAQAAGpvYl9ldmVudHMvUEsFBgAAAAAEAAQA0gAAAE0EAAAAAA=={"eof": true}
Is there any timeline for fixing this bug?
I'm here with updates for you. I updated to https://quay.io/repository/ansible/awx-ee?tab=tags&tag=22.7.0. Everything looks fine at first sight. We will watch and test it for a few days, and I'll be back with news.
@EsDmitrii things still seeming to work perfectly?
Hi! 50/50. Some hosts started to work well after the upgrade, but some hosts keep failing :( I don't know why yet; still trying to find out.
Is it the same hosts/jobs that fail each time for you, or does it change?
Hi, sorry for the late response. Yes, I still face the issue with one specific huge job that runs against a huge inventory. In general the upgrade solved the issue on multiple hosts: I tested the change on two hosts, and when it helped, I upgraded the others. But some hosts still fail with the same issue. I tried multiple ways to fix it, with no luck, and I couldn't find any pattern explaining why the fix worked on some hosts and not on others.
Hi all! Sorry for the late response. My team and I split the huge jobs into several smaller ones, and now everything works well.
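For anyone landing here: AWX's built-in job slicing can achieve a similar split automatically, running one template as several smaller jobs over subsets of the inventory. A hedged sketch with the awx CLI — the template name is a placeholder, and the exact option name should be checked against your awx CLI version:

```shell
# Hypothetical example: run the template as 4 smaller sliced jobs,
# each covering roughly a quarter of the inventory.
awx job_templates modify "huge-network-job" --job_slice_count 4
```

Each slice produces its own (smaller) stdout stream, which keeps individual automation-job pods well under any per-container log limit.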
Please confirm the following
Bug Summary
Hi there,
I have two different playbooks:
Both jobs end with status "Error". After some research and investigation (I really don't have any experience with Kubernetes), I found out that the worker container in the automation pod (automation-job-xxxx-yyyyy) stops producing logs, so the output in the AWX GUI also freezes. The job then still runs for some time after the output freezes, but after a few minutes it fails with status "Error".
It seems to have nothing to do with the rotating log size (10 MB) mentioned in existing issues here, because the current logfile is only 6 MB.
The logfile of awx_task says:
2022-08-18T14:59:40.239996397+02:00 stderr F 2022-08-18 12:59:40,239 WARNING [008db9d212c94a569aabe6fac548d42d] awx.main.dispatch job 1018 (error) encountered an error (rc=None), please see task stdout for details.
Our settings in AWX are:
![Bildschirmfoto zu 2022-08-18 15-18-25](https://user-images.githubusercontent.com/14914839/185404738-3af3510e-b50f-4539-bea8-ae5498923c6d.png)
I really don't have a clue what is causing this error, but we really need those playbooks working :-(
AWX version
21.4.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
Firefox
Steps to reproduce
Create a playbook with 4-5 tasks and run it against over 500 hosts. For example:
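The original example playbook isn't included in the report. A minimal, hypothetical playbook of the shape described (gather facts from all hosts, then collect OS information into one CSV on a single host — all paths and the `somehost` target are placeholders):

```yaml
---
- hosts: all
  gather_facts: true
  tasks:
    - name: Build one CSV line of OS information per host
      ansible.builtin.set_fact:
        os_line: "{{ inventory_hostname }},{{ ansible_facts.distribution }},{{ ansible_facts.distribution_version }}"

    - name: Append every host's line to a CSV on the collector host
      ansible.builtin.lineinfile:
        path: /tmp/os_report.csv
        line: "{{ hostvars[item].os_line }}"
        create: true
      loop: "{{ ansible_play_hosts }}"
      run_once: true
      delegate_to: somehost
```

Against 500+ hosts, a play like this generates a large amount of job stdout, which matches the "large output" trigger discussed in this thread.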
Expected results
Get the facts of all hosts and write the OS information into one CSV file on somehost.
Actual results
Job failed with status "Error"
Additional information
No response