ybucci opened this issue 2 years ago
Same for us
+1. In the awx_task log there is just:
awx.main.dispatch job 1216 (error) encountered an error (rc=None), please see task stdout for details.
and nothing in stdout.
We also experience this, or at least a similar issue, on AWX 21.1.0.
It happens on inventory updates and regular playbook runs.
It happens with different Ansible Core versions (2.10, 2.11 and 2.12) based entirely on the official ansible-runner Execution Environments.
We cannot find the common thread as to why, though, as these jobs just fail without any interaction.
API reports for these jobs e.g.:
"status": "error",
"execution_environment": 5,
"failed": true,
"started": "2022-06-09T12:27:22.691919Z",
"finished": "2022-06-09T12:28:43.090729Z",
"canceled_on": null,
"elapsed": 80.399,
(...)
"job_explanation": "Job terminated due to error",
When capturing the log of the K8s job, the following is seen: {"status": "canceled", "runner_ident": "2313176"}
ansible-runner also outputs a base64-encoded ZIP file. When investigating its contents, the following can be found:
This seems like ansible-runner receives a True result from cancel_callback:
https://github.com/ansible/ansible-runner/blob/2.2.0/ansible_runner/runner.py#L321
AWX itself passes this function as cancel_callback: https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/callback.py#L177
The task log contains something ambiguous like: 2022-06-09 12:28:43,413 WARNING [ab304382f64e4f8eb5cfb1b9e0a3ec48] awx.main.dispatch job 2313176 (error) encountered an error (rc=None), please see task stdout for details.
The rc is missing in this message, even though we can confirm it is being returned by ansible-runner.
Also, the error raised is of type TaskError, based on the message format and wording:
https://github.com/ansible/awx/blob/21.1.0/awx/main/exceptions.py#L22
https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/jobs.py#L588
As the rc from ansible-runner is not used, TaskError is raised instead of TaskCancel and the job status is error. There seems to be a bug, e.g. in AWXReceptorJob, that causes a cancellation of ansible-runner but an error in AWXReceptorJob.
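To make the suspected mechanism concrete, here is a minimal, self-contained sketch. This is not AWX or ansible-runner code; `run_with_cancel` and its arguments are invented for illustration. It only shows the general shape of a cancel-polling loop in which any truthy return from cancel_callback flips an otherwise healthy run into the "canceled" state:

```python
# Hypothetical sketch of a cancel-polling loop, illustrating the behavior
# described above: any truthy value from cancel_callback ends the run as
# "canceled", regardless of whether a user actually requested cancellation.

def run_with_cancel(events, cancel_callback):
    """Consume `events` until exhausted or until cancel_callback() is truthy."""
    for _ in events:
        if cancel_callback():
            # A callback that misfires (e.g. because of a dropped connection
            # on the controller side) cancels the job even though nobody asked.
            return "canceled"
    return "successful"

print(run_with_cancel(iter(range(5)), lambda: False))
print(run_with_cancel(iter(range(5)), lambda: True))
```

The point of the sketch is that the runner cannot distinguish "the user canceled" from "the callback misbehaved"; both look like a True return value.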
Found the cause (of our problem at least): connections were terminated by the loadbalancer in front of the K8s cluster AWX was launching pods in. We have AWX running in cluster A while jobs are being run in cluster B.
The loadbalancer cut the connection, as no data was being transferred. After increasing the client & server timeouts to values acceptable for our use-case, jobs completed.
Another cause can be the AWX setting DEFAULT_JOB_IDLE_TIMEOUT, which is passed to ansible-runner as idle_timeout.
If it is 0, ansible-runner sets a default of 600 seconds:
https://ansible-runner.readthedocs.io/en/stable/intro/?highlight=idle_timeout#env-settings-settings-for-runner-itself
We've increased this as well, and now our jobs are running just fine.
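As a sketch of that fallback, here is the documented behavior expressed as a tiny function. This is an assumption based on the ansible-runner documentation linked above, not code taken from the runner source; the function name is invented:

```python
# Fallback per the ansible-runner env-settings docs: an idle_timeout of 0
# (as passed from AWX's DEFAULT_JOB_IDLE_TIMEOUT) effectively becomes
# 600 seconds inside ansible-runner.

RUNNER_IDLE_TIMEOUT_DEFAULT = 600  # seconds, per the documentation

def effective_idle_timeout(configured: int) -> int:
    """Return the idle timeout ansible-runner would actually enforce."""
    # 0 (or unset) falls back to the runner's own default.
    return configured if configured else RUNNER_IDLE_TIMEOUT_DEFAULT

print(effective_idle_timeout(0))     # AWX default of 0 falls back to 600
print(effective_idle_timeout(3600))  # an explicit value passes through
```

So leaving DEFAULT_JOB_IDLE_TIMEOUT at 0 does not mean "no idle timeout"; it means modules that produce no output for 10 minutes can get the run killed.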
@AWX maintainers reading this:
I would not call this issue solved yet, because the main issue is that ansible-runner - for whatever reason - thinks it is being canceled, and AWX just errors out without any explanation.
As connection issues can arise from a multitude of causes, AWX should catch and explain them to its users to avoid headaches. A keepalive between receptor and ansible-runner could also help here.
Edit, additional info:
Our instance of this issue came up when no data was being transferred by ansible-runner due to modules running for a long time before returning output. In our case the loadbalancer has an idle timeout for client & server, and both have to be longer than the expected runtime of these Ansible modules. It can of course be replicated with an ansible.builtin.command task running sleep 3600 or something similar.
I know that this update_model method used in the cancel callback has an annoying behavior of returning None when a database error happens, and it doesn't recover.
However, I have still usually seen a traceback in the logs that happened as a side effect of this. In awx.main.tasks.jobs, it will try to do something with self.instance, and it'll give some kind of error because that is None. This handling isn't ideal; it would be preferable to state that the database connection was lost. However, I don't see any documented evidence of that here.
It could be, as comments here suggest, that we get CallbackError from ansible-runner, and in that case I'm not sure how it would play out.
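A rough illustration of that failure mode (the names here are invented for the sketch, not AWX's actual API): when the refresh step returns None after a database error, naive attribute access crashes instead of reporting the lost connection. A more defensive callback might look like this:

```python
# Hypothetical cancel callback. `fetch_instance` stands in for something like
# the update_model() behavior described above, which can return None when a
# database error happens.

def cancel_callback(fetch_instance):
    instance = fetch_instance()  # may be None after a database error
    if instance is None:
        # Surface the real problem instead of crashing on attribute access
        # or accidentally signalling a bogus cancel to the runner.
        raise RuntimeError("database connection lost while checking cancel flag")
    return instance.get("cancel_flag", False)

print(cancel_callback(lambda: {"cancel_flag": False}))
```

The design point is simply that "I could not find out whether the job was canceled" should be a distinct, loud outcome rather than being folded into True, False, or an AttributeError.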
@ybucci Hi, did @moonrail's suggestion help solve this bug for you?
I think it's related to #11805. In the screenshot there is a 4 hour and 7 minute difference between the start and the error of the playbook.
Issue still present in AWX 21.5.0
I think I have seen this myself recently, but I'm still unclear on the cause. It would be good if anyone could get receptorctl information after it happens: look up the work_unit from the API and, in a shell, run receptorctl work status <unit_id>, which may require flipping the setting to not release work. That could be hard.
I've wondered if this might be hitting https://github.com/ansible/awx/pull/12089, which still needs work to update. Even if this were happening, we would only get a very slightly better message from ansible-runner, possibly that the message isn't JSON... which actually could be the same root cause as the BadZipFile errors seen elsewhere.
I've lost the logs from the case I saw, but I believe it was due to running into the postgres max connection limit.
This issue started in our environment a couple of days back and is still persistent. Restarting the AWX pod worked a few days ago, but now it seems to make no difference, as all jobs are resulting in the ERROR state. Is any workaround known?
Hi, I don't know if this is the same issue, but I have 2 jobs that finished at the same time, with an error, no logs, and a short runtime.
The job output is not complete.
AWX 21.6.0
In the AWX logs there is nothing except Job terminated due to error:
2022-10-27T02:01:01.62302983+02:00 stderr F 2022-10-27 00:01:01,622 WARNING [2e34bef877db4d96bd5915404082b806] awx.main.dispatch job 671164 (error) encountered an error (rc=None), please see task stdout for details.
2022-10-27T02:01:01.627768355+02:00 stderr F 2022-10-27 00:01:01,627 DEBUG [2e34bef877db4d96bd5915404082b806] awx.main.dispatch task add05b22-34f2-4989-9b44-cf7df76e5889 starting awx.main.tasks.system.handle_work_error(*['add05b22-34f2-4989-9b44-cf7df76e5889'])
2022-10-27T02:01:01.627797983+02:00 stderr F 2022-10-27 00:01:01,627 DEBUG [2e34bef877db4d96bd5915404082b806] awx.main.tasks.system Executing error task id add05b22-34f2-4989-9b44-cf7df76e5889, subtasks: [{'type': 'job', 'id': 671164}]
2022-10-27T02:01:01.651078816+02:00 stderr F 2022-10-27 00:01:01,650 DEBUG [2e34bef877db4d96bd5915404082b806] awx.main.dispatch task add05b22-34f2-4989-9b44-cf7df76e5889 starting awx.main.tasks.system.handle_work_success(*[])
Nothing in ansible-runner logs
Hello
I'm having the same error. Some jobs are finishing with error "Job terminated due to error" with incomplete output and no logs.
AWX 21.8.0 on k8s CentOS7
I resolved my problem by increasing pod's max log size.
Edit the kubelet config:
vim /var/lib/kubelet/config.yaml
Add or edit the containerLogMaxSize param and choose a new value (the default is 10Mi):
containerLogMaxSize: 100Mi
Restart kubelet for the change to take effect:
systemctl restart kubelet
All failed jobs now work again.
Hello,
We have the same issue (AWX 21.10.2 in Google Kubernetes Engine). Unfortunately, we cannot change the containerLogMaxSize. Is there any other option?
Thank you
Hello, I'm also facing a problem with job logging. The issue kicked in after an upgrade of Kubernetes (AWS EKS platform) in combination with AWX updates. Currently on the latest AWX (multiple versions tested over the past month) and k8s 1.24. It's probably due to the k8s switch to containerd.
(I have no option to downgrade EKS/k8s, and downgrading AWX is also not a viable option, but I tested a few of the latest versions of AWX and the results are the same.)
I was looking at reported issues and followed the "containerLogMaxSize" increase, but that does not make any change to the job output. I'm pretty sure that my action is failing due to AWS/botocore issues, but I have no indication of it from the AWX logs. The output I get in the web UI at verbosity 0 is just blank. With verbosity set to 3 I see some job details, but it ends abruptly.
The job status is the same for multiple attempts, and I quote -> "Job terminated due to error"
In the task container log I can see only this: 2023-01-12T00:51:29.225191663Z stderr F 2023-01-12 00:51:29,224 INFO [4ea4c01a8d0c4b8b952845d19f698f0f] awx.analytics.job_lifecycle job-3151 finalize run 2023-01-12T00:51:29.255313496Z stderr F 2023-01-12 00:51:29,254 WARNING [4ea4c01a8d0c4b8b952845d19f698f0f] awx.main.dispatch job 3151 (error) encountered an error (rc=None), please see task stdout for details.
Job 3151 is the one which is failing, but it's also part of a workflow, so I get some more output for the failed workflow as well, which I consider unrelated. Also, the same workflow is working on my really old prod instance (10.0.0), which is kept for exactly the kind of failure I'm facing now.
I can't recall which version of AWX we had that was working fine, but I feel this is more likely to be the result of the k8s upgrade on top of the AWX upgrades (it was around 6 minor versions of AWX). After the upgrade, I had problems with passing "containerLogMaxSize" to the kubelet configuration, and for now it's done manually on the worker node by editing the kubelet systemd policy and restarting kubelet. Systemctl status shows that kubelet has the new setting for "containerLogMaxSize", but it has no effect on the AWX/job output.
Please advise on how I can proceed. Are there any tests I can perform? I'll appreciate any guidance.
In addition to my previous comment:
I added "no_log: true" to the failing job and it is now working, even though "containerLogMaxSize" has been set to 500MB for two weeks now. So it seems it is related to the amount of logs, but not at the container storage level. Not sure, but I can still be of service in case of any tests you would want me to perform.
I will test a few more workflow runs with the "no_log" option over the next few days to see if this really resolves the problem.
My problem was solved with "Job Slicing: 5" parameter in the template's settings (God bless Telegram chats)
Could you please share the chat URL? :)
the very good one - https://t.me/pro_ansible
*just in case - the primary chat language is Russian
Depending on what version of Kubernetes is used you may run into this problem. Take a look at https://github.com/ansible/awx/issues/13380#issuecomment-1426903462 for a possible temporary solution until the fix is released with a later AWX version
Related issue in receptor: https://github.com/ansible/receptor/issues/736
@yzhivkov - Thanks for pointing out a potential solution - it is working! I had been tracking multiple issues around this problem on the AWX GitHub for around a month already and had tried the patched receptor version (the dev one), but having this specific receptor option as an AWX operator modification is easy to implement and, again, it is working. Thank you!
Depending on what version of Kubernetes is used you may run into this problem. Take a look at https://github.com/ansible/awx/issues/13380#issuecomment-1426903462 for a possible temporary solution until the fix is released with a later AWX version
But this workaround will revert the fix for long-running jobs that result in an error after 4 hours, I believe.
Same for us, AWX version 22.3.0.
Just want to share that we faced these kinds of job error situations after some time and the solution we found was to increase a few Linux kernel settings on our Kubernetes worker nodes:
fs.inotify.max_user_instances = 81920
fs.inotify.max_user_watches = 819200
It appears that AWX relies very heavily on the inotify API. I haven't seen many comments indicating that other folks were able to resolve their erroring-job issues with these kernel param adjustments, so I wanted to publicize this info.
Personally I know very little about the inotify API, but I would be curious whether someone here knows if AWX might be rather wasteful with inotify watches and instances (especially instances).
AWX version: 22.7.0, but same issue was present in 22.3.0 and lower.
jrgoldfinemiddleton commented Sep 8, 2023
Unfortunately it didn't help in my case :( I upgraded to 22.7.0 earlier and tried to set the params you mentioned. In my case the default params were fs.inotify.max_user_instances = 128 and fs.inotify.max_user_watches = 16777216; I set fs.inotify.max_user_instances = 81920 and fs.inotify.max_user_watches = 167772160. Jobs still fail with these params (I checked that they were applied).
Hi all, does some of you solve the issue? I faced the same in awx 22.3.0. currently upgraded to 23.0.0 and have the same issue. :(
Same here
For me, after upgrading from 22.3.0 to 22.7.0, the error on /api/v2/jobs/ changed from Job terminated due to error to The running ansible process received a shutdown signal.
For me the issue happens on many servers even when running playbooks against a single host, whereas until a few months ago we had playbooks running against hundreds of servers without being interrupted.
Could anyone on the AWX team just tell us which version we should downgrade to until a stable one is available, since this seems to be an issue with more recent versions?
Hi, same for us after upgrading k3s to 1.27.5 (AWX 21.14.0). Before the k3s upgrade (1.24) we didn't have this problem. I have tried disabling reconnect, and the log max size is configured to 500Mi.
In the logs I don't see the problem: the status of the job is successful, but streaming is not complete, there is Job terminated due to error in the events, and an EOF error in awx-EE.
So in our case it looks like setting the kubelet-arg values correctly in config.yaml (rke2) solved the issue:
kubelet-arg:
- container-log-max-files=9
- container-log-max-size=50Mi
or kubelet-arg: [ container-log-max-files=9, container-log-max-size=50Mi ]
Then you can check the rke2-agent service to confirm your arguments were passed correctly. Hope it helps.
@KillJohn it is already configured in kubelet to 500, and the job does not exceed that.
So, in my case, AWX is built on top of AWS EKS. The problem appeared after an upgrade of the underlying EKS platform (Kubernetes from 1.23 to 1.24). This was somehow related to the Kubernetes migration from Docker to containerd. There was no easy way to upgrade to a newer Kubernetes version, as suggested in different threads... Of course I added the kubelet args on the worker nodes,
but that didn't solve the problem.
It seems the issue was related to the AWX "receptor" and its log stream. I can't find the other GitHub issue where it was discussed, but there was a potential solution in a developer patch to AWX/receptor. This was true for AWX versions from around February 2023. Eventually a new Kubernetes version and a new AWX version (with the patched receptor) appeared around late February, and the upgrades did solve my problem. It seems it's related to AWX's log streaming, but it's also platform dependent.
If any of you still facing this issue have built AWX on top of Kubernetes, try upgrading k8s to at least 1.25.x or newer, as that could be a solution (it was in my case), including the latest version of AWX (at least around March this year or newer).
Apologies for the lack of detail; it was solved more than 6 months ago in my case. Still on AWX, all working fine...
And just as a side note: in my case, before the actual remediation, limiting log output in particular Ansible tasks (for example adding "no_log: true" wherever I could) was a helpful countermeasure to let jobs finish - just as a confirmation/PoC that any task generating text output was causing the job to fail (so I used multiple different hacks to limit text output to a minimum). This suggestion can be used to prove/confirm that text output / log streaming is causing this issue...
It would be best if any comments could reference a specific error message, or give an overview of the situation explaining why you could not find a specific error message.
From Job terminated due to error to The running ansible process received a shutdown signal
Architecturally, if the control node for a job is shut down, that job is lost. Depending on your deployment details, K8S can do that. This new message (about "shutdown signal") is helpful - something shut down services while that job was running. You can :+1: work to change that at https://github.com/ansible/awx/issues/13848
There are a lot of different issues mixed in these comments. This type of stability is a big topic, and it is a priority for us. Keeping the discussion focused to the particular error mechanism will help get action.
Hi all! Sorry for the late response. My team and I divided huge jobs into several smaller ones, and now everything works well.
Please confirm the following
Summary
Job finishes with status error, without the complete log
AWX version
21.0.0
Select the relevant components
Installation method
kubernetes
Modifications
yes
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run job
Expected results
Job finishes with error or success, with the complete log
Actual results
Job ends in error status without the complete log
Additional information
Job Output
https://gist.github.com/ybucci/6db652089175de2feb8e1c28492d018f
Failed with error