ybucci opened this issue 2 years ago
Same for us
+1. In the awx_task log there is just:
awx.main.dispatch job 1216 (error) encountered an error (rc=None), please see task stdout for details.
and nothing in stdout.
We also experience this, or at least a similar issue, on AWX 21.1.0.
It happens on inventory updates and regular playbook runs.
It happens with different Ansible Core versions (2.10, 2.11 and 2.12) based entirely on the official ansible-runner Execution Environments.
We cannot find the common thread as to why, though, as these jobs just fail without any interaction.
API reports for these jobs e.g.:
"status": "error",
"execution_environment": 5,
"failed": true,
"started": "2022-06-09T12:27:22.691919Z",
"finished": "2022-06-09T12:28:43.090729Z",
"canceled_on": null,
"elapsed": 80.399,
(...)
"job_explanation": "Job terminated due to error",
When capturing the log of the K8s job, the following is seen: {"status": "canceled", "runner_ident": "2313176"}
ansible-runner also outputs a base64-encoded ZIP file. When investigating its contents, the following can be found:
This seems like ansible-runner receives a True result from cancel_callback:
https://github.com/ansible/ansible-runner/blob/2.2.0/ansible_runner/runner.py#L321
AWX itself passes this function as cancel_callback: https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/callback.py#L177
The task log contains something ambiguous like: 2022-06-09 12:28:43,413 WARNING [ab304382f64e4f8eb5cfb1b9e0a3ec48] awx.main.dispatch job 2313176 (error) encountered an error (rc=None), please see task stdout for details.
The rc is missing in this message, even though we can confirm it is being returned by ansible-runner.
Also, the error raised is of type TaskError, based on the message format and wording:
https://github.com/ansible/awx/blob/21.1.0/awx/main/exceptions.py#L22
https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/jobs.py#L588
As the rc from ansible-runner is not used, TaskError is raised instead of TaskCancel and the job status is error. There seems to be a bug, e.g. in AWXReceptorJob, that causes a cancellation of ansible-runner but an error in AWXReceptorJob.
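To make the suspected mechanism concrete, here is a minimal, self-contained sketch. This is not AWX or ansible-runner code; `run_with_cancel` and its arguments are invented for illustration. It only shows the general shape of a cancel-polling loop in which any truthy return from cancel_callback flips an otherwise healthy run into the "canceled" state:

```python
# Hypothetical sketch of a cancel-polling loop, illustrating the behavior
# described above: any truthy value from cancel_callback ends the run as
# "canceled", regardless of whether a user actually requested cancellation.

def run_with_cancel(events, cancel_callback):
    """Consume `events` until exhausted or until cancel_callback() is truthy."""
    for _ in events:
        if cancel_callback():
            # A callback that misfires (e.g. because of a dropped connection
            # on the controller side) cancels the job even though nobody asked.
            return "canceled"
    return "successful"

print(run_with_cancel(iter(range(5)), lambda: False))
print(run_with_cancel(iter(range(5)), lambda: True))
```

The point of the sketch is that the runner cannot distinguish "the user canceled" from "the callback misbehaved"; both look like a True return value.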
Found the cause (of our problem at least): connections were terminated by the loadbalancer in front of the K8s cluster AWX was launching pods in. We have AWX running in cluster A while jobs are being run in cluster B.
The loadbalancer cut the connection, as no data was being transferred. After increasing the client & server timeouts to values acceptable for our use-case, jobs completed.
Another cause can be the AWX setting DEFAULT_JOB_IDLE_TIMEOUT, which is passed to ansible-runner as idle_timeout.
If it is 0, ansible-runner sets a default of 600 seconds:
https://ansible-runner.readthedocs.io/en/stable/intro/?highlight=idle_timeout#env-settings-settings-for-runner-itself
We've increased this as well, and now our jobs are running just fine.
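As a sketch of that fallback, here is the documented behavior expressed as a tiny function. This is an assumption based on the ansible-runner documentation linked above, not code taken from the runner source; the function name is invented:

```python
# Fallback per the ansible-runner env-settings docs: an idle_timeout of 0
# (as passed from AWX's DEFAULT_JOB_IDLE_TIMEOUT) effectively becomes
# 600 seconds inside ansible-runner.

RUNNER_IDLE_TIMEOUT_DEFAULT = 600  # seconds, per the documentation

def effective_idle_timeout(configured: int) -> int:
    """Return the idle timeout ansible-runner would actually enforce."""
    # 0 (or unset) falls back to the runner's own default.
    return configured if configured else RUNNER_IDLE_TIMEOUT_DEFAULT

print(effective_idle_timeout(0))     # AWX default of 0 falls back to 600
print(effective_idle_timeout(3600))  # an explicit value passes through
```

So leaving DEFAULT_JOB_IDLE_TIMEOUT at 0 does not mean "no idle timeout"; it means modules that produce no output for 10 minutes can get the run killed.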
@AWX maintainers reading this:
I would not call this issue solved yet, because the main issue is that ansible-runner - for whatever reason - thinks it is being canceled, and AWX just errors out without any explanation.
As connection issues can arise from a multitude of causes, AWX should catch and explain them to its users to avoid headaches. A keepalive between receptor and ansible-runner could also help here.
Edit, additional info:
Our instance of this issue came up when no data was being transferred by ansible-runner due to modules running for a long time before returning output. In our case the loadbalancer has an idle timeout for client & server, and both have to be longer than the expected runtime of these Ansible modules. It can of course be replicated with an ansible.builtin.command task running sleep 3600 or something similar.
I know that this update_model method used in the cancel callback has an annoying behavior of returning None when a database error happens, and it doesn't recover.
However, I have still usually seen a traceback in the logs that happened as a side effect of this. In awx.main.tasks.jobs, it will try to do something with self.instance, and it'll give some kind of error because that is None. This handling isn't ideal; it would be preferable to state that the database connection was lost. However, I don't see any documented evidence of that here.
It could be, as comments here suggest, that we get CallbackError from ansible-runner, and in that case I'm not sure how it would play out.
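A rough illustration of that failure mode (the names here are invented for the sketch, not AWX's actual API): when the refresh step returns None after a database error, naive attribute access crashes instead of reporting the lost connection. A more defensive callback might look like this:

```python
# Hypothetical cancel callback. `fetch_instance` stands in for something like
# the update_model() behavior described above, which can return None when a
# database error happens.

def cancel_callback(fetch_instance):
    instance = fetch_instance()  # may be None after a database error
    if instance is None:
        # Surface the real problem instead of crashing on attribute access
        # or accidentally signalling a bogus cancel to the runner.
        raise RuntimeError("database connection lost while checking cancel flag")
    return instance.get("cancel_flag", False)

print(cancel_callback(lambda: {"cancel_flag": False}))
```

The design point is simply that "I could not find out whether the job was canceled" should be a distinct, loud outcome rather than being folded into True, False, or an AttributeError.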
@ybucci Hi, did @moonrail's suggestion help solve this bug for you?
I think it's related to #11805. In the screenshot there is a 4 hour and 7 minute difference between the start and the error of the playbook.
Issue still present in AWX 21.5.0
I think I have seen this myself recently, but I'm still unclear on the cause. It would be good if anyone could get receptorctl information after it happens: look up the work_unit from the API and, in a shell, run receptorctl work status <unit_id>, which may require flipping the setting to not release work. That could be hard.
I've wondered if this might be hitting https://github.com/ansible/awx/pull/12089, which still needs work to update. Even if this were happening, we would only get a very slightly better message from ansible-runner, possibly that the message isn't JSON... which actually could be the same root cause as the BadZipFile errors seen elsewhere.
I've lost the logs from the case I saw, but I believe it was due to running into the postgres max connection limit.
This issue started in our environment a couple of days back and is still persistent. Restarting the AWX pod worked a few days ago, but now it seems to make no difference, as all jobs are resulting in the ERROR state. Is any workaround known?
Hi, I don't know if this is the same issue, but I have 2 jobs that finished at the same time, with an error, no logs, and a short runtime.
The job output is not complete.
AWX 21.6.0
In the AWX logs there is nothing except Job terminated due to error:
2022-10-27T02:01:01.62302983+02:00 stderr F 2022-10-27 00:01:01,622 WARNING [2e34bef877db4d96bd5915404082b806] awx.main.dispatch job 671164 (error) encountered an error (rc=None), please see task stdout for details.
2022-10-27T02:01:01.627768355+02:00 stderr F 2022-10-27 00:01:01,627 DEBUG [2e34bef877db4d96bd5915404082b806] awx.main.dispatch task add05b22-34f2-4989-9b44-cf7df76e5889 starting awx.main.tasks.system.handle_work_error(*['add05b22-34f2-4989-9b44-cf7df76e5889'])
2022-10-27T02:01:01.627797983+02:00 stderr F 2022-10-27 00:01:01,627 DEBUG [2e34bef877db4d96bd5915404082b806] awx.main.tasks.system Executing error task id add05b22-34f2-4989-9b44-cf7df76e5889, subtasks: [{'type': 'job', 'id': 671164}]
2022-10-27T02:01:01.651078816+02:00 stderr F 2022-10-27 00:01:01,650 DEBUG [2e34bef877db4d96bd5915404082b806] awx.main.dispatch task add05b22-34f2-4989-9b44-cf7df76e5889 starting awx.main.tasks.system.handle_work_success(*[])
Nothing in ansible-runner logs
Hello
I'm having the same error. Some jobs are finishing with error "Job terminated due to error" with incomplete output and no logs.
AWX 21.8.0 on k8s CentOS7
I resolved my problem by increasing pod's max log size.
Edit the kubelet config:
vim /var/lib/kubelet/config.yaml
Add or edit the containerLogMaxSize param and choose a new value (the default is 10Mi):
containerLogMaxSize: 100Mi
Restart kubelet for the change to take effect:
systemctl restart kubelet
All failed jobs now work again.
Hello,
We have the same issue (AWX 21.10.2 in Google Kubernetes Engine). Unfortunately, we cannot change the containerLogMaxSize. Is there any other option?
Thank you
Hello, I'm also facing a problem with job logging. The issue kicked in after an upgrade of Kubernetes (AWS EKS platform) in combination with AWX updates. Currently on the latest AWX (multiple versions tested over the past month) and k8s 1.24. It's probably due to the k8s switch to containerd.
(I have no option to downgrade EKS/k8s, and downgrading AWX is also not a viable option, but I tested a few of the latest versions of AWX and the results are the same.)
I was looking at reported issues and followed the "containerLogMaxSize" increase, but that does not make any change to the job output. I'm pretty sure that my action is failing due to AWS/botocore issues, but I have no indication of it from the AWX logs. The output I get in the web UI at verbosity 0 is just blank. With verbosity set to 3 I see some job details, but it ends abruptly.
The job status is the same for multiple attempts, and I quote -> "Job terminated due to error"
In the task container log I can see only this: 2023-01-12T00:51:29.225191663Z stderr F 2023-01-12 00:51:29,224 INFO [4ea4c01a8d0c4b8b952845d19f698f0f] awx.analytics.job_lifecycle job-3151 finalize run 2023-01-12T00:51:29.255313496Z stderr F 2023-01-12 00:51:29,254 WARNING [4ea4c01a8d0c4b8b952845d19f698f0f] awx.main.dispatch job 3151 (error) encountered an error (rc=None), please see task stdout for details.
Job 3151 is the one which is failing, but it's also part of a workflow, so I get some more output for the failed workflow as well, which I consider unrelated. Also, the same workflow is working on my really old prod instance (10.0.0), which is kept for exactly the kind of failure I'm facing now.
I can't recall which version of AWX we had that was working fine, but I feel this is more likely to be the result of the k8s upgrade on top of the AWX upgrades (it was around 6 minor versions of AWX). After the upgrade, I had problems with passing "containerLogMaxSize" to the kubelet configuration, and for now it's done manually on the worker node by editing the kubelet systemd policy and restarting kubelet. Systemctl status shows that kubelet has the new setting for "containerLogMaxSize", but it has no effect on the AWX/job output.
Please advise on how I can proceed. Are there any tests I can perform? I'll appreciate any guidance.
In addition to my previous comment:
I added "no_log: true" to the failing job and it is now working, even though "containerLogMaxSize" has been set to 500MB for two weeks now. So it seems it is related to the amount of logs, but not at the container storage level. Not sure, but I can still be of service in case of any tests you would want me to perform.
I will test a few more workflow runs with the "no_log" option over the next few days to see if this really resolves the problem.
My problem was solved with "Job Slicing: 5" parameter in the template's settings (God bless Telegram chats)
Could you please share the chat URL? :)
the very good one - https://t.me/pro_ansible
*just in case - the primary chat language is Russian
Depending on what version of Kubernetes is used you may run into this problem. Take a look at https://github.com/ansible/awx/issues/13380#issuecomment-1426903462 for a possible temporary solution until the fix is released with a later AWX version
Related issue in receptor: https://github.com/ansible/receptor/issues/736
@yzhivkov - Thanks for pointing out a potential solution - it is working! I had been tracking multiple issues around this problem on the AWX GitHub for around a month already and had tried the patched receptor version (the dev one), but having this specific receptor option as an AWX operator modification is easy to implement and, again, it is working. Thank you!
Depending on what version of Kubernetes is used you may run into this problem. Take a look at https://github.com/ansible/awx/issues/13380#issuecomment-1426903462 for a possible temporary solution until the fix is released with a later AWX version
But this workaround will revert the fix for long-running jobs that result in an error after 4 hours, I believe.
Same for us, AWX version 22.3.0.
Just want to share that we faced these kinds of job error situations after some time and the solution we found was to increase a few Linux kernel settings on our Kubernetes worker nodes:
fs.inotify.max_user_instances = 81920
fs.inotify.max_user_watches = 819200
It appears that AWX relies very heavily on the inotify API. I haven't seen many comments indicating that other folks were able to resolve their erroring-job issues with these kernel param adjustments, so I wanted to publicize this info.
Personally I know very little about the inotify API, but I would be curious whether someone here knows if AWX might be rather wasteful with inotify watches and instances (especially instances).
AWX version: 22.7.0, but same issue was present in 22.3.0 and lower.
jrgoldfinemiddleton commented Sep 8, 2023
Unfortunately it didn't help in my case :( I upgraded to 22.7.0 earlier and tried to set the params you mentioned. In my case the default params were fs.inotify.max_user_instances = 128 and fs.inotify.max_user_watches = 16777216; I set fs.inotify.max_user_instances = 81920 and fs.inotify.max_user_watches = 167772160. Jobs still fail with these params (I checked that they were applied).
Hi all, does some of you solve the issue? I faced the same in awx 22.3.0. currently upgraded to 23.0.0 and have the same issue. :(
Same here
For me, after upgrading from 22.3.0 to 22.7.0, the error on /api/v2/jobs/ changed from Job terminated due to error to The running ansible process received a shutdown signal.
For me the issue happens on many servers even when running playbooks against a single host, whereas until a few months ago we had playbooks running against hundreds of servers without being interrupted.
Could anyone on the AWX team just tell us which version we should downgrade to until a stable one is available, since this seems to be an issue with more recent versions?
Hi, same for us after upgrading k3s to 1.27.5 (AWX 21.14.0). Before the k3s upgrade (1.24) we didn't have this problem. I have tried disabling reconnect, and the log max size is configured to 500Mi.
In the logs I don't see the problem: the status of the job is successful, but streaming is not complete, there is Job terminated due to error in the events, and an EOF error in awx-EE.
So in our case it looks like setting the kubelet-arg values correctly in config.yaml (rke2) solved the issue:
kubelet-arg:
- container-log-max-files=9
- container-log-max-size=50Mi
or kubelet-arg: [ container-log-max-files=9, container-log-max-size=50Mi ]
Then you can check the rke2-agent service to confirm your arguments were passed correctly. Hope it helps.
@KillJohn it is already configured in kubelet to 500, and the job does not exceed that.
So, in my case, AWX is built on top of AWS EKS. The problem appeared after an upgrade of the underlying EKS platform (Kubernetes from 1.23 to 1.24). This was somehow related to the Kubernetes migration from Docker to containerd. There was no easy way to upgrade to a newer Kubernetes version, as suggested in different threads... Of course I added the kubelet args on the worker nodes,
but that didn't solve the problem.
It seems the issue was related to the AWX "receptor" and its log stream. I can't find the other GitHub issue where it was discussed, but there was a potential solution in a developer patch to AWX/receptor. This was true for AWX versions from around February 2023. Eventually a new Kubernetes version and a new AWX version (with the patched receptor) appeared around late February, and the upgrades did solve my problem. It seems it's related to AWX's log streaming, but it's also platform dependent.
If any of you still facing this issue have built AWX on top of Kubernetes, try upgrading k8s to at least 1.25.x or newer, as that could be a solution (it was in my case), including the latest version of AWX (at least around March this year or newer).
Apologies for the lack of detail; it was solved more than 6 months ago in my case. Still on AWX, all working fine...
And just as a side note: in my case, before the actual remediation, limiting log output in particular Ansible tasks (for example adding "no_log: true" wherever I could) was a helpful countermeasure to let jobs finish - just as a confirmation/PoC that any task generating text output was causing the job to fail (so I used multiple different hacks to limit text output to a minimum). This suggestion can be used to prove/confirm that text output / log streaming is causing this issue...
It would be best if any comments could reference a specific error message, or give an overview of the situation explaining why you could not find a specific error message.
From Job terminated due to error to The running ansible process received a shutdown signal
Architecturally, if the control node for a job is shut down, that job is lost. Depending on your deployment details, K8S can do that. This new message (about "shutdown signal") is helpful - something shut down services while that job was running. You can :+1: work to change that at https://github.com/ansible/awx/issues/13848
There are a lot of different issues mixed in these comments. This type of stability is a big topic, and it is a priority for us. Keeping the discussion focused to the particular error mechanism will help get action.
Hi all! Sorry for the late response. My team and I divided huge jobs into several smaller ones, and now everything works well.
Please confirm the following
Summary
Job finishes with status error, without the complete log
AWX version
21.0.0
Select the relevant components
Installation method
kubernetes
Modifications
yes
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run job
Expected results
Job finishes with error or success, with the complete log
Actual results
Job ends in error status without the complete log
Additional information
Job Output
https://gist.github.com/ybucci/6db652089175de2feb8e1c28492d018f
Failed with error