ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Failed to JSON parse a line from worker stream due to unexpected EOF(b'') #14693

Open aryklein opened 10 months ago

aryklein commented 10 months ago

Bug Summary

I've encountered an issue with some of my jobs in AWX 23.3.1, and I’m hoping to gather insights or solutions from anyone who might have faced a similar problem.

Error Message

The execution fails with an error message on the details tab:

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b'

Environment

AWX version: 23.3.1
Deployment platform: Kubernetes
Operator version: 2.7.2

I'm inclined to think that this issue stems from a UI error during the process of logging data from the pod, although I'm not entirely certain.

Any ideas?

AWX version

23.3.1

Installation method

kubernetes

Modifications

no

Ansible version

ansible [core 2.16.0]

Operating system

Linux

Web browser

Firefox, Chrome

Steps to reproduce

It happens randomly when executing some job templates.

Expected results

Finish the Ansible playbook execution without errors

Actual results

The jobs don't actually fail; they just finish the run with the error:

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b'

Additional information

I tried using different Ansible EE versions (including the latest one).

jessicamack commented 10 months ago

Hello, we'll need some further information. Can you please set the following settings: AWX_CLEANUP_PATHS = False and RECEPTOR_RELEASE_WORK = False. Then check the /tmp/awx_/artifacts//job_events directory and confirm all the files there are JSON. If one isn't, please report back which one(s). This should all be done in the EE container in the task pod. Please also share the logs for the job pods. We're trying to confirm how far the job was able to run before failure.
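
One rough way to run that check (a sketch, not an official procedure: it assumes python3 is available inside the EE container, that the temp directory survives once cleanup is disabled, and the glob pattern is illustrative):

for f in /tmp/awx_*/artifacts/*/job_events/*.json; do
  # flag any event file that is not valid JSON
  python3 -m json.tool "$f" > /dev/null 2>&1 || echo "invalid JSON: $f"
done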

TheRealHaoLiu commented 10 months ago

Can you also provide the receptor log, and tell us what kind of Kubernetes you are using?

TheRealHaoLiu commented 10 months ago

Can you retry with the latest EE image for the control plane EE? You can do this by changing the imagePullPolicy in AWX to Always and then switching it back to IfNotPresent.
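
One way to do that declaratively is to patch the AWX custom resource (a sketch only; the namespace and CR name are placeholders, and the image_pull_policy spec field should be verified against your operator version):

kubectl -n <namespace> patch awx <awx-cr-name> --type merge -p '{"spec": {"image_pull_policy": "Always"}}'
# once the control plane EE image has been re-pulled, switch it back
kubectl -n <namespace> patch awx <awx-cr-name> --type merge -p '{"spec": {"image_pull_policy": "IfNotPresent"}}'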

aryklein commented 10 months ago

can u also provide the receptor log and what kind of kubernetes you are using

Kubernetes v1.24.2, self-managed with kubeadm. How can I get the receptor log?

Apologies for my limited knowledge on this subject. I attempted to add AWX_CLEANUP_PATHS=False and RECEPTOR_RELEASE_WORK=False to the ConfigMap, but couldn't locate the appropriate section for them. Consequently, I entered them in the web UI, specifically under Settings, within the Jobs section, as Extra Environment Variables. Could you please confirm whether this approach is correct?

I achieved the desired configuration by editing the AWX object. Specifically, I added the extra settings in the YAML file under the 'spec' section. The adjustments were as follows:

extra_settings:
   - setting: RECEPTOR_RELEASE_WORK
     value: "False"
   - setting: AWX_CLEANUP_PATHS
     value: "False"
aryklein commented 10 months ago

@jessicamack,

Then check the /tmp/awx__/artifacts//job_events directory. Confirm all the files there are JSON.

The directory is empty:

kubectl exec -it awx-task-78cbf7c589-bzgd8 -c awx-task  -- bash                                                                                                                                                                               
bash-5.1# ls tmp/awx_7976_m5_ejtdl/artifacts/7976/job_events/
bash-5.1# 

Here you also have the logs from job pod: automation-job-7976-z6vnr.log

fosterseth commented 10 months ago

Your provided job output log looks good at first glance. Does the UI job output stdout page show all of those events?

Where exactly in the UI are you seeing Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b'?

Can you provide a screenshot of it?

aryklein commented 10 months ago

In the provided screenshot, line 6950 appears as the final line visible within the user interface. As a next step, I plan to conduct an experiment by deploying the same AWX setup on EKS, as opposed to using my self-managed Kubernetes cluster.

BTW, I found 2 more users with the same issue: https://www.reddit.com/r/awx/comments/176za7y/issue_with_json_parsing_error_in_awx_2320_on/

(Screenshots attached: 2023-12-06_13-54-01, 2023-12-06_13-55-04)

Dodexq commented 10 months ago

Same issue. Cluster version v1.24.12+rke2r1, AWX 23.3.1, the error occurs randomly, yesterday everything worked on custom Execution Environments

UPD: When I set env https://github.com/ansible/receptor/pull/683 ee_extra_env: |

Hope this helps you

djyasin commented 9 months ago

Hello @aryklein thank you for providing those additional screenshots.

Could you go to Settings > Troubleshooting Settings > Edit and, from there, turn off temp dir cleanup and receptor release work.

Then, in the control plane EE container (this is located in the task pod), get the /tmp/awx_<job_id>_<"*"> directory.

Also get the /tmp/receptor/<node name>/<work unit id> directory.

You should be able to get the work unit id from the API for that job run.

Please provide us with the artifacts directory and the stdout file.
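
For anyone unsure how to pull those files off the cluster, a rough kubectl sketch (the pod name is the one from earlier in this thread; the EE container name is typically <instance-name>-ee, and the node name, work unit id, and job id are placeholders to substitute for your deployment):

# list the receptor work units in the control plane EE container
kubectl exec -it awx-task-78cbf7c589-bzgd8 -c awx-ee -- ls /tmp/receptor/
# copy the stdout file and the artifacts directory locally
kubectl cp -c awx-ee awx-task-78cbf7c589-bzgd8:/tmp/receptor/<node name>/<work unit id>/stdout ./stdout
kubectl cp -c awx-ee awx-task-78cbf7c589-bzgd8:/tmp/awx_<job_id>_<suffix>/artifacts ./artifacts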

aryklein commented 9 months ago

@djyasin I think I already did that here, right? https://github.com/ansible/awx/issues/14693#issuecomment-1836626977

mxs-weixiong commented 9 months ago

I got this error as well.

"Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''

AWX 23.4.0, Execution Environment 23.0.0

I have no idea what is causing this. The logs don't tell me anything. It occurs on different jobs.

If anyone has an idea of how to troubleshoot this, please advise.

Thanks.

kuba2915 commented 9 months ago

+1

My details: kubeadm cluster 1.28 on Proxmox; AWX Operator 2.9.0 (Helm, installed by Argo with default values); AWX definition with only a spec for ingress.

The logs are the same in my case. It looks like the tasks are completed, but AWX fails when reading the status from the execution environment.

Also, when verbosity is set to debug, tasks are successful, but not always (roughly a 70% chance).

aryklein commented 9 months ago

I recently migrated my AWX deployment to EKS, with Kubernetes version 1.25, and the issue has completely vanished

mis4s commented 9 months ago

Same issue on k3s v1.28

Edit: it seems to be resolved after updating inotify:

fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
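
These sysctls need to be set on the Kubernetes nodes themselves, not inside the pods. A minimal sketch of applying them immediately and persistently (file name is illustrative; assumes root access on every node that can run AWX task or job pods):

# apply the limits right away
sudo sysctl -w fs.inotify.max_user_instances=8192 fs.inotify.max_user_watches=524288
# persist them across reboots and reload
printf 'fs.inotify.max_user_instances = 8192\nfs.inotify.max_user_watches = 524288\n' | sudo tee /etc/sysctl.d/99-awx-inotify.conf
sudo sysctl --system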
JSGUYOT commented 9 months ago

Same issue with: AWX 23.5.1, AWX Operator 2.9.0, K3s v1.27.2+k3s1

rchaud commented 9 months ago

I'm experiencing the same issue on 23.3.0 and K8s 1.24. I will dig deeper to troubleshoot and capture some logs after the holidays.

marek1712 commented 9 months ago

Same issue on k3s v1.28

Thanks!

I was on 1.27 and inotify increases didn't change a thing. Updating to 1.28 helped.

mattiaoui commented 9 months ago

Hi, I have the same error on AWX 23.5.1. I restored the DB from an old k3s server; project sync works without problems, but all templates terminate with the error: Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''

Template OUTPUT: Worker output: ": "/tmp/awx_3386_b795lk3f", "JOB_ID": "3386", "INVENTORY_ID": "3", "PROJECT_REVISION": "e6d9f8f52adb8de7d052019a05326020c0d1cc4a", "ANSIBLE_RETRY_FILES_ENABLED": "False", "MAX_EVENT_RES": "700000", "AWX_HOST": "https://awx.xxx.xxx.loc", "ANSIBLE_SSH_CONTROL_PATH_DIR": "/runner/cp", "ANSIBLE_COLLECTIONS_PATHS": "/runner/requirements_collections:collections:~/.ansible/collections:/usr/share/ansible/collections", "ANSIBLE_ROLES_PATH": "/runner/requirements_roles:~/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles", "ANSIBLE_COLLECTIONS_PATH": "/runner/requirements_collections:~/.ansible/collections:/usr/share/ansible/collections", "ANSIBLE_CALLBACK_PLUGINS": "/usr/local/lib/python3.9/site-packages/ansible_runner/display_callback/callback", "ANSIBLE_STDOUT_CALLBACK": "awx_display", "AWX_ISOLATED_DATA_DIR": "/runner/artifacts/3386", "RUNNER_OMIT_EVENTS": "False", "RUNNER_ONLY_FAILED_EVENTS": "False"}, "cwd": "/runner/project"} {"status": "running", "runner_ident": "3386"}.

I tried several different EE releases, but the problem is the same. Any ideas? Bye, Mattia

marek1712 commented 9 months ago

@mattiaoui Upgrade your K3s to 1.28. Then follow this tip: https://github.com/ansible/awx/issues/14693#issuecomment-1861167020

mattiaoui commented 9 months ago

Many thanks! After the upgrade, all templates work 🤟🤟. Happy holidays and New Year!

kuba2915 commented 9 months ago

In my case, a temporary "fix" was setting verbosity to Debug (3) on all my templates. After this change, all scheduled tasks were successful for about a week.

But during the upgrade to k8s 1.29 I noticed there was an issue with kube-proxy on the node running AWX, because it was stuck in a CrashLoopBackOff state. I applied the fix from https://github.com/kubernetes-sigs/kind/issues/2744#issuecomment-1127808069 and it solved my problem.

It is basically the same as what @marek1712 mentioned, but with different values.

rchaud commented 9 months ago

I think I have found the issue. It seems that all task controller pods run 4 containers: redis, task (awx image), rsyslog (awx image), and ee (awx-ee image).

I noticed that the ee container in those pods was running awx-ee:latest. I changed the deployment to use the same awx-ee version as the awx deployment, and that resolved the problem for me.

I think it is something with the latest image running as the controller.

I did not make any changes to my sysctl inodes or anything else.

I am running version 23.2.0 and everything is working perfectly after matching the ee container to my running awx version.

I suspect it might be a bug introduced in ansible/awx-ee. I see it had not been updated since June, and then there were some updates in November and December.

ngsin commented 9 months ago

@chinochao It seems like a bug in ansible-runner (a Python package AWX uses). BTW, which awx-ee version do you use?

rchaud commented 9 months ago

I am using 23.2.0 for awx and awx-ee in the deployment. For the EE in the AWX UI, I have latest configured. It seems the issue is using awx-ee as latest in the controller task containers.

ngsin commented 9 months ago

I set control_plane_ee_image and the Execution Environment image to the same version and tried both 23.2.0 and 23.3.1. The execution still fails with the JSON parse error.

The error is more likely to occur when the job contains a lot of hosts.

rchaud commented 9 months ago

Can you provide the output from kubectl so we can make sure the awx-ee image is not using latest? Something like kubectl describe on one of the task pods.

ngsin commented 9 months ago

The describe output for the awx-task pod:

  awx-demo-ee:
    Container ID:  containerd://51b3acb3d02caeb4462bd64a1b089006aea6652a78d83643d005ba3e0307c9b4
    Image:         quay.io/ansible/awx-ee:23.2.0
    Image ID:      quay.io/ansible/awx-ee@sha256:3695f4c0d11557da70fe8182ffe9046d5ce1df0770137f08f51705ab6c08ae05
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      if [ ! -f /etc/receptor/receptor.conf ]; then
        cp /etc/receptor/receptor-default.conf /etc/receptor/receptor.conf
        sed -i "s/HOSTNAME/$HOSTNAME/g" /etc/receptor/receptor.conf
      fi
      exec receptor --config /etc/receptor/receptor.conf

    State:          Running
      Started:      Wed, 03 Jan 2024 12:15:30 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
      memory:     64Mi
    Environment:  <none>

And the describe output for the job pod:

Containers:
  worker:
    Container ID:  containerd://f761224536cecb6311a6a08fcb05090ff7bec290c55f0b6b1a7504865b5a1aef
    Image:         quay.io/ansible/awx-ee:23.2.0
    Image ID:      quay.io/ansible/awx-ee@sha256:3695f4c0d11557da70fe8182ffe9046d5ce1df0770137f08f51705ab6c08ae05
    Port:          <none>
    Host Port:     <none>
    Args:
      ansible-runner
      worker
      --private-data-dir=/runner
    State:          Running
      Started:      Wed, 03 Jan 2024 13:18:17 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        250m
      memory:     100Mi
    Environment:  <none>
    Mounts:       <none>
rchaud commented 9 months ago

Can you check the logs in your awx-task pod for the awx-demo-ee container? I just tried with the controller using 23.2.0 and the EE with latest, 23.2.0, and even a custom EE, and it works.

$ kubectl -n awx-nonprod describe pod awx-nonprod-task-7c8888487-wfc27|grep "Image:" -B 2
  init:
    Container ID:  docker://755693739eba50d8206a4d088d5ba876bdfe11591cfe363488e2182c5d25a435
    Image:         quay.io/ansible/awx-ee:latest
--
  redis:
    Container ID:  docker://2a1645f2d203eb6bd41566bf5e6694543f8324c3d2940f589bb9f41b3755f6b7
    Image:         docker.io/redis:7
--
  awx-nonprod-task:
    Container ID:  docker://41d2dca0a4a2eb280c98bf2ef02a5d0069c60344ac60fe747e21820d2b75cf35
    Image:         quay.io/ansible/awx:23.2.0
--
  awx-nonprod-ee:
    Container ID:  docker://5568a15a85e738e00dbc5244946363c1be8a6f75f43f38b42e73338eb2868c3e
    Image:         quay.io/ansible/awx-ee:23.2.0
--
  awx-nonprod-rsyslog:
    Container ID:  docker://bff7db078e945debeb5c0dd7918168285bff089ca8d30909f8cb544a75ecf4c1
    Image:         quay.io/ansible/awx:23.2.0

Here is the Job with status Completed.

Containers:
  worker:
    Container ID:  docker://28a7f2f7fa86d9a1bc5aaadfc20543bd8c3d9e4ce7b599f15102ffbc6b4fca50
    Image:         quay.io/ansible/awx-ee:latest
    Image ID:      docker-pullable://quay.io/ansible/awx-ee@sha256:8b612e803b0ea93107d81bf56a2472f6f0950f102263859e385684ba5e3e7bee
    Port:          <none>
    Host Port:     <none>
    Args:
      ansible-runner
      worker
      --private-data-dir=/runner
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 03 Jan 2024 13:17:06 -0600
      Finished:     Wed, 03 Jan 2024 13:17:17 -0600
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        250m
      memory:     100Mi
    Environment:  <none>
    Mounts:       <none>
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:            <none>
QoS Class:          Burstable
Node-Selectors:     <none>
Tolerations:        node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  36m   default-scheduler  Successfully assigned awx-nonprod/automation-job-589816-9c4hn to NODE1
  Normal  Pulling    36m   kubelet            Pulling image "quay.io/ansible/awx-ee:latest"
  Normal  Pulled     35m   kubelet            Successfully pulled image "quay.io/ansible/awx-ee:latest" in 54.920892919s
  Normal  Created    35m   kubelet            Created container worker
  Normal  Started    35m   kubelet            Started container worker
mxs-weixiong commented 9 months ago

Hi @mis4s, can I ask where to apply the inotify changes you mentioned? On the host running the AWX containers? Do the changes also have to be applied to the worker nodes?

Since some containers will run on other nodes as necessary.

marek1712 commented 9 months ago

@mxs-weixiong - yes, you apply it on the host. Worker nodes as well, I guess? https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files

aryklein commented 9 months ago

~~I successfully executed the same AWX deployment in a different Kubernetes (K8s) cluster. Based on this experience, I followed the suggestion made by @mis4s in their comment on this issue: https://github.com/ansible/awx/issues/14693#issuecomment-1861167020. I'm pleased to report that this approach solved my issue as well. Thank you for the valuable input.~~

Edit: the issue happened again, so this is not the solution for me

CaptainStealthy commented 9 months ago

@chinochao Forgive me, as I'm a novice when it comes to Kubernetes - I am running AWX in my homelab on a K3s cluster.

I'm trying to follow the advice in your post, and update the awx-task deployment to match the version of awx-ee, rather than using latest.

But whenever I update the deployment, something overrides my change, and it reverts back to latest. I can even see the task pods being spun up, and then torn down again a few seconds later.

I don't understand how I'm supposed to make this change - even if I just simply scale down the task deployment to 0 replicas, it gets scaled back up to 1 on its own.

Any advice you can provide would be helpful :(

rchaud commented 9 months ago

If you are using the AWX Operator to deploy AWX, it will revert any changes you make manually to the deployment. You will need to set control_plane_ee_image in the AWX kind (the custom resource spec) to update your deployment.

You can read more about it at https://github.com/ansible/awx-operator/blob/07427be0b79411f5c8fd2ffd7caf4fd92fcce47a/docs/user-guide/advanced-configuration/default-execution-environments-from-private-registries.md?plain=1#L23
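
A sketch of what that can look like with kubectl patch (the namespace, CR name, and image tag are placeholders; the operator then rolls the task deployment with the pinned control plane EE image):

kubectl -n <namespace> patch awx <awx-cr-name> --type merge -p '{"spec": {"control_plane_ee_image": "quay.io/ansible/awx-ee:23.2.0"}}'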

mis4s commented 9 months ago

@CaptainStealthy Try checking "Execution Environments" under Administration in the left menu.

@mxs-weixiong: on the nodes where AWX runs (both the task and the worker (EE) pods).

rchaud commented 9 months ago

Execution Environments under Administration doesn't update the image your controller runs. The controller EE image is set via the deployment file.

CaptainStealthy commented 9 months ago

I ended up changing the inotify config on my K3s hosts, as outlined here: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files.

I was only getting this "too many open files" error for any jobs that I queued using (any) custom EE - the default 'latest' AWX EE was working fine. Making the config change below solved it.

To increase these limits temporarily run the following commands on the host:

sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

To make the changes persistent, edit the file /etc/sysctl.conf and add these lines:

fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
2and3makes23 commented 8 months ago

Running AWX 23.5.1 on Openshift (Kubernetes version v1.26.9+aa37255)

We encounter the same 'JSON parsing' error message for what seem to be jobs with many lines of output.

We are able to reproduce the issue with a job that produces roughly 6000 lines of playbook stdout, but haven't been able to get to the root of the problem. Running this job, at some point AWX changes the status to the above-mentioned 'JSON parsing' error message and only shows the playbook output cut off. By disabling Release Receptor Work and looking at the log output within the job container, we can find the complete job output, including the final {"eof": true}.
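
For reference, the same check can be made from outside the pod by tailing the automation job pod's log and looking for the closing marker (pod name taken from the earlier example in this thread; add a namespace flag if needed):

kubectl logs automation-job-7976-z6vnr | tail -n 2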

We have tried running the job with quay.io/ansible/awx-ee:latest as well as our own EE image builds, which did not have any noticeable effect on the outcome.

The above-mentioned approach of increasing inotify watches (https://github.com/ansible/awx/issues/14693#issuecomment-1878856760) does not seem to do the trick either (we are not able to test this ourselves).

Does anyone else have an idea what might be the root cause of this problem?

ryanph commented 8 months ago

Running AWX 23.5.1 on microk8s - encountering the same issue with awx-ee 23.6.0 specifically with playbooks that create large volumes of output.

jon-nfc commented 8 months ago

Running 23.5.1 on k3s 1.26.12 and custom ee image nofusscomputing/ansible-ee:dev (both for task and job pods)

I was also seeing the same issue on jobs with large logs ~>5-6k lines. Implementing the inotify changes on ALL nodes appears to have fixed the issue. Although time will tell.

Edit:

2and3makes23 commented 8 months ago

Interestingly, we hadn't experienced this issue with AWX 23.2.0.

We updated from 23.2.0 to 23.5.1 (in one go) recently; only since then do we see this behavior.

/edit: another noteworthy finding: we noticed that some jobs end with the status 'Successful' but have a cut-off output in AWX as well. This is quite a big deal for us, since up until now we could distinguish based on status.

jurgen-ponds commented 8 months ago

I have the same issue with 23.6.0 and 23.7.0

What I noticed is that the GUI stops logging, but the job keeps on running. Looking at the logging of the execution environment (EE), I can see the job still continues. In the end AWX exits with an error and the message: "Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b'"

I ran a job several times with different EEs (even with the default awx-ee), and every time it stopped logging in the GUI around line 2011! But this differs per job template.

I set the inotify sysctl configuration, but that did not solve my issue.

BTW: running on a 3-node k8s environment.

BRE-MD commented 8 months ago

Same issue with AWX 23.5.1

/tmp/receptor/.../..../stdout in the Task-Pod ends after about 50MB

jon-nfc commented 8 months ago

To aid in the fault finding, as the issue is now occurring again:

I started the job that keeps failing and, from the CLI, ran kubectl logs for both the automation pod and the task pod. Within AWX, as others have reported, the output stops updating after some time; however, the automation pod remains running with output still coming to stdout. In the task pod, however, I noticed a stack trace.

For reference:

Of note regarding the stack trace: contrary to its claim of not being able to reach the Kubernetes API, on all nodes curl -k https://172.16.100.1/api/v1/namespaces does confirm connectivity. No other pods within the cluster have any connectivity issues with the API.

Task pod logs (with stack trace) ``` bash 2024-02-03 02:37:10,005 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-control-limit_reached_1 2024-02-03 02:37:10,005 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-jobs-status_changed 2024-02-03 02:37:13,002 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-schedules-changed 2024-02-03 02:37:33,007 INFO [-] awx.main.wsrelay Producer 172.16.83.83-schedules-changed has no subscribers, shutting down. 2024-02-03 02:38:09,126 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 waiting 2024-02-03 02:38:25,793 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 pre run 2024-02-03 02:38:26,903 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 start job fact cache 2024-02-03 02:38:32,679 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 created 2024-02-03 02:38:32,681 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 controller node chosen 2024-02-03 02:38:32,681 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 execution node chosen 2024-02-03 02:38:43,634 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 pre run 2024-02-03 02:38:44,696 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 preparing playbook 2024-02-03 02:38:46,511 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 running playbook 2024-02-03 02:38:46,549 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 work unit id received 2024-02-03 02:38:48,336 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 work unit id assigned 2024-02-03 02:38:56,122 INFO [765ac92478384572bc6b5c371801c004] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 1944 2024-02-03 02:38:56,128 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 post run 2024-02-03 02:39:01,482 INFO [-] awx.analytics.job_lifecycle projectupdate-1944 stats wrapup finished 2024-02-03 02:39:04,829 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle projectupdate-1944 finalize run 2024-02-03 02:39:09,933 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 preparing playbook 2024-02-03 02:39:16,917 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 running playbook 2024-02-03 02:39:17,662 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 work unit id received 2024-02-03 02:39:21,995 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 work unit id assigned 2024-02-03 02:40:42,989 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of dependency_manager is 0.2935323715209961 seconds in the past 2024-02-03 02:41:54,998 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of metrics_gather is 0.30237483978271484 seconds in the past 2024-02-03 02:43:03,585 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of dependency_manager is 0.8900060653686523 seconds in the past 2024-02-03 02:46:23,586 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of dependency_manager is 0.8908219337463379 seconds in the past 2024-02-03 
02:48:37,005 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of tower_scheduler is 0.3094046115875244 seconds in the past 2024-02-03 02:50:37,099 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of tower_scheduler is 0.4039294719696045 seconds in the past 2024-02-03 02:51:54,606 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-schedules-changed 2024-02-03 02:52:04,609 INFO [-] awx.main.wsrelay Producer 172.16.83.83-schedules-changed has no subscribers, shutting down. 2024-02-03 02:52:08,008 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-job_events-1943 2024-02-03 02:52:08,009 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-jobs-summary 2024-02-03 02:52:38,015 INFO [-] awx.main.wsrelay Producer 172.16.83.83-jobs-summary has no subscribers, shutting down. 2024-02-03 02:52:38,017 INFO [-] awx.main.wsrelay Producer 172.16.83.83-job_events-1943 has no subscribers, shutting down. 2024-02-03 02:52:43,581 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of dependency_manager is 0.8853375911712646 seconds in the past 2024-02-03 02:52:43,689 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of k8s_reaper is -0.006358623504638672 seconds in the past 2024-02-03 02:54:48,333 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of send_subsystem_metrics is 1.637815237045288 seconds in the past 2024-02-03 02:55:54,744 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of metrics_gather is 0.048349857330322266 seconds in the past 2024-02-03 02:57:03,002 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of dependency_manager is 0.3068995475769043 seconds in the past 2024-02-03 02:57:11,000 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-job_events-1943 2024-02-03 02:57:11,001 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-jobs-summary 2024-02-03 02:57:31,006 INFO [-] awx.main.wsrelay Producer 172.16.83.83-job_events-1943 has no subscribers, shutting down. 2024-02-03 02:57:31,009 INFO [-] awx.main.wsrelay Producer 172.16.83.83-jobs-summary has no subscribers, shutting down. 
2024-02-03 02:59:30,228 INFO [-] awx.main.wsrelay Starting producer for 172.16.83.83-schedules-changed 2024-02-03 03:00:03,128 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch.periodic Scheduler next run of dependency_manager is 0.43228983879089355 seconds in the past 2024-02-03 03:01:47,289 ERROR [765ac92478384572bc6b5c371801c004] awx.main.scheduler Failed to list pods for container group default-2 Traceback (most recent call last): File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn conn = connection.create_connection( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection raise err File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection sock.connect(sa) ConnectionRefusedError: [Errno 111] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn conn.connect() File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connection.py", line 358, in connect self.sock = conn = self._new_conn() File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn raise NewConnectionError( urllib3.exceptions.NewConnectionError: : Failed to establish a new connection: [Errno 111] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/scheduler/kubernetes.py", line 28, in list_active_jobs for pod in pm.kube_api.list_namespaced_pod(pm.namespace, label_selector='ansible-awx={}'.format(settings.INSTALL_UUID)).to_dict().get('items', []): File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod return self.list_namespaced_pod_with_http_info(namespace, **kwargs) # noqa: E501 File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info return self.api_client.call_api( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api return self.__call_api(resource_path, method, File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api response_data = self.request( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request return self.rest_client.GET(url, File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 241, in GET return self.request("GET", url, File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/kubernetes/client/rest.py", line 214, in request r = self.pool_manager.request(method, url, File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/request.py", line 74, in request return self.request_encode_url( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/request.py", line 96, in request_encode_url 
return self.urlopen(method, url, **extra_kw) File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/poolmanager.py", line 376, in urlopen response = conn.urlopen(method, u.request_uri, **kw) File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen return self.urlopen( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen return self.urlopen( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen return self.urlopen( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen retries = retries.increment( File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='172.16.100.1', port=443): Max retries exceeded with url: /api/v1/namespaces/automation/pods?labelSelector=ansible-awx%3Dd62ba89b-9f56-4502-ace6-573eedefd3f6 (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused')) 2024-02-03 03:01:51,107 INFO [765ac92478384572bc6b5c371801c004] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 1943 2024-02-03 03:01:51,409 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 post run 2024-02-03 03:02:05,669 INFO [765ac92478384572bc6b5c371801c004] awx.analytics.job_lifecycle job-1943 finalize run 2024-02-03 03:02:07,268 WARNING [765ac92478384572bc6b5c371801c004] awx.main.dispatch job 1943 (error) encountered an error (rc=None), please see task stdout for details. 2024-02-03 03:08:30,437 INFO [-] awx.main.wsrelay Producer 172.16.83.83-schedules-changed has no subscribers, shutting down. ```

Does anyone else have similar stack traces in the task pod for a job that fails?

ryanph commented 8 months ago

I started a job which is intermittently failing, ID 43635

Logs from the task pod just say the job encountered an error. UI shows "Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''

2024-02-05 08:03:06,729 INFO     [-] awx.main.wsrelay Starting producer for 10.1.227.206-schedules-changed
2024-02-05 08:03:16,731 INFO     [-] awx.main.wsrelay Producer 10.1.227.206-schedules-changed has no subscribers, shutting down.
2024-02-05 08:03:52,899 INFO     [5c37d224fc934a34b13732fece0bc963] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 43634
2024-02-05 08:03:52,933 INFO     [5c37d224fc934a34b13732fece0bc963] awx.analytics.job_lifecycle job-43634 post run
2024-02-05 08:03:52,998 INFO     [5c37d224fc934a34b13732fece0bc963] awx.analytics.job_lifecycle job-43634 finish job fact cache
2024-02-05 08:03:53,784 INFO     [-] awx.analytics.job_lifecycle job-43634 stats wrapup finished
2024-02-05 08:03:54,717 INFO     [5c37d224fc934a34b13732fece0bc963] awx.analytics.job_lifecycle job-43634 finalize run
2024-02-05 08:04:01,502 INFO     [-] awx.main.wsrelay Starting producer for 10.1.227.206-schedules-changed
2024-02-05 08:04:02,906 INFO     [-] awx.main.wsrelay Producer 10.1.227.206-jobs-summary has no subscribers, shutting down.
2024-02-05 08:04:03,713 INFO     [-] awx.main.wsrelay Producer 10.1.227.206-job_events-43634 has no subscribers, shutting down.
2024-02-05 08:04:30,528 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 waiting
2024-02-05 08:04:30,673 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 pre run
2024-02-05 08:04:30,771 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 start job fact cache
2024-02-05 08:04:31,508 INFO     [-] awx.main.wsrelay Producer 10.1.227.206-schedules-changed has no subscribers, shutting down.
2024-02-05 08:04:32,014 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 preparing playbook
2024-02-05 08:04:32,139 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 running playbook
2024-02-05 08:04:32,356 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 work unit id received
2024-02-05 08:04:32,391 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 work unit id assigned
2024-02-05 08:04:33,546 INFO     [-] awx.main.wsrelay Starting producer for 10.1.227.206-job_events-43635
2024-02-05 08:04:33,546 INFO     [-] awx.main.wsrelay Starting producer for 10.1.227.206-jobs-summary
2024-02-05 08:13:35,623 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 43635
2024-02-05 08:13:35,634 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 post run
2024-02-05 08:13:35,747 INFO     [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 finalize run
2024-02-05 08:13:35,765 WARNING  [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.main.dispatch job 43635 (error) encountered an error (rc=None), please see task stdout for details.

The automation container pod didn't log any errors to the pod logs, just the expected JSON log lines, the last few of which appeared to be perfectly valid.

blundey commented 8 months ago

Also having the same issue on AWX 23.6. Running a simple one-task playbook that runs uptime on 43 hosts. The inventory does have hosts with lots of groups/vars; wondering if this may be related.

Tried all the suggestions in here to no avail. It seems this is occurring on all 23.x versions?

Found that the issue pertains to having gather_facts enabled. If you set gather_facts: no, then the issue does not occur!

BlankBanshee01 commented 8 months ago

Running AWX 23.3.0 on K8s v1.24.17+rke2r1 (an 11-node beefy cluster). Experiencing the same issue running a role that creates and adds keys to users on lots of hosts (~700), which generates a lot of log output.

* Tried different execution environments (one the same as the running task image version), as well as the latest version.

* Tried increasing the inotify kernel values on one node and scheduling the automation pod on that node.

* Tried setting verbosity to high in the template.

* Tried limiting the target hosts to just 20 hosts (same problem); 10 hosts was fine, though.

* Tried disabling fact gathering and setting no_log on the tasks.

* Tried a combination of all those things above multiple times.

None of which worked; still facing the same issue.

blundey commented 8 months ago

Interesting, so you added the line "gather_facts: no" to the playbook and it still errored? That's the only change I made, and a playbook that failed every time has continued without issue ever since.

BlankBanshee01 commented 8 months ago

Yes, I explicitly set gather_facts to false. In your case, when you set gather_facts to false it might be that less log output was produced (TASK [Gathering Facts] ***** ok.....) or that the hostvars and facts objects weren't populated with many more variables, in case the problem has to do with long log output or lots of hostvars, respectively.

In my case I have a lot of log output for each user I create on each host. Also, each host has an average of 40 variables assigned to it.

TheRealHaoLiu commented 8 months ago

Wow, thanks everyone for providing information here.

For reference, I'm going to gather links to some related places where this particular error message and its discussion are coming up:

We are looking into this.

https://forum.ansible.com/t/execution-environment-failed-to-run-any-job-on-awx-ee-latest-and-awx-ee-23-6-0/3291 https://forum.ansible.com/t/awx-job-terminated-unexpectedly/2256

TheRealHaoLiu commented 8 months ago

I did some digging and found a "possibly" related bug: https://github.com/ansible/receptor/pull/939

For more detail on the debugging and thought process behind it, see https://forum.ansible.com/t/awx-job-terminated-unexpectedly/2256/15?u=therealhaoliu

If anyone is willing to try out a patched awx-ee with the fix, please give quay.io/haoliu/awx-ee:receptor-939 a try.