aryklein opened this issue 10 months ago
Hello, we'll need some further information. Can you please set the following settings: AWX_CLEANUP_PATHS = False and RECEPTOR_RELEASE_WORK = False. Then check the /tmp/awx
Can you also provide the receptor log and say what kind of Kubernetes you are using?
Can you retry with the latest EE image for the control plane EE? You can do this by changing the imagePullPolicy in AWX to Always, then switching it back to IfNotPresent.
Kubernetes v1.24.2, self-managed with kubeadm. How can I get the receptor log?
Apologies for my limited knowledge on this subject. I attempted to include AWX_CLEANUP_PATHS=False and RECEPTOR_RELEASE_WORK=False in the ConfigMap, but couldn't locate the appropriate section to add them. Consequently, I chose to enter them in the web UI, under Settings, in the Jobs section, under Extra Environment Variables. Could you please confirm whether this approach is correct?
I achieved the desired configuration by editing the AWX object. Specifically, I added the extra settings in the YAML file under the 'spec' section. The adjustments were as follows:
extra_settings:
  - setting: RECEPTOR_RELEASE_WORK
    value: "False"
  - setting: AWX_CLEANUP_PATHS
    value: "False"
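For context, a sketch of where these settings sit in the AWX custom resource (the resource name is a placeholder; only the relevant fields are shown):

```yaml
# Hypothetical AWX custom resource; adapt name/namespace to your deployment.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx            # placeholder name
spec:
  extra_settings:
    - setting: RECEPTOR_RELEASE_WORK
      value: "False"
    - setting: AWX_CLEANUP_PATHS
      value: "False"
```

Applying this with `kubectl apply` lets the operator propagate the settings, instead of editing generated objects that the operator would revert.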
@jessicamack,
Then check the `/tmp/awx_<job_id>_<random>/artifacts/<job_id>/job_events` directory. Confirm all the files there are JSON.
The directory is empty:
kubectl exec -it awx-task-78cbf7c589-bzgd8 -c awx-task -- bash
bash-5.1# ls tmp/awx_7976_m5_ejtdl/artifacts/7976/job_events/
bash-5.1#
Here you also have the logs from job pod: automation-job-7976-z6vnr.log
Your provided job output log looks good at first glance. Does the UI job output stdout page show all of those events?
Where exactly in the UI are you seeing "Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''"?
Can you provide a screenshot of it?
In the provided screenshot, line 6950 appears as the final line visible within the user interface. As a next step, I plan to conduct an experiment by deploying the same AWX setup on EKS, as opposed to using my self-managed Kubernetes cluster.
BTW, I found 2 more users with the same issue: https://www.reddit.com/r/awx/comments/176za7y/issue_with_json_parsing_error_in_awx_2320_on/
Same issue. Cluster version v1.24.12+rke2r1, AWX 23.3.1, the error occurs randomly, yesterday everything worked on custom Execution Environments
UPD: I set the env var from https://github.com/ansible/receptor/pull/683 via ee_extra_env: |
Hope this helps you
Hello @aryklein thank you for providing those additional screenshots.
Could you go to Settings > Troubleshooting Settings > Edit, and from there turn off temp dir cleanup and receptor release work?
Then, in the control plane EE container (located in the task pod), get the /tmp/awx_<job_id>_<"*">
and also get the /tmp/receptor/<node name>/<work unit id>
You should be able to get the work unit id from the API for that job run.
Please provide us with the artifacts directory and the stdout file.
@djyasin I think I did that here, right? https://github.com/ansible/awx/issues/14693#issuecomment-1836626977
I got this error as well.
"Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''
AWX 23.4.0 Execution Environment: 23.0.0
I have no idea what is causing this. The logs don't tell me anything, and it occurs on a different job each time.
If anyone has an idea of how to troubleshoot this, please advise me.
Thanks.
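For what it's worth, the error text itself is Python's json module message for an empty input: AWX consumes the worker's stdout as JSON lines, parsing each line as a JSON document, so an empty (or truncated) line produces exactly the message seen in the UI. A minimal reproduction (the helper name is illustrative, not AWX's actual function):

```python
import json

def parse_worker_line(line):
    """Loosely mimic how a JSON-lines worker stream is consumed:
    each line must be a complete JSON document."""
    return json.loads(line)

# A normal status line from the stream parses fine:
parse_worker_line(b'{"status": "running", "runner_ident": "3386"}')

# An empty line -- e.g. from a truncated stream -- raises the error seen in the UI.
try:
    parse_worker_line(b"")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```

So the error message is a symptom of the worker stream being cut short, not of malformed playbook output per se.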
+1
My details: kubeadm cluster 1.28 on Proxmox; AWX Operator 2.9.0 (Helm, installed by Argo with default values); AWX definition with only the spec for ingress.
The logs are the same in my case. It looks like the tasks are completed, but AWX fails when reading the status from the execution environment.
Also, when Verbosity is set to debug, tasks are successful, but not always (~70% of the time).
I recently migrated my AWX deployment to EKS, with Kubernetes version 1.25, and the issue has completely vanished
Same issue on k3s v1.28
Edit: it seems to be resolved after updating inotify:
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
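A small helper to check whether a host's inotify limits meet the values suggested above (the thresholds are just the numbers from this comment, not official requirements):

```python
# Suggested minimums taken from this thread (assumptions, not official guidance).
RECOMMENDED = {
    "fs.inotify.max_user_instances": 8192,
    "fs.inotify.max_user_watches": 524288,
}

def below_recommended(sysctl_output):
    """Parse `sysctl fs.inotify`-style output ("key = value" lines) and
    return the keys whose value is below the recommended minimum."""
    current = {}
    for line in sysctl_output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = int(value.strip())
    return [k for k, v in RECOMMENDED.items() if current.get(k, 0) < v]

# Example: limits typical of a default host would be flagged:
print(below_recommended("fs.inotify.max_user_instances = 128\n"
                        "fs.inotify.max_user_watches = 65536"))
```

Feeding it the output of `sysctl fs.inotify` on each node shows at a glance which nodes still need the bump.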
Same issue with: AWX 23.5.1, AWX Operator 2.9.0, K3s v1.27.2+k3s1
I'm experiencing the same issue on 23.3.0 and K8s 1.24. I will dig deeper to troubleshoot and capture some logs after the holidays.
Thanks!
I was on 1.27 and inotify increases didn't change a thing. Updating to 1.28 helped.
Hi, I have the same error on AWX 23.5.1. I restored the DB from an old k3s server; project sync works without problems, but all templates terminate with the error: Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''
Template OUTPUT: Worker output: ": "/tmp/awx_3386_b795lk3f", "JOB_ID": "3386", "INVENTORY_ID": "3", "PROJECT_REVISION": "e6d9f8f52adb8de7d052019a05326020c0d1cc4a", "ANSIBLE_RETRY_FILES_ENABLED": "False", "MAX_EVENT_RES": "700000", "AWX_HOST": "https://awx.xxx.xxx.loc", "ANSIBLE_SSH_CONTROL_PATH_DIR": "/runner/cp", "ANSIBLE_COLLECTIONS_PATHS": "/runner/requirements_collections:collections:~/.ansible/collections:/usr/share/ansible/collections", "ANSIBLE_ROLES_PATH": "/runner/requirements_roles:~/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles", "ANSIBLE_COLLECTIONS_PATH": "/runner/requirements_collections:~/.ansible/collections:/usr/share/ansible/collections", "ANSIBLE_CALLBACK_PLUGINS": "/usr/local/lib/python3.9/site-packages/ansible_runner/display_callback/callback", "ANSIBLE_STDOUT_CALLBACK": "awx_display", "AWX_ISOLATED_DATA_DIR": "/runner/artifacts/3386", "RUNNER_OMIT_EVENTS": "False", "RUNNER_ONLY_FAILED_EVENTS": "False"}, "cwd": "/runner/project"} {"status": "running", "runner_ident": "3386"}.
I tried several different EE releases, but the problem is the same. Any ideas? Bye, Mattia
@mattiaoui Upgrade your K3s to 1.28. Then follow this tip: https://github.com/ansible/awx/issues/14693#issuecomment-1861167020
Many thanks, after the upgrade all templates work 🤟🤟. Happy holidays and New Year!
In my case, a temporary "fix" was setting Debug (3) Verbosity on all my templates. After this change, all scheduled tasks were successful for about a week.
But during the upgrade to k8s 1.29 I noticed there was an issue with kube-proxy on the node running AWX, because it was stuck in a CrashLoopBackOff state. I applied the fix from https://github.com/kubernetes-sigs/kind/issues/2744#issuecomment-1127808069 and it fixed my problem.
It is basically the same as @marek1712 mentioned, but with other values.
I think I have found the issue. It seems that all task controller pods run 4 containers: redis, task (awx image), rsyslog (awx image) and ee (awx-ee image).
I noticed that the ee container in the pods was running awx-ee:latest. I changed that deployment to use the same awx-ee version as the awx deployment, and it resolved the problem for me.
I think it is something with the latest image running as a controller.
I did not make any changes to my sysctl inotify settings or anything else.
I am running version 23.2.0, and everything is working perfectly after matching the ee container to my running awx version.
I suspect it might be a bug introduced in ansible/awx-ee. I see the image was not updated between June and November, and then had some updates in November and December.
@chinochao it seems like a bug in ansible-runner (a Python package AWX uses). BTW, which awx-ee version do you use?
I am using 23.2.0 for awx and awx-ee in the deployment. For the EE in the AWX UI, I have latest configured. It seems the issue is using awx-ee:latest in the controller task containers.
I set control_plane_ee_image and the Execution Environment EE image to the same version, and tried both 23.2.0 and 23.3.1. The execution still fails with the JSON parse error.
The error is more likely to occur when the job contains a lot of hosts.
Can you provide the output from kubectl so we can make sure the awx-ee image is not using latest? Something like kubectl describe on one of the task containers.
the describe for awx-task pod
awx-demo-ee:
Container ID: containerd://51b3acb3d02caeb4462bd64a1b089006aea6652a78d83643d005ba3e0307c9b4
Image: quay.io/ansible/awx-ee:23.2.0
Image ID: quay.io/ansible/awx-ee@sha256:3695f4c0d11557da70fe8182ffe9046d5ce1df0770137f08f51705ab6c08ae05
Port: <none>
Host Port: <none>
Args:
/bin/sh
-c
if [ ! -f /etc/receptor/receptor.conf ]; then
cp /etc/receptor/receptor-default.conf /etc/receptor/receptor.conf
sed -i "s/HOSTNAME/$HOSTNAME/g" /etc/receptor/receptor.conf
fi
exec receptor --config /etc/receptor/receptor.conf
State: Running
Started: Wed, 03 Jan 2024 12:15:30 +0800
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 64Mi
Environment: <none>
and the describe for the job pod
Containers:
worker:
Container ID: containerd://f761224536cecb6311a6a08fcb05090ff7bec290c55f0b6b1a7504865b5a1aef
Image: quay.io/ansible/awx-ee:23.2.0
Image ID: quay.io/ansible/awx-ee@sha256:3695f4c0d11557da70fe8182ffe9046d5ce1df0770137f08f51705ab6c08ae05
Port: <none>
Host Port: <none>
Args:
ansible-runner
worker
--private-data-dir=/runner
State: Running
Started: Wed, 03 Jan 2024 13:18:17 +0800
Ready: True
Restart Count: 0
Requests:
cpu: 250m
memory: 100Mi
Environment: <none>
Mounts: <none>
Can you check the logs in your awx-task pod for the awx-demo-ee container? I just tried with the controller using 23.2.0 and the EE with latest, 23.2.0 and even a custom EE, and it works.
$ kubectl -n awx-nonprod describe pod awx-nonprod-task-7c8888487-wfc27|grep "Image:" -B 2
init:
Container ID: docker://755693739eba50d8206a4d088d5ba876bdfe11591cfe363488e2182c5d25a435
Image: quay.io/ansible/awx-ee:latest
--
redis:
Container ID: docker://2a1645f2d203eb6bd41566bf5e6694543f8324c3d2940f589bb9f41b3755f6b7
Image: docker.io/redis:7
--
awx-nonprod-task:
Container ID: docker://41d2dca0a4a2eb280c98bf2ef02a5d0069c60344ac60fe747e21820d2b75cf35
Image: quay.io/ansible/awx:23.2.0
--
awx-nonprod-ee:
Container ID: docker://5568a15a85e738e00dbc5244946363c1be8a6f75f43f38b42e73338eb2868c3e
Image: quay.io/ansible/awx-ee:23.2.0
--
awx-nonprod-rsyslog:
Container ID: docker://bff7db078e945debeb5c0dd7918168285bff089ca8d30909f8cb544a75ecf4c1
Image: quay.io/ansible/awx:23.2.0
Here is the Job with status Completed.
Containers:
worker:
Container ID: docker://28a7f2f7fa86d9a1bc5aaadfc20543bd8c3d9e4ce7b599f15102ffbc6b4fca50
Image: quay.io/ansible/awx-ee:latest
Image ID: docker-pullable://quay.io/ansible/awx-ee@sha256:8b612e803b0ea93107d81bf56a2472f6f0950f102263859e385684ba5e3e7bee
Port: <none>
Host Port: <none>
Args:
ansible-runner
worker
--private-data-dir=/runner
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 03 Jan 2024 13:17:06 -0600
Finished: Wed, 03 Jan 2024 13:17:17 -0600
Ready: False
Restart Count: 0
Requests:
cpu: 250m
memory: 100Mi
Environment: <none>
Mounts: <none>
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes: <none>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 36m default-scheduler Successfully assigned awx-nonprod/automation-job-589816-9c4hn to NODE1
Normal Pulling 36m kubelet Pulling image "quay.io/ansible/awx-ee:latest"
Normal Pulled 35m kubelet Successfully pulled image "quay.io/ansible/awx-ee:latest" in 54.920892919s
Normal Created 35m kubelet Created container worker
Normal Started 35m kubelet Started container worker
Hi @mis4s, can I ask where to apply these changes? On the host running the AWX containers? Do the changes also have to be applied to the worker nodes, since some containers will run on other nodes as needed?
@mxs-weixiong - yes, you apply it on the host. Worker nodes as well, I guess? https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
~~I successfully executed the same AWX deployment in a different Kubernetes (K8s) cluster. Based on this experience, I followed the suggestion made by @mis4s in their comment on this issue: https://github.com/ansible/awx/issues/14693#issuecomment-1861167020. I'm pleased to report that this approach solved my issue as well. Thank you for the valuable input.~~
Edit: the issue happened again, so this is not the solution for me
@chinochao Forgive me, as I'm a novice when it comes to Kubernetes - I am running AWX in my homelab on a K3s cluster.
I'm trying to follow the advice in your post and update the awx-task deployment to match the version of awx-ee, rather than using latest.
But whenever I update the deployment, something overrides my change, and it reverts back to latest. I can even see the task pods being spun up, and then torn down again a few seconds later.
I don't understand how I'm supposed to make this change - even if I simply scale the task deployment down to 0 replicas, it gets scaled back up to 1 on its own.
Any advice you can provide would be helpful :(
If you are using the AWX Operator to deploy AWX, it will revert any changes you make manually to the deployment. You will need to set control_plane_ee_image in the AWX custom resource (kind: AWX) to update your deployment.
You can read more about it at https://github.com/ansible/awx-operator/blob/07427be0b79411f5c8fd2ffd7caf4fd92fcce47a/docs/user-guide/advanced-configuration/default-execution-environments-from-private-registries.md?plain=1#L23
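Concretely, pinning the control plane EE via the operator would look roughly like this (the resource name and image tag are examples; control_plane_ee_image is the operator spec field referenced above):

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx            # placeholder name
spec:
  # Pin the control-plane EE container to a fixed version matching the AWX
  # image, instead of the default quay.io/ansible/awx-ee:latest.
  control_plane_ee_image: quay.io/ansible/awx-ee:23.2.0
```

Because the operator reconciles the task deployment from this spec, the change sticks, unlike editing the deployment directly.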
@CaptainStealthy Try checking "Execution Environments" under Administration in the left menu.
@mxs-weixiong on the nodes where AWX runs (task and worker/EE pods).
The Execution Environments page under Administration doesn't update the image your controller is running. The controller EE image is set via the deployment.
I ended up changing the inotify config on my K3s hosts, as outlined here: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files.
I was only getting this "too many open files" error for jobs that I queued using (any) custom EE - the default 'latest' AWX EE was working fine. Making the config change below solved it.
To increase these limits temporarily run the following commands on the host:
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
To make the changes persistent, edit the file /etc/sysctl.conf and add these lines:
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
Running AWX 23.5.1 on OpenShift (Kubernetes version v1.26.9+aa37255).
We encounter the same 'JSON parsing' error message for what seem to be jobs with lots of lines of output.
We are able to reproduce the issue with a job that produces ~6000 lines of playbook stdout, but haven't been able to get to the root of the problem.
Running this job, at some point AWX changes the status to the above-mentioned 'JSON parsing' error message and shows only a cut-off playbook output. By disabling Release Receptor Work and looking at the log output within the job container, we can find the complete job output, including the final {"eof": true}.
We have tried running the job with quay.io/ansible/awx-ee:latest as well as our own EE image builds, which did not have any noticeable effect on the outcome.
The above-mentioned approach of increasing inotify watches (https://github.com/ansible/awx/issues/14693#issuecomment-1878856760) does not seem to do the trick either (we are not able to test this ourselves).
Does anyone else have an idea what the root cause of this problem might be?
Running AWX 23.5.1 on microk8s - encountering the same issue with awx-ee 23.6.0 specifically with playbooks that create large volumes of output.
Running 23.5.1 on k3s 1.26.12 and the custom EE image nofusscomputing/ansible-ee:dev (both for task and job pods).
I was also seeing the same issue on jobs with large logs (roughly >5-6k lines). Implementing the inotify changes on ALL nodes appears to have fixed the issue, although time will tell.
Edit: interestingly, we hadn't experienced this issue with AWX 23.2.0. We updated from 23.2.0 to 23.5.1 (in one go) recently; only since then have we seen this behavior.
/edit: another noteworthy finding: we noticed that some jobs end with status 'Successful' but have a cut-off output in AWX as well. This is quite a big deal for us, since up until now we could distinguish based on status.
I have the same issue with 23.6.0 and 23.7.0.
What I noticed is that the GUI stops logging, but the job keeps on running. Looking at the logging of the execution environment (EE), I see the job still continues. In the end AWX exits with an error and the message: "Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''"
I ran a job several times with different EEs (even with the default awx-ee), but every time the result was that around line 2011 it stopped logging in the GUI. This differs per job template, though.
I set the inotify sysctl configuration, but that did not solve my issue.
BTW: running on a 3-node k8s environment.
Same issue with AWX 23.5.1.
/tmp/receptor/.../.../stdout in the task pod ends after about 50 MB.
To aid in the fault finding, as the issue is now occurring again: I started the job that keeps failing and ran kubectl logs from the CLI for both the automation pod and the task pod. Within AWX, as others have reported, the output stops updating after some time; however, the automation pod remains running, with output still coming to stdout. From the task pod, however, I noticed a stack trace.
For reference: 172.16.83.83 is the web pod, 172.16.100.1 is the Kubernetes API; the captured timestamps are 2024-02-03T02:39:20.260908 and 2024-02-03T02:55:50.588410.
Of note regarding the stack trace: contrary to its suggestion that the Kubernetes API could not be reached, on all nodes curl -k https://172.16.100.1/api/v1/namespaces does confirm connectivity. No other pods within the cluster have any connectivity issues with the API.
Does anyone else have similar stack traces in the task pod for a job that fails?
I started a job which is intermittently failing, ID 43635
Logs from the task pod just say the job encountered an error. The UI shows "Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''"
2024-02-05 08:03:06,729 INFO [-] awx.main.wsrelay Starting producer for 10.1.227.206-schedules-changed
2024-02-05 08:03:16,731 INFO [-] awx.main.wsrelay Producer 10.1.227.206-schedules-changed has no subscribers, shutting down.
2024-02-05 08:03:52,899 INFO [5c37d224fc934a34b13732fece0bc963] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 43634
2024-02-05 08:03:52,933 INFO [5c37d224fc934a34b13732fece0bc963] awx.analytics.job_lifecycle job-43634 post run
2024-02-05 08:03:52,998 INFO [5c37d224fc934a34b13732fece0bc963] awx.analytics.job_lifecycle job-43634 finish job fact cache
2024-02-05 08:03:53,784 INFO [-] awx.analytics.job_lifecycle job-43634 stats wrapup finished
2024-02-05 08:03:54,717 INFO [5c37d224fc934a34b13732fece0bc963] awx.analytics.job_lifecycle job-43634 finalize run
2024-02-05 08:04:01,502 INFO [-] awx.main.wsrelay Starting producer for 10.1.227.206-schedules-changed
2024-02-05 08:04:02,906 INFO [-] awx.main.wsrelay Producer 10.1.227.206-jobs-summary has no subscribers, shutting down.
2024-02-05 08:04:03,713 INFO [-] awx.main.wsrelay Producer 10.1.227.206-job_events-43634 has no subscribers, shutting down.
2024-02-05 08:04:30,528 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 waiting
2024-02-05 08:04:30,673 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 pre run
2024-02-05 08:04:30,771 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 start job fact cache
2024-02-05 08:04:31,508 INFO [-] awx.main.wsrelay Producer 10.1.227.206-schedules-changed has no subscribers, shutting down.
2024-02-05 08:04:32,014 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 preparing playbook
2024-02-05 08:04:32,139 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 running playbook
2024-02-05 08:04:32,356 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 work unit id received
2024-02-05 08:04:32,391 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 work unit id assigned
2024-02-05 08:04:33,546 INFO [-] awx.main.wsrelay Starting producer for 10.1.227.206-job_events-43635
2024-02-05 08:04:33,546 INFO [-] awx.main.wsrelay Starting producer for 10.1.227.206-jobs-summary
2024-02-05 08:13:35,623 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 43635
2024-02-05 08:13:35,634 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 post run
2024-02-05 08:13:35,747 INFO [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.analytics.job_lifecycle job-43635 finalize run
2024-02-05 08:13:35,765 WARNING [a0f04f4eefd44e7e89e8d65ab24d3fba] awx.main.dispatch job 43635 (error) encountered an error (rc=None), please see task stdout for details.
The automation container pod didn't log any errors to the pod logs, just the expected JSON log lines the last few of which appeared to be perfectly valid.
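When sifting through task-pod logs like the excerpt above, a quick filter for a single job id can help correlate the dispatch error with the last lifecycle event. A small grep-style sketch (the helper name is illustrative; the sample lines are taken from the log above):

```python
def job_timeline(log_text, job_id):
    """Return the log lines that mention the given job id, in order.
    Matches the 'job-<id>' (lifecycle), 'job <id>' (dispatch) and
    'Job <id>' (callback receiver) forms seen in AWX task-pod logs."""
    needles = (f"job-{job_id}", f"job {job_id}", f"Job {job_id}")
    return [line for line in log_text.splitlines()
            if any(n in line for n in needles)]

sample = """\
2024-02-05 08:13:35,634 INFO awx.analytics.job_lifecycle job-43635 post run
2024-02-05 08:13:35,747 WARNING awx.main.dispatch job 43635 (error) encountered an error (rc=None), please see task stdout for details.
"""
for line in job_timeline(sample, 43635):
    print(line)
```

Run against the full `kubectl logs` capture, this isolates the lifecycle of the failing job from the unrelated wsrelay chatter.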
Also having the same issue on AWX 23.6. Running a simple one-task playbook that runs uptime on 43 hosts. The inventory does have hosts with lots of groups/vars; wondering if this may be related.
Tried all the suggestions in here to no avail. It seems this is occurring on all 23.x versions?
Found that the issue pertains to having gather_facts enabled. If you set gather_facts: no, then the issue does not occur!
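For anyone trying that workaround, it is a one-line change at play level (an illustrative play echoing the uptime example above; hosts and task are placeholders):

```yaml
- hosts: all
  gather_facts: false   # workaround reported above; skips the fact-gathering step
  tasks:
    - name: Check uptime
      ansible.builtin.command: uptime
```

Note this only sidesteps the symptom by shrinking the event/variable volume; it doesn't address the underlying stream truncation.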
Running AWX 23.3.0 on K8s v1.24.17+rke2r1, an 11-node cluster of beefy nodes. Experiencing the same issue running a role that creates and adds keys to users on lots of hosts (~700), which generates a lot of log output.
- Tried different Execution Environments (one matching the running task image version) as well as the latest version.
- Tried increasing the inotify kernel values on one node and scheduling the automation pod on that node.
- Tried setting verbosity to high in the template.
- Tried limiting the target hosts to just 20 hosts (same problem); 10 hosts was fine, though.
- Tried disabling fact gathering and setting no_log on the tasks.
- Tried a combination of all of the above, multiple times.
None of which worked; still facing the same issue.
Interesting, so you added the line "gather_facts: no" to the playbook and it still errored? That's the only change I made, and a playbook that had failed every time continued without issue thereafter.
Yes, I explicitly set gather_facts to false. In your case, when you set gather_facts to false, it might be that you produced less log output (TASK [Gathering Facts] ***** ok.....) or that the hostvars and facts objects weren't populated with as many variables - in case the problem has to do with long log outputs or lots of hostvars, respectively.
In my case I have a lot of log output for each user I create on each host. Also, each host has an average of 40 variables assigned to it.
Wow, thanks everyone for providing information here.
I'm going to gather some related places where this particular error message and its discussion are happening, for reference.
We are looking into this.
https://forum.ansible.com/t/execution-environment-failed-to-run-any-job-on-awx-ee-latest-and-awx-ee-23-6-0/3291 https://forum.ansible.com/t/awx-job-terminated-unexpectedly/2256
I did some digging and found a "possibly" related bug: https://github.com/ansible/receptor/pull/939
For more detail on the debugging and the thought process behind it, see https://forum.ansible.com/t/awx-job-terminated-unexpectedly/2256/15?u=therealhaoliu
If anyone is willing to try out a patched awx-ee with the fix, please give quay.io/haoliu/awx-ee:receptor-939 a try.
Please confirm the following
(If you are reporting a security issue, email security@ansible.com instead.)
Bug Summary
I've encountered an issue with some of my jobs in AWX 23.3.1, and I’m hoping to gather insights or solutions from anyone who might have faced a similar problem.
Error Message
The execution fails with an error message on the details tab:
Environment
AWX Version: 23.3.1 Deployment Platform: Kubernetes Operator Version: 2.7.2
I'm inclined to think that this issue stems from a UI error during the process of logging data from the pod, although I'm not entirely certain.
Any Idea?
AWX version
23.3.1
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
ansible [core 2.16.0]
Operating system
Linux
Web browser
Firefox, Chrome
Steps to reproduce
Occurs randomly when executing some templates
Expected results
Finish the Ansible playbook execution without errors
Actual results
The jobs don't fail; they just finish the execution with the error:
Additional information
I tried using different Ansible EE versions (including the latest one)