It's not only the inventory that times out after 5 minutes: any job with long-running tasks (whether or not they produce output, e.g. a simple pause of more than 5 minutes) is killed after 5 minutes.
Can you find any related logs from around the time it timed out? What information do the API details show at /api/v2/inventory_updates/(id)/?
There is nothing in the job logs that shows any error. When running this simple playbook:

```yaml
---
- hosts: localhost
  connection: local
  gather_facts: no
  become: no
  tasks:
```
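The playbook in that comment is truncated; for reference, a hypothetical reproducer matching the description in this thread (a local play that simply pauses for more than 5 minutes) might look like the sketch below. The file name and the 6-minute value are arbitrary choices, not the commenter's original playbook.

```bash
# Hypothetical reproducer: a play that produces no output for longer than
# the 5-minute threshold discussed in this issue.
cat > pause_test.yml <<'EOF'
---
- hosts: localhost
  connection: local
  gather_facts: no
  become: no
  tasks:
    - name: Pause longer than five minutes
      ansible.builtin.pause:
        minutes: 6
EOF
# Add this playbook to a project and run it as a job template in AWX on the
# affected AKS cluster; running it locally with ansible-playbook will succeed.
```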
I now did an strace of the ansible-runner process in the execution environment and found that, after 5 minutes, it receives a SIGTERM.
I did some further stracing of the processes in the execution environment pod and found that the "init process" dumb-init is the one receiving the SIGTERM, so it looks like the pod is terminated from "outside". So the question remains: who is terminating my pod? Is it AWX or is it Kubernetes?
I got the same problem: all plays that take longer than 5 minutes (without posting any output) stop with the status Error.
Only thing in the log is:
awx.main.dispatch job 56100 (error) encountered an error (rc=None), please see task stdout for details.
I also tested it with the `pause` module, same result.
It seems that if there is no output to AWX from the playbook/pod for 5 minutes, some kind of timeout mechanism kicks in and the pod gets terminated. I think these 'issues', or at least some of the comments, are also related to this:
I have the k8s cluster in AKS. I checked the load balancer: the idle timeout is set to the default (30 minutes). I checked the AWX settings: Default Job Idle Timeout is set to 0 (I also tried a higher number, 1800).
Looks like this may be somehow related to AKS; my AWX is also running in AKS. What Kubernetes version are you running? I'm now on 1.22.6 and, though I'm not totally sure, we didn't have this issue with 1.21.7.
We are using version 1.23.x, recently upgraded, but we also had this issue before the upgrade; it could be that this issue started with version 1.22.x.
I have checked all the settings in AKS; the only timeout settings are the ones I already mentioned, and they are set well beyond the 5-minute threshold. Maybe you could check the timeout settings of the LB in the AKS resource group, as mentioned here: https://github.com/ansible/awx/issues/12297#issuecomment-1151234788
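For reference, one way to check that idle timeout from the CLI is sketched below; the resource group and load balancer names are the AKS defaults and may differ in your environment.

```bash
# The AKS-managed outbound load balancer is normally named "kubernetes" and
# lives in the cluster's MC_* node resource group; substitute your own names.
az network lb outbound-rule list \
  --resource-group MC_myResourceGroup_myAKSCluster_westeurope \
  --lb-name kubernetes \
  --query '[].{rule: name, idleTimeoutInMinutes: idleTimeoutInMinutes}' \
  --output table
```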
I was interested in this issue, so I set up a single-node AKS cluster to try it out, and sure enough, the issue reproduced. The automation-job pod gets killed around 5 minutes after it starts.
$ kubectl -n awx describe pod automation-job-4-vqsht
Name: automation-job-4-vqsht
Namespace: awx
Priority: 0
Node: aks-agentpool-13146441-vmss000000/10.224.0.4
Start Time: Wed, 20 Jul 2022 12:55:24 +0000
Labels: ansible-awx=12e4dc0b-28c5-418d-b7a8-30803b17dfad
ansible-awx-job-id=4
Annotations: <none>
Status: Terminating (lasts <invalid>)
Termination Grace Period: 30s
IP: 10.244.0.14
IPs:
IP: 10.244.0.14
Containers:
worker:
Container ID: containerd://72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a
Image: quay.io/ansible/awx-ee:latest
Image ID: quay.io/ansible/awx-ee@sha256:833fcdf5211265040de6c4509130f04555f8a48548620172b73dfad949d6ea0c
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m10s default-scheduler Successfully assigned awx/automation-job-4-vqsht to aks-agentpool-13146441-vmss000000
Normal Pulling 5m9s kubelet Pulling image "quay.io/ansible/awx-ee:latest"
Normal Pulled 5m9s kubelet Successfully pulled image "quay.io/ansible/awx-ee:latest" in 782.62126ms
Normal Created 5m9s kubelet Created container worker
Normal Started 5m9s kubelet Started container worker
Normal Killing 1s kubelet Stopping container worker
In the kubelet log on the AKS node, "superfluous" http responses are logged just before the termination of the automation-job pod starts.
Jul 20 13:00:04 aks-agentpool-13146441-vmss000000 update_certs.sh[26710]: done.
Jul 20 13:00:04 aks-agentpool-13146441-vmss000000 systemd[1]: Started Updates certificates copied from AKS DS.
Jul 20 13:00:31 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:00:31.682822 18003 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:00:33.518567 18003 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-4-vqsht]
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:00:33.518822 18003 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-4-vqsht" podUID=7632b872-b410-47d4-aac5-0c5bf4d718e8 containerName="worker" containerID="containerd://72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a" gracePeriod=30
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:00:33.519637681Z" level=info msg="StopContainer for \"72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a\" with timeout 30 (s)"
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:00:33.520551199Z" level=info msg="Stop container \"72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a\" with signal terminated"
...
...
...
Jul 20 13:13:09 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:13:09.452886 18003 logs.go:319] "Finished parsing log file" path="/var/log/pods/kube-system_kube-proxy-cd77k_2bb740b0-2a76-47f6-a11f-e376e1541e3b/kube-proxy/0.log"
Jul 20 13:13:27 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:13:27.836405 18003 kubelet_pods.go:898] "Unable to retrieve pull secret, the image pull may not succeed." pod="awx/awx-operator-controller-manager-86c5c8cb86-cd65l" secret="" err="secret \"redhat-operators-pull-secret\" not found"
Jul 20 13:13:58 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:13:58.528313 18003 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:14:00.579829 18003 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-5-ccmfl]
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:14:00.580066 18003 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-5-ccmfl" podUID=61f7a7b1-2789-4719-a399-b7783c7db6ea containerName="worker" containerID="containerd://4e538d3ec29f5852cee05d86735293d2d31a2180761b69fcf1a5a700b8ebcf40" gracePeriod=30
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:14:00.580732530Z" level=info msg="StopContainer for \"4e538d3ec29f5852cee05d86735293d2d31a2180761b69fcf1a5a700b8ebcf40\" with timeout 30 (s)"
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:14:00.581736647Z" level=info msg="Stop container \"4e538d3ec29f5852cee05d86735293d2d31a2180761b69fcf1a5a700b8ebcf40\" with signal terminated"
...
...
...
Jul 20 13:20:04 aks-agentpool-13146441-vmss000000 update_certs.sh[17902]: done.
Jul 20 13:20:04 aks-agentpool-13146441-vmss000000 systemd[1]: Started Updates certificates copied from AKS DS.
Jul 20 13:20:17 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:20:17.150600 18003 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:20:19.182475 18003 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-6-cwl4q]
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:20:19.182738 18003 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-6-cwl4q" podUID=0f79b2be-b074-4f8e-a85c-fe0f8945bfb5 containerName="worker" containerID="containerd://18343fc7c8bb8b3fe3f4be625e1011c9998337715eda38cd6d3a99eda1585a16" gracePeriod=30
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:20:19.183387158Z" level=info msg="StopContainer for \"18343fc7c8bb8b3fe3f4be625e1011c9998337715eda38cd6d3a99eda1585a16\" with timeout 30 (s)"
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:20:19.184326376Z" level=info msg="Stop container \"18343fc7c8bb8b3fe3f4be625e1011c9998337715eda38cd6d3a99eda1585a16\" with signal terminated"
...
And after increasing the log level for the kubelet, responses with `resp=400` for `/containerLogs/awx/automation-job-8-vhnqn/worker?follow=true` are recorded.
Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.468573 27749 shared_informer.go:270] caches populated
Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.472238 27749 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.472287 27749 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-8-vhnqn/worker?follow=true" latency="5m5.602241567s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.11:39456" resp=400
Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.569091 27749 shared_informer.go:270] caches populated
...
...
...
Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.113706 23353 shared_informer.go:270] caches populated
Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.148243 23353 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.148298 23353 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-9-hbz7v/worker?follow=true" latency="5m5.387031969s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.11:54404" resp=400
Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.213989 23353 shared_informer.go:270] caches populated
...
My understanding is that this log stream is opened by Receptor to follow the output from the pod.
I still don't know the root cause of this 400 from the kubelet, but is it possible that Receptor got the 400 from the API, which disconnected Receptor's stream and led to the termination of the job?
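If you want to watch for the same symptom without AWX in the loop, following the logs of a quiet job pod through the API server should, if this theory is right, be cut in the same way (the pod name below is just an example taken from the logs above).

```bash
# If the follow stream really is being dropped somewhere between the API
# server and the kubelet, this command should return after roughly 5 minutes
# of silence even though the pod keeps running; `time` shows the duration.
time kubectl -n awx logs --follow automation-job-8-vhnqn -c worker
```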
F.Y.I. for digging:
To create an interactive shell connection to an AKS node (https://docs.microsoft.com/en-us/azure/aks/node-access):

```bash
kubectl debug node/<node> -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0
chroot /host
bash
```
Increasing the log level for the kubelet on the AKS node by adding the `--v=4` flag to its unit file:

```bash
# vi /etc/systemd/system/kubelet.service
...
ExecStart=/usr/local/bin/kubelet \
        --enable-server \
        --node-labels="${KUBELET_NODE_LABELS}" \
        --v=4 \
        --volume-plugin-dir=/etc/kubernetes/volumeplugins \
        $KUBELET_TLS_BOOTSTRAP_FLAGS \
        $KUBELET_CONFIG_FILE_FLAGS \
...
# systemctl daemon-reload
# systemctl restart kubelet
```
I've already looked at the kubelet logs, but I'm not sure that message is related to our issue, because this "superfluous" http response is also logged for successful jobs...
I guess every playbook in AWX gets this http response before the automation pod is killed.
The question is: what is sending this after the 5 minutes (timeout)? Since this only happens with AKS, it has to be something within Azure...?
@kurokobo, thank you for your time, confirmation and investigation! May I ask what version of k8s you were running in AKS?
@cmasopust @Parkhost Oh, thanks for letting me know. Sorry for my lack of investigation; you're right, I get the same logs on jobs that succeed. However, I am still a little suspicious of this log, since the order of the logs seems to differ between success and failure.
In the successful case, a "superfluous" response is recorded after the termination process for the container and pod has been initiated.
13:20:05.859584 4474 manager.go:1048] Destroyed container: "/kubepods/burstable/podb86aae97-f997-4bf5-a722-9d6a306b1c43/b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22" (aliases: [b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22 /kubepods/burstable/podb86aae97-f997-4bf5-a722-9d6a306b1c43/b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22], namespace: "containerd")
...
13:20:06.206528 4474 generic.go:296] "Generic (PLEG): container finished" podID=b86aae97-f997-4bf5-a722-9d6a306b1c43 containerID="b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22" exitCode=0
...
13:20:06.206725 4474 kubelet_pods.go:1459] "Got phase for pod" pod="awx/automation-job-4-k4k5p" oldPhase=Running phase=Succeeded
13:20:06.206779 4474 kubelet.go:1546] "syncPod exit" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 isTerminal=true
13:20:06.206796 4474 pod_workers.go:978] "Pod is terminal" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 updateType=0
13:20:06.206809 4474 pod_workers.go:1023] "Pod indicated lifecycle completed naturally and should now terminate" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43
...
13:20:06.228010 4474 status_manager.go:682] "Patch status for pod" pod="awx/automation-job-4-k4k5p" patch="{\"metadata\":{\"uid\":\"b86aae97-f997-4bf5-a722-9d6a306b1c43\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"reason\":\"PodCompleted\",\"type\":\"Initialized\"},{\"lastTransitionTime\":\"2022-07-21T13:20:06Z\",\"reason\":\"PodCompleted\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2022-07-21T13:20:06Z\",\"reason\":\"PodCompleted\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"containerd://b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22\",\"image\":\"quay.io/ansible/awx-ee:latest\",\"imageID\":\"quay.io/ansible/awx-ee@sha256:cd8c98f825884cfb2e7765842755bde165d22bb3bb8f637646e8483391af9921\",\"lastState\":{},\"name\":\"worker\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"containerd://b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22\",\"exitCode\":0,\"finishedAt\":\"2022-07-21T13:20:05Z\",\"reason\":\"Completed\",\"startedAt\":\"2022-07-21T13:18:59Z\"}}}]}}"
...
13:20:06.228577 4474 kubelet.go:2117] "SyncLoop RECONCILE" source="api" pods=[awx/automation-job-4-k4k5p]
13:20:06.528412 4474 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
13:20:06.528468 4474 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-4-k4k5p/worker?follow=true" latency="1m6.436194772s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.12:49648" resp=400
13:20:06.536103 4474 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-4-k4k5p]
...
13:20:07.208846 4474 kubelet.go:1785] "syncTerminatingPod enter" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43
13:20:07.212456 4474 kubelet_pods.go:1447] "Generating pod status" pod="awx/automation-job-4-k4k5p"
...
13:20:07.213146 4474 kubelet.go:1817] "Pod terminating with grace period" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 gracePeriod=<nil>
However, in the failing cases, the "superfluous" response comes at the beginning of the pod termination. After this log, the pod is marked for graceful deletion.
13:28:13.144435 4474 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
13:28:13.144497 4474 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-6-rhs8w/worker?follow=true" latency="5m5.473488996s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.11:33806" resp=400
...
13:28:14.755714 4474 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-6-rhs8w]
13:28:14.755747 4474 pod_workers.go:625] "Pod is marked for graceful deletion, begin teardown" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755790 4474 pod_workers.go:888] "Processing pod event" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 updateType=1
13:28:14.755806 4474 pod_workers.go:1005] "Pod worker has observed request to terminate" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755819 4474 kubelet.go:1785] "syncTerminatingPod enter" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755831 4474 kubelet_pods.go:1447] "Generating pod status" pod="awx/automation-job-6-rhs8w"
13:28:14.755870 4474 kubelet_pods.go:1459] "Got phase for pod" pod="awx/automation-job-6-rhs8w" oldPhase=Running phase=Running
13:28:14.756073 4474 kubelet.go:1815] "Pod terminating with grace period" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 gracePeriod=30
13:28:14.756148 4474 kuberuntime_container.go:719] "Killing container with a grace period override" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 containerName="worker" containerID="containerd://0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" gracePeriod=30
13:28:14.756165 4474 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 containerName="worker" containerID="containerd://0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" gracePeriod=30
...
13:28:14.763017 4474 kubelet_pods.go:939] "Pod is terminated, but some containers are still running" pod="awx/automation-job-6-rhs8w"
...
13:28:18.864941 4474 manager.go:1048] Destroyed container: "/kubepods/burstable/podd51c5853-65d4-4823-8ea5-27a89272af18/0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" (aliases: [0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac /kubepods/burstable/podd51c5853-65d4-4823-8ea5-27a89272af18/0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac], namespace: "containerd")
For anyone interested in this, I attached the full logs (/var/log/messages) from the AKS node, including the kubelet at logging level --v=4.
This is the pod info for my jobs; they can be found in the log.
@Parkhost I confirmed this using 1.22.11. Agreed, I think this is an AKS-related thing...
It seems that AKS, at least a few years ago, had a behavior that terminated connections to the API server if no data was sent for a certain period of time, even within the cluster: https://github.com/Azure/AKS/issues/1755 I am not familiar with AKS, but a similar restriction may have been added in recent AKS versions...?
In fact, running `ansible.builtin.pause` with `minutes: 10` reproduces this issue, but repeating a task with `minutes: 1` ten times in one job does not.
Thus, a workaround would seem to be to have the EE pod periodically send some data to the stream, like a keep-alive.
I modified the entrypoint of the EE to keep `sleep 60` and `echo 'KEEP JOB ALIVE' > /proc/1/fd/1` running in the background (I know it's a dirty way), and my job with a long `pause` is no longer interrupted, but sadly the exit status of the job is incorrect. Not a good workaround.
Can you show me how you modified the entrypoint? I tried something similar: I just connected to the pod and ran the echo command there, but... it is the exit status of the task that is then incorrect, and therefore the whole playbook stops (have you tried having an additional task following the pause task in the playbook?).
I didn't expect this to work: I created a new AKS cluster with version 1.24, and it seems the automation pod doesn't terminate at the 5-minute timeout. I'm skeptical about it; I don't think MS even knows this bug exists in versions 1.22 and 1.23, and as 1.24 is still in preview (in Azure) I'm not sure it is a valid option... maybe it breaks other things? On the other hand, AWX isn't meant to run in production either.
Some proof:
Cluster info:
NAME STATUS ROLES AGE VERSION OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-offload-38942402-vmss00000d Ready agent 2d20h v1.24.0 Ubuntu 18.04.6 LTS 5.4.0-1085-azure containerd://1.6.4+azure-4
Pod info:
Name: automation-job-7-8n5rr
Namespace: awx
Priority: 0
Node: aks-offload-38942402-vmss00000d/10.1.0.7
Start Time: Fri, 22 Jul 2022 10:40:46 +0200
Labels: ansible-awx=8b9bae95-621f-414d-bc52-4cdf3a010e99
ansible-awx-job-id=7
Annotations: <none>
Status: Running
IP: 192.168.100.13
IPs:
IP: 192.168.100.13
Containers:
worker:
Container ID: containerd://fbfeac1d07d32c8cb5587ea7b200ff5ae712e67a50c33fc27317d28b03ffd077
Image: quay.io/ansible/awx-ee:latest
Image ID: quay.io/ansible/awx-ee@sha256:181ffd50d3cc08599d1452c5eb6624e217c7ff04a5741fa970c37c2905e1a2c8
Port: <none>
Host Port: <none>
Args:
ansible-runner
worker
--private-data-dir=/runner
State: Running
Started: Fri, 22 Jul 2022 10:40:47 +0200
Ready: True
Restart Count: 0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 31m default-scheduler Successfully assigned awx/automation-job-7-8n5rr to aks-offload-38942402-vmss00000d
Normal Pulling 31m kubelet Pulling image "quay.io/ansible/awx-ee:latest"
Normal Pulled 31m kubelet Successfully pulled image "quay.io/ansible/awx-ee:latest" in 522.434807ms
Normal Created 31m kubelet Created container worker
Normal Started 31m kubelet Started container worker
The pod is still going strong; guess it's gonna reach the magic number of 69 :D
@kurokobo, how did you capture the 'describe' of the pod after it was stopped/terminated? I tried to get the describe output after the play finished, but kubectl errors because the pod doesn't exist anymore. Would you share your kubectl voodoo?
@cmasopust
It seems that the string that gets `echo`ed must be JSON that can be processed by the Runner on awx-task, e.g. `{"uuid": "keepalive", "counter": 0}`.
Finally, I was able to get a long job (with multiple tasks after `pause`) to work with the following hack.
However, I think this is an unreliable workaround. I haven't tested it enough, but if the original output from the Runner and the output from `echo` happen at the same time, the two JSON lines may get mixed up, and that causes the job to fail. I believe this is not the smartest way, but it seems to be a workaround that works anyway.
Here is my workaround:
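The workaround itself is not reproduced in this excerpt; a minimal sketch of the idea described above (a wrapper entrypoint that emits a Runner-compatible keep-alive line while the real command runs) could look like the following. The payload matches the JSON mentioned above; everything else (how the wrapper is wired into the custom pod spec) is an assumption, not the commenter's actual script.

```bash
#!/bin/bash
# Sketch only. Assumption: this wrapper replaces the container command in the
# Container Group's custom pod spec, and the original job command (e.g.
# `ansible-runner worker --private-data-dir=/runner`) is passed as arguments.

# Background loop: write a keep-alive JSON line to the container's stdout
# (PID 1's fd 1) every 60 seconds so the log stream never stays idle for
# the fatal 5 minutes.
while true; do
  sleep 60
  echo '{"uuid": "keepalive", "counter": 0}' > /proc/1/fd/1
done &

# Hand off to the real job command; the background loop dies with the container.
exec "$@"
```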
Ideally, something like keep-alive would be implemented in Ansible Runner so that some string is always safely output at regular intervals.
@Parkhost Thanks for testing on 1.24. I am surprised too that it works with 1.24.
It works well on AWS and on-prem K3s, so I don't think we can say it's not suitable for production; there should be many AKS users, though. I have no idea, but does Ansible Automation Controller behave the same way?
> how did you capture the 'describe' of the pod after it is stopped / terminated
Nothing special and no trick: simply invoke the `kubectl describe` command repeatedly at high frequency. If the timing is right, you get the data of the pod while it is in the `Terminating` state.
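A small sketch of that polling approach (the pod name is an example; adjust the namespace and name to your job):

```bash
# Overwrite describe.txt every second while the pod still exists; once the
# pod is gone and describe fails, the file holds the last capture, which is
# usually the Terminating state.
while kubectl -n awx describe pod automation-job-4-vqsht > describe.txt 2>/dev/null; do
  sleep 1
done
cat describe.txt
```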
I upgraded the 'production' AKS cluster from 1.23.8 to 1.24.0, but unlike my newly created cluster it didn't solve the issue (still an error after roughly 5 minutes).
Maybe it has something to do with the network profile settings?
My "timeout/failing" cluster is using: Type (plugin): Azure CNI, Network policy: Azure.
My "success" cluster is using: Type (plugin): Kubenet, Network policy: None.
Going to try it out over the weekend.
@Parkhost I don't think it's related to the Kubernetes network. We first had a cluster with Azure CNI and Kubernetes version 1.21.x where no timeout happened. Then our Azure team redeployed the cluster with Kubenet and version 1.22.6 and we now have the timeout.
@kurokobo: thanks a lot for providing detailed instructions (for someone who's at the beginning of his Kubernetes journey).
I've implemented your workaround and can confirm that my long-running job now succeeds. But when looking at the awx-task logs, I see the following errors each time the pod sends the 'keep-alive':
```
2022-07-23 08:49:53,444 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2022-07-23 08:49:53,444 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2022-07-23 08:49:53,460 ERROR [-] awx.main.commands.run_callback_receiver Callback Task Processor Raised Exception: ZeroDivisionError('float division by zero')
2022-07-23 08:49:53,460 ERROR [-] awx.main.commands.run_callback_receiver Callback Task Processor Raised Exception: ZeroDivisionError('float division by zero')
2022-07-23 08:49:53,460 ERROR [-] awx.main.commands.run_callback_receiver Detail: Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/callback.py", line 258, in perform_work
    self.flush(force=flush)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/callback.py", line 200, in flush
    metrics_total_job_event_processing_seconds.total_seconds()
ZeroDivisionError: float division by zero
2022-07-23 08:49:53,460 ERROR [-] awx.main.commands.run_callback_receiver Detail: Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/callback.py", line 258, in perform_work
    self.flush(force=flush)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/callback.py", line 200, in flush
    metrics_total_job_event_processing_seconds.total_seconds()
ZeroDivisionError: float division by zero
```
So I think we should try to find out what the receiver is expecting; maybe there's some no-op data we can send to it. I'll try to find out more over the weekend.
Looks like adding `"event": "FLUSH"` to the JSON data no longer produces the errors in awx-task, and the playbook still succeeds.
A quick look at the AWX code shows that an event of type FLUSH skips any further processing of the received data and therefore produces no exception (see awx/main/dispatch/worker/callback.py).
But whether that's really a good idea, or the right event to send for a keep-alive, can only be answered by the AWX devs.
I have some bad news: although the workaround now works perfectly well for playbook runs, it breaks the inventory updates! So only half of my problems are solved.
@cmasopust
Thanks for the update and good finding on `FLUSH`. I confirmed that the `echo` breaks Inventory Sync, and found that appending `"end_line": 0` makes it work:
`echo '{"event": "FLUSH", "uuid": "keepalive", "counter": 0, "end_line": 0}' > /proc/1/fd/1`
I've updated the entrypoint script in my comment above: https://github.com/ansible/awx/issues/12530#issuecomment-1192616101
Also, regarding my concern from my previous comment:
> However, I think this is an unreliable workaround. I haven't tested it enough, but if the original output from the Runner and the output from `echo` happen at the same time, the two JSON lines may get mixed up and cause the job to fail.
Removing the sleep and allowing as many repeated echoes as possible did not corrupt the output JSON (tried dozens of times).
So, I believe it's not the smartest way, but it seems to be a workaround that works anyway.
Of course, the best way to handle this would be for AKS to make some changes to support long-lived streams. Hope @Parkhost finds something.
@kurokobo you're my hero! I can confirm that the inventory update is now working in my AWX too. Thanks a lot for your great help!!
We also have this issue on our AKS test (running Kubernetes 1.24) and production (running Kubernetes 1.23.8) clusters. We applied the workaround supplied by @kurokobo, and I can also confirm that our jobs now finish as expected. Great! We are running Ansible AWX 21.0.0 with Ansible core 2.13.1.
AWX: 21.3.0
Kubernetes Client Version: v1.24.2
Kubernetes Server Version: v1.24.3
Kustomize Version: v4.5.4
Create new Container Group and put this YAML as custom pod spec.
I got very confused here, as I am new both to Kube and AWX, and tried creating a hotfix.yaml and applying it via kubectl. That's completely wrong; what's actually needed is to set up a new Instance Group under the AWX Administration tab and make sure the templates/syncs use the new instance group.
Unfortunately for me, that did not resolve my issue, probably because in my case the automation-job does not get killed after 5 minutes; it just breaks after 5-6.5 minutes and gets killed about 2 minutes after that.
While in the 'broken' state it stops outputting to the AWX console (at a random point in the playbook each time), but it still seems to run the playbook for a while when I observe the process list on the Kubernetes host.
@C0rn3j Hi, according to the image that you attached before, your job seems to have a lot of output, over 4000 lines, so I guess it is a log size issue.
AWX has an issue with jobs aborting if kubelet log rotation occurs while the job is running: https://github.com/ansible/awx/issues/10366 and https://github.com/ansible/awx/issues/11338
The kubelet in AKS has a default log size limit of 10 MB, so if the log of the automation-job pod exceeds this, the job will be terminated.
You can check the actual log size under `/var/log/pods/<pod>/<container>/*.log` on your AKS node by following this documentation, and you can increase `containerLogMaxSizeMB` by recreating the node pool with a `kubeletconfig.json`.
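For reference, a sketch of how raising that limit with a custom kubelet configuration for a new node pool could look; the resource group, cluster and pool names as well as the 50 MB value are placeholders.

```bash
# containerLogMaxSizeMB is one of the kubelet settings AKS exposes through
# custom node configuration; it applies to node pools created with it.
cat > kubeletconfig.json <<'EOF'
{
  "containerLogMaxSizeMB": 50
}
EOF

az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name biglogs \
  --kubelet-config ./kubeletconfig.json
```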
In this (longish) comment I will address the 5 minute timeout issue in AKS only! (Not the log size issue, or the 4 hours timeout issue.)
As Microsoft partners, we have escalated this issue and engaged engineers to identify the root cause and discuss options. The issue relates to the method by which the Azure managed control plane is made available to the customer managed vnet containing the worker nodes and pods. In previous versions they have relied on tunneld/openvpn to build the tunnel to the vnet, but upon recommendation from the k8s community, they are now using Konnectivity. This issue is AKS specific because they added some custom code to add a 5 minute timeout as a precaution to avoid leaking connections. As this Github issue shows, this issue is very difficult for end-user/developers to diagnose, as the problem appears to be with the API server, but it is actually with the tunnel to the API server, which is not visible to customers.
A Microsoft engineer has confirmed this diagnosis with this Pause playbook: noting that it fails after 5 minutes, then adjusting the Konnectivity timeout to 7 minutes and confirming that it then fails at 7 minutes.
The feedback we received was that adjusting this timeout would have performance impact across clusters using Konnectivity, and therefore will not be done. The AKS Konnectivity feature owners suggest that AWX should be sending keepalives and/or gracefully handling terminated http2 streams instead.
Alternatively, the API Server VNet Integration preview feature is available (albeit currently only in USA regions), which instead of setting up a tunnel:
> ...projects the API server endpoint directly into a delegated subnet in the VNet where AKS is deployed. This enables network communication between the API server and the cluster nodes without any required private link or tunnel.
The Microsoft engineer has confirmed with the same Pause playbook that the issue does not occur when AKS is deployed with this preview feature.
Additional links from our ticket:
@kurokobo thank you for your workaround. You saved me! :)
I can confirm that it is working on AWX version 21.6.0 and AKS 1.23.12.
@kurokobo I also thank you for your workaround. It was a life saver. But I upgraded to AKS 1.25.4 and AWX version 21.11.0, and the problem seems to be resolved now.
A native periodic keep-alive message has been implemented in Ansible Runner 2.3.2, and AWX has been updated to use it in 21.14.0.
To activate the keep-alive message:
- use `quay.io/ansible/awx-ee:latest`, or `quay.io/ansible/awx-ee:21.14.0` or later, as the EE for the job
- set `K8S Ansible Runner Keep-Alive Message Interval` under Settings > Job settings in AWX
@AlanCoding @nitzmahone Thanks for implementing that feature. I believe this issue can be closed now.
Please confirm the following
Bug Summary
In my AWX I have an inventory that is synchronized from my Katello server; for some time now the synchronization has taken approx. 9 minutes (tested offline with the ansible-inventory command).
When I now try to synchronize the inventory in AWX, the job starts but is killed after 5 minutes.
I already changed the "Default Inventory Update Timeout" to 900 s, but the synchronization job still terminates after 5 minutes.
AWX version
21.3.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
2.13.1
Operating system
No response
Web browser
Chrome
Steps to reproduce
Synchronize the Satellite (Katello) repository; Katello contains approx. 500 hosts
Expected results
Repository synchronized successfully
Actual results
Synchronization is terminated after 5 minutes
Additional information
No response