ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Jobs with long running tasks and Inventory sync terminating after 5 minutes #12530

Closed cmasopust closed 1 year ago

cmasopust commented 2 years ago

Bug Summary

In my AWX I have an inventory that is synchronized from my Katello server; for some time now the synchronization has taken approximately 9 minutes (verified offline with the ansible-inventory command).

When I try to synchronize the inventory in AWX, the job starts but is killed after 5 minutes.

I have already increased the "Default Inventory Update Timeout" to 900 s, but the synchronization job still terminates after 5 minutes.

AWX version

21.3.0

Installation method

kubernetes

Modifications

no

Ansible version

2.13.1

Operating system

No response

Web browser

Chrome

Steps to reproduce

Synchronize the Satellite (Katello) inventory; Katello contains approximately 500 hosts

Expected results

Inventory synchronized successfully

Actual results

Synchronization is terminated after 5 minutes

Additional information

No response

cmasopust commented 2 years ago

It's not only the inventory sync that times out after 5 minutes: any job with long-running tasks (whether or not they produce output, e.g. a simple pause for more than 5 minutes) is killed after 5 minutes.

AlanCoding commented 2 years ago

Can you find any related logs from around the time it timed out? What information do the API details show at /api/v2/inventory_updates/(id)/?
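
For reference, those details can be pulled straight from the API; a minimal sketch, assuming token authentication and placeholder host/ID values:

```bash
# Fetch the API details for a specific inventory update (host, token, and ID are placeholders).
AWX_HOST="https://awx.example.com"
TOKEN="<oauth2-token>"
UPDATE_ID=123

curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "${AWX_HOST}/api/v2/inventory_updates/${UPDATE_ID}/" | python3 -m json.tool
```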

cmasopust commented 2 years ago

There is nothing in the job logs that shows any error, even when running a simple test playbook.
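
A minimal playbook along these lines is enough to reproduce the symptom, since it produces no output for more than 5 minutes (a sketch; the exact playbook used above was not captured, and the file name is assumed):

```bash
# Hypothetical reproducer: a playbook whose only task stays silent for >5 minutes.
cat > pause_test.yml <<'EOF'
---
- name: Reproduce the 5-minute stream timeout
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Stay silent for longer than 5 minutes
      ansible.builtin.pause:
        minutes: 6
EOF
```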

I then ran strace on the ansible-runner process in the execution environment and found that, after 5 minutes, it receives a SIGTERM.

cmasopust commented 2 years ago

Some further strace runs on the processes in the execution environment pod showed that the "init process" dumb-init is receiving the SIGTERM signal, so it looks like the pod is terminated from "outside". So the question remains: who is terminating my pod? Is it AWX or Kubernetes?
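
One way to narrow this down from outside the pod is to watch the Kubernetes events and the pod object while the job runs; a sketch, assuming kubectl access and a placeholder pod name:

```bash
# Watch events in the awx namespace while the job runs; the kubelet's "Killing"
# event for the automation-job pod shows up here when the deletion is processed.
kubectl -n awx get events --watch &

# Keep polling the pod so the Terminating state is captured before the object
# disappears (pod name is a placeholder).
while kubectl -n awx get pod automation-job-4-vqsht -o wide; do sleep 5; done
```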

rudesome commented 2 years ago

I got the same problem: all plays that take longer than 5 minutes (without posting any output) stop with the status Error.

The only thing in the log is: awx.main.dispatch job 56100 (error) encountered an error (rc=None), please see task stdout for details.

I also tested it with the pause module; same result.

It seems that if there is no output sent to AWX from the playbook/pod for 5 minutes, some kind of timeout mechanism kicks in and the pod gets terminated. I think these issues, or at least some of their comments, are also related to this:

#12297

#11484

I have the k8s cluster in AKS. I checked the load balancer: the idle timeout is set to the default (30 minutes). I checked the AWX settings: Default Job Idle Timeout is set to 0; I also tried a higher number (1800).

cmasopust commented 2 years ago

Looks like this may be somehow related to AKS; my AWX is also running in AKS. What Kubernetes version are you running? I'm now on 1.22.6 and, although I'm not totally sure, I don't think we had this issue with 1.21.7.

rudesome commented 2 years ago

We are using version 1.23.x, recently upgraded, but we also had this issue before the upgrade; it could be that this issue started with version 1.22.x.

I have checked all the settings in AKS; the only timeout settings are the ones I already mentioned, and they are set well beyond the 5-minute threshold. Maybe you could check the timeout settings of the LB in the AKS resource group, as mentioned here: https://github.com/ansible/awx/issues/12297#issuecomment-1151234788
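
For what it's worth, the idle timeout of the AKS-managed load balancer can also be inspected from the CLI; a rough sketch, with the node resource group name as a placeholder:

```bash
# The AKS-managed load balancer lives in the node resource group
# (usually named MC_<resource-group>_<cluster>_<region>).
NODE_RG="MC_myrg_mycluster_westeurope"

# Dump the load balancer config and look for the idle timeout values
# (default 30 minutes, i.e. well above the 5-minute threshold seen here).
az network lb list --resource-group "$NODE_RG" --output json | grep -i idleTimeout
```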

kurokobo commented 2 years ago

I was interested in this issue, so I set up a single-node AKS cluster to try it out, and sure enough, the issue reproduced. The automation-job pod is killed around 5 minutes after it starts.

$ kubectl -n awx describe pod automation-job-4-vqsht
Name:                      automation-job-4-vqsht
Namespace:                 awx
Priority:                  0
Node:                      aks-agentpool-13146441-vmss000000/10.224.0.4
Start Time:                Wed, 20 Jul 2022 12:55:24 +0000
Labels:                    ansible-awx=12e4dc0b-28c5-418d-b7a8-30803b17dfad
                           ansible-awx-job-id=4
Annotations:               <none>
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:                        10.244.0.14
IPs:
  IP:  10.244.0.14
Containers:
  worker:
    Container ID:  containerd://72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a
    Image:         quay.io/ansible/awx-ee:latest
    Image ID:      quay.io/ansible/awx-ee@sha256:833fcdf5211265040de6c4509130f04555f8a48548620172b73dfad949d6ea0c
...
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  5m10s  default-scheduler  Successfully assigned awx/automation-job-4-vqsht to aks-agentpool-13146441-vmss000000
  Normal  Pulling    5m9s   kubelet            Pulling image "quay.io/ansible/awx-ee:latest"
  Normal  Pulled     5m9s   kubelet            Successfully pulled image "quay.io/ansible/awx-ee:latest" in 782.62126ms
  Normal  Created    5m9s   kubelet            Created container worker
  Normal  Started    5m9s   kubelet            Started container worker     πŸ‘ˆπŸ‘ˆπŸ‘ˆ
  Normal  Killing    1s     kubelet            Stopping container worker     πŸ‘ˆπŸ‘ˆπŸ‘ˆ

In the kubelet log on the AKS node, "superfluous" HTTP responses are logged just before the termination of the automation-job pod starts.

Jul 20 13:00:04 aks-agentpool-13146441-vmss000000 update_certs.sh[26710]: done.
Jul 20 13:00:04 aks-agentpool-13146441-vmss000000 systemd[1]: Started Updates certificates copied from AKS DS.
πŸ‘‰Jul 20 13:00:31 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:00:31.682822   18003 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:00:33.518567   18003 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-4-vqsht]
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:00:33.518822   18003 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-4-vqsht" podUID=7632b872-b410-47d4-aac5-0c5bf4d718e8 containerName="worker" containerID="containerd://72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a" gracePeriod=30
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:00:33.519637681Z" level=info msg="StopContainer for \"72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a\" with timeout 30 (s)"
Jul 20 13:00:33 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:00:33.520551199Z" level=info msg="Stop container \"72b5fe1ac8c1a827755c0647ae25dfb9a8037dd40c434f10e416a1329f52bf6a\" with signal terminated"
...
...
...
Jul 20 13:13:09 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:13:09.452886   18003 logs.go:319] "Finished parsing log file" path="/var/log/pods/kube-system_kube-proxy-cd77k_2bb740b0-2a76-47f6-a11f-e376e1541e3b/kube-proxy/0.log"
Jul 20 13:13:27 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:13:27.836405   18003 kubelet_pods.go:898] "Unable to retrieve pull secret, the image pull may not succeed." pod="awx/awx-operator-controller-manager-86c5c8cb86-cd65l" secret="" err="secret \"redhat-operators-pull-secret\" not found"
πŸ‘‰Jul 20 13:13:58 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:13:58.528313   18003 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:14:00.579829   18003 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-5-ccmfl]
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:14:00.580066   18003 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-5-ccmfl" podUID=61f7a7b1-2789-4719-a399-b7783c7db6ea containerName="worker" containerID="containerd://4e538d3ec29f5852cee05d86735293d2d31a2180761b69fcf1a5a700b8ebcf40" gracePeriod=30
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:14:00.580732530Z" level=info msg="StopContainer for \"4e538d3ec29f5852cee05d86735293d2d31a2180761b69fcf1a5a700b8ebcf40\" with timeout 30 (s)"
Jul 20 13:14:00 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:14:00.581736647Z" level=info msg="Stop container \"4e538d3ec29f5852cee05d86735293d2d31a2180761b69fcf1a5a700b8ebcf40\" with signal terminated"
...
...
...
Jul 20 13:20:04 aks-agentpool-13146441-vmss000000 update_certs.sh[17902]: done.
Jul 20 13:20:04 aks-agentpool-13146441-vmss000000 systemd[1]: Started Updates certificates copied from AKS DS.
πŸ‘‰Jul 20 13:20:17 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:20:17.150600   18003 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:20:19.182475   18003 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-6-cwl4q]
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 kubelet[18003]: I0720 13:20:19.182738   18003 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-6-cwl4q" podUID=0f79b2be-b074-4f8e-a85c-fe0f8945bfb5 containerName="worker" containerID="containerd://18343fc7c8bb8b3fe3f4be625e1011c9998337715eda38cd6d3a99eda1585a16" gracePeriod=30
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:20:19.183387158Z" level=info msg="StopContainer for \"18343fc7c8bb8b3fe3f4be625e1011c9998337715eda38cd6d3a99eda1585a16\" with timeout 30 (s)"
Jul 20 13:20:19 aks-agentpool-13146441-vmss000000 containerd[17698]: time="2022-07-20T13:20:19.184326376Z" level=info msg="Stop container \"18343fc7c8bb8b3fe3f4be625e1011c9998337715eda38cd6d3a99eda1585a16\" with signal terminated"
...

And with the kubelet log level increased, resp=400 responses for /containerLogs/awx/automation-job-8-vhnqn/worker?follow=true are recorded.

Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.468573   27749 shared_informer.go:270] caches populated
Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.472238   27749 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
πŸ‘‰Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.472287   27749 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-8-vhnqn/worker?follow=true" latency="5m5.602241567s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.11:39456" resp=400
Jul 20 14:04:04 aks-agentpool-13146441-vmss000000 kubelet[27749]: I0720 14:04:04.569091   27749 shared_informer.go:270] caches populated
...
...
...
Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.113706   23353 shared_informer.go:270] caches populated
Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.148243   23353 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
πŸ‘‰Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.148298   23353 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-9-hbz7v/worker?follow=true" latency="5m5.387031969s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.11:54404" resp=400
Jul 20 14:25:11 aks-agentpool-13146441-vmss000000 kubelet[23353]: I0720 14:25:11.213989   23353 shared_informer.go:270] caches populated
...

My understanding is that this log stream is opened by Receptor to follow the output from the pods.

https://github.com/ansible/receptor/blob/05f3017f0a679187a7275a778bd75af1f696a3a0/pkg/workceptor/kubernetes.go#L295-L299

I still don't know the root cause of this 400 from the kubelet, but is it possible that Receptor got a 400 from the API, which disconnected Receptor's stream and led to the termination of the job?
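
The same follow-style request that Receptor makes can be imitated with kubectl, to check whether the stream itself is cut after roughly 5 quiet minutes independently of AWX; a sketch with a placeholder pod name:

```bash
# Follow the worker container's log stream the same way Receptor does (via the
# pod log subresource with follow=true) and time how long it survives while the
# job stays silent. Pod name is a placeholder.
time kubectl -n awx logs -f automation-job-4-vqsht -c worker

# Equivalent raw request against the API server:
kubectl get --raw "/api/v1/namespaces/awx/pods/automation-job-4-vqsht/log?container=worker&follow=true"
```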

kurokobo commented 2 years ago

FYI for further digging:

To create an interactive shell connection to an AKS node: https://docs.microsoft.com/en-us/azure/aks/node-access

kubectl debug node/<node> -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0
chroot /host
bash

Increasing log level for kubelet on AKS node:

# vi /etc/systemd/system/kubelet.service
...
ExecStart=/usr/local/bin/kubelet \
        --enable-server \
        --node-labels="${KUBELET_NODE_LABELS}" \
        --v=4 \   πŸ‘ˆπŸ‘ˆπŸ‘ˆ
        --volume-plugin-dir=/etc/kubernetes/volumeplugins \
        $KUBELET_TLS_BOOTSTRAP_FLAGS \
        $KUBELET_CONFIG_FILE_FLAGS \
...

# systemctl daemon-reload
# systemctl restart kubelet
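
After the restart, the follow requests and their latencies can be grepped out of the journal; a small sketch (the latency="5m5..." lines are the interesting ones):

```bash
# Look for the /containerLogs follow requests and the "superfluous" WriteHeader
# warnings around the time a job dies.
journalctl -u kubelet --since "1 hour ago" | grep -E "containerLogs|superfluous"
```
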
cmasopust commented 2 years ago

I've already looked at the kubelet logs, but I'm not sure that message is related to our issue, because this "superfluous" HTTP response is also logged for successful jobs...

rudesome commented 2 years ago

I guess every playbook in AWX will get this HTTP response when the automation pod is killed.

The question is: what is sending this after the 5 minutes (timeout)? Since this only happens with AKS, it must be something within Azure...?

@kurokobo, thank you for your time, confirmation, and investigation! May I ask what version of k8s you were running in AKS?

kurokobo commented 2 years ago

@cmasopust @Parkhost Oh, thanks for letting me know. Sorry for my insufficient investigation; you're right, I get the same logs for jobs that succeed. However, I am still a little suspicious of this log, since the order of the log entries seems to differ between success and failure.

In the successful case, a "superfluous" response is recorded after the termination process for the container and pod has been initiated.

13:20:05.859584    4474 manager.go:1048] Destroyed container: "/kubepods/burstable/podb86aae97-f997-4bf5-a722-9d6a306b1c43/b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22" (aliases: [b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22 /kubepods/burstable/podb86aae97-f997-4bf5-a722-9d6a306b1c43/b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22], namespace: "containerd")
...
13:20:06.206528    4474 generic.go:296] "Generic (PLEG): container finished" podID=b86aae97-f997-4bf5-a722-9d6a306b1c43 containerID="b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22" exitCode=0
...
13:20:06.206725    4474 kubelet_pods.go:1459] "Got phase for pod" pod="awx/automation-job-4-k4k5p" oldPhase=Running phase=Succeeded
13:20:06.206779    4474 kubelet.go:1546] "syncPod exit" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 isTerminal=true
13:20:06.206796    4474 pod_workers.go:978] "Pod is terminal" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 updateType=0
13:20:06.206809    4474 pod_workers.go:1023] "Pod indicated lifecycle completed naturally and should now terminate" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43
...
13:20:06.228010    4474 status_manager.go:682] "Patch status for pod" pod="awx/automation-job-4-k4k5p" patch="{\"metadata\":{\"uid\":\"b86aae97-f997-4bf5-a722-9d6a306b1c43\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"reason\":\"PodCompleted\",\"type\":\"Initialized\"},{\"lastTransitionTime\":\"2022-07-21T13:20:06Z\",\"reason\":\"PodCompleted\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2022-07-21T13:20:06Z\",\"reason\":\"PodCompleted\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"containerd://b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22\",\"image\":\"quay.io/ansible/awx-ee:latest\",\"imageID\":\"quay.io/ansible/awx-ee@sha256:cd8c98f825884cfb2e7765842755bde165d22bb3bb8f637646e8483391af9921\",\"lastState\":{},\"name\":\"worker\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"containerID\":\"containerd://b6db6d5e42ac4ff209d65f5ef5af916ef51e1d3c0bce27603ab9f8a15beb7f22\",\"exitCode\":0,\"finishedAt\":\"2022-07-21T13:20:05Z\",\"reason\":\"Completed\",\"startedAt\":\"2022-07-21T13:18:59Z\"}}}]}}"
...
13:20:06.228577    4474 kubelet.go:2117] "SyncLoop RECONCILE" source="api" pods=[awx/automation-job-4-k4k5p]
πŸ‘‰13:20:06.528412    4474 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
πŸ‘‰13:20:06.528468    4474 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-4-k4k5p/worker?follow=true" latency="1m6.436194772s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.12:49648" resp=400
13:20:06.536103    4474 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-4-k4k5p]
...
13:20:07.208846    4474 kubelet.go:1785] "syncTerminatingPod enter" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43
13:20:07.212456    4474 kubelet_pods.go:1447] "Generating pod status" pod="awx/automation-job-4-k4k5p"
...
13:20:07.213146    4474 kubelet.go:1817] "Pod terminating with grace period" pod="awx/automation-job-4-k4k5p" podUID=b86aae97-f997-4bf5-a722-9d6a306b1c43 gracePeriod=<nil>

However, in the cases where it fails, the "superfluous" response appears at the beginning of the pod's termination. After this log entry, the pod is marked for graceful deletion.

πŸ‘‰13:28:13.144435    4474 log.go:184] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader (httplog.go:269)
πŸ‘‰13:28:13.144497    4474 httplog.go:109] "HTTP" verb="GET" URI="/containerLogs/awx/automation-job-6-rhs8w/worker?follow=true" latency="5m5.473488996s" userAgent="Go-http-client/1.1" audit-ID="" srcIP="10.244.0.11:33806" resp=400
...
13:28:14.755714    4474 kubelet.go:2120] "SyncLoop DELETE" source="api" pods=[awx/automation-job-6-rhs8w]
13:28:14.755747    4474 pod_workers.go:625] "Pod is marked for graceful deletion, begin teardown" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755790    4474 pod_workers.go:888] "Processing pod event" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 updateType=1
13:28:14.755806    4474 pod_workers.go:1005] "Pod worker has observed request to terminate" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755819    4474 kubelet.go:1785] "syncTerminatingPod enter" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18
13:28:14.755831    4474 kubelet_pods.go:1447] "Generating pod status" pod="awx/automation-job-6-rhs8w"
13:28:14.755870    4474 kubelet_pods.go:1459] "Got phase for pod" pod="awx/automation-job-6-rhs8w" oldPhase=Running phase=Running
13:28:14.756073    4474 kubelet.go:1815] "Pod terminating with grace period" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 gracePeriod=30
13:28:14.756148    4474 kuberuntime_container.go:719] "Killing container with a grace period override" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 containerName="worker" containerID="containerd://0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" gracePeriod=30
13:28:14.756165    4474 kuberuntime_container.go:723] "Killing container with a grace period" pod="awx/automation-job-6-rhs8w" podUID=d51c5853-65d4-4823-8ea5-27a89272af18 containerName="worker" containerID="containerd://0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" gracePeriod=30
...
13:28:14.763017    4474 kubelet_pods.go:939] "Pod is terminated, but some containers are still running" pod="awx/automation-job-6-rhs8w"
...
13:28:18.864941    4474 manager.go:1048] Destroyed container: "/kubepods/burstable/podd51c5853-65d4-4823-8ea5-27a89272af18/0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac" (aliases: [0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac /kubepods/burstable/podd51c5853-65d4-4823-8ea5-27a89272af18/0929b4635502f8e8fc5d644992fd1c288398bc79c1c084ab4ae95e1cc3c995ac], namespace: "containerd")

For anyone interested in this, I have attached the full logs (/var/log/messages), including the kubelet at logging level --v=4, from the AKS node.

The pod info for my jobs is included in the logs.

@Parkhost I confirmed this using 1.22.11. Agreed, I think this is an AKS-related thing...

kurokobo commented 2 years ago

It seems that AKS, at least a few years ago, had a behavior that terminated connections to the API server if no data was sent for a certain period of time, even within the cluster: https://github.com/Azure/AKS/issues/1755. I am not familiar with AKS, but similar restrictions may have been added in recent AKS versions...?

In fact, running ansible.builtin.pause with minutes: 10 reproduces this issue, but repeating a task with minutes: 1 ten times in one job does not.

Thus, a workaround would seem to be having the EE pod periodically send some data to the stream, like a keep-alive.

I modified the entrypoint of the EE to keep sleep 60 and echo 'KEEP JOB ALIVE' > /proc/1/fd/1 running in the background (I know it's a dirty hack), and my job with the long pause is no longer interrupted, but sadly the exit status of the job is incorrect. Not a good workaround.
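
A sketch of that background loop, assuming /proc/1/fd/1 is the stdout of the pod's PID 1, i.e. the stream that Receptor follows:

```bash
# Emit something on the container's stdout every minute so the log stream never
# goes idle for 5 minutes. This is the naive first version; as noted above, it
# breaks the job's result handling because the text is not valid Runner JSON.
while true; do
  sleep 60
  echo 'KEEP JOB ALIVE' > /proc/1/fd/1
done &
```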

cmasopust commented 2 years ago

Can you show me how you modified the entrypoint? I tried something similar: I just connected to the pod and ran the echo command there. But it is the exit status of the task that is then incorrect, and therefore the whole playbook stops (have you tried adding another task after the pause task in the playbook?)

rudesome commented 2 years ago

I didn't expect this to work: I created a new AKS cluster with version 1.24, and it seems the automation pod doesn't get terminated at the 5-minute timeout. I'm skeptical about it; I don't think MS even knows this bug exists in versions 1.22 and 1.23, and since 1.24 is still in preview (in Azure) I'm not sure it is a valid option... maybe it breaks other things? On the other hand, AWX isn't meant to run in production either πŸ˜…

Some proof:

Cluster info:

NAME                              STATUS   ROLES   AGE     VERSION  OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
aks-offload-38942402-vmss00000d   Ready    agent   2d20h   v1.24.0  Ubuntu 18.04.6 LTS   5.4.0-1085-azure   containerd://1.6.4+azure-4

Pod info:

Name:         automation-job-7-8n5rr
Namespace:    awx
Priority:     0
Node:         aks-offload-38942402-vmss00000d/10.1.0.7
Start Time:   Fri, 22 Jul 2022 10:40:46 +0200
Labels:       ansible-awx=8b9bae95-621f-414d-bc52-4cdf3a010e99
              ansible-awx-job-id=7
Annotations:  <none>
Status:       Running
IP:           192.168.100.13
IPs:
  IP:  192.168.100.13
Containers:
  worker:
    Container ID:  containerd://fbfeac1d07d32c8cb5587ea7b200ff5ae712e67a50c33fc27317d28b03ffd077
    Image:         quay.io/ansible/awx-ee:latest
    Image ID:      quay.io/ansible/awx-ee@sha256:181ffd50d3cc08599d1452c5eb6624e217c7ff04a5741fa970c37c2905e1a2c8
    Port:          <none>
    Host Port:     <none>
    Args:
      ansible-runner
      worker
      --private-data-dir=/runner
    State:          Running
      Started:      Fri, 22 Jul 2022 10:40:47 +0200
    Ready:          True
    Restart Count:  0
...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  31m   default-scheduler  Successfully assigned awx/automation-job-7-8n5rr to aks-offload-38942402-vmss00000d
  Normal  Pulling    31m   kubelet            Pulling image "quay.io/ansible/awx-ee:latest"
  Normal  Pulled     31m   kubelet            Successfully pulled image "quay.io/ansible/awx-ee:latest" in 522.434807ms
  Normal  Created    31m   kubelet            Created container worker
  Normal  Started    31m   kubelet            Started container worker

The pod is still going strong; guess it's gonna reach the magic number of 69 :D

@kurokobo, how did you capture the 'describe' of the pod after it is stopped/terminated? I tried to get the describe output after the play finished, but kubectl errors out because the pod doesn't exist anymore. Would you share your kubectl voodoo?

kurokobo commented 2 years ago

@cmasopust It seems that the echoed string must be JSON that can be processed by the Runner on awx-task, e.g. {"uuid": "keepalive", "counter": 0}. Finally, I was able to get a long job (with multiple tasks after the pause) to work with the following hack.

However, I think this is an unreliable workaround. I haven't tested it enough, but if the original output from the Runner and the output from echo happen at the same time, the two JSON payloads may get mixed up and cause the job to fail. I don't believe this is the smartest way, but it is a workaround that works anyway.

Here is my workaround:

### Entrypoint script

Save as `entrypoint`. This is a file that a `while` loop (4 lines) is appended to [`entrypoint.sh` on ansible-runner](https://github.com/ansible/ansible-runner/blob/2.1.3/utils/entrypoint.sh).

```bash
#!/usr/bin/env bash

# In OpenShift, containers are run as a random high number uid
# that doesn't exist in /etc/passwd, but Ansible module utils
# require a named user. So if we're in OpenShift, we need to make
# one before Ansible runs.
if [[ (`id -u` -ge 500 || -z "${CURRENT_UID}") ]]; then

    # Only needed for RHEL 8. Try deleting this conditional (not the code)
    # sometime in the future. Seems to be fixed on Fedora 32
    # If we are running in rootless podman, this file cannot be overwritten
    ROOTLESS_MODE=$(cat /proc/self/uid_map | head -n1 | awk '{ print $2; }')
    if [[ "$ROOTLESS_MODE" -eq "0" ]]; then
        cat << EOF > /etc/passwd
root:x:0:0:root:/root:/bin/bash
runner:x:`id -u`:`id -g`:,,,:/home/runner:/bin/bash
EOF
        cat << EOF > /etc/group
root:x:0:runner
runner:x:`id -g`:
EOF
    fi
fi

if [[ -n "${LAUNCHED_BY_RUNNER}" ]]; then
    RUNNER_CALLBACKS=$(python3 -c "import ansible_runner.callbacks; print(ansible_runner.callbacks.__file__)")

    # TODO: respect user callback settings via
    # env ANSIBLE_CALLBACK_PLUGINS or ansible.cfg
    export ANSIBLE_CALLBACK_PLUGINS="$(dirname $RUNNER_CALLBACKS)"
fi

if [[ -d ${AWX_ISOLATED_DATA_DIR} ]]; then
    if output=$(ansible-galaxy collection list --format json 2> /dev/null); then
        echo $output > ${AWX_ISOLATED_DATA_DIR}/collections.json
    fi
    ansible --version | head -n 1 > ${AWX_ISOLATED_DATA_DIR}/ansible_version.txt
fi

SCRIPT=/usr/local/bin/dumb-init
# NOTE(pabelanger): Downstream we install dumb-init from RPM.
if [ -f "/usr/bin/dumb-init" ]; then
    SCRIPT=/usr/bin/dumb-init
fi

while /bin/true; do
    sleep 120
    echo '{"event": "FLUSH", "uuid": "keepalive", "counter": 0, "end_line": 0}' > /proc/1/fd/1
done &

exec $SCRIPT -- "${@}"
```

### Create configmap using `entrypoint`

```bash
kubectl -n awx create configmap awx-runner-entrypoint --from-file=entrypoint
```

### Define custom pod spec on AWX

Create new Container Group and put this YAML as custom pod spec.

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - image: quay.io/ansible/awx-ee:latest
      name: worker
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 250m
          memory: 100Mi
      volumeMounts:
        - mountPath: /usr/bin/entrypoint
          name: awx-runner-entrypoint-volume
          subPath: entrypoint
  volumes:
    - name: awx-runner-entrypoint-volume
      configMap:
        name: awx-runner-entrypoint
        items:
          - key: entrypoint
            path: entrypoint
            mode: 0775
```

Ideally, something like keep-alive would be implemented in Ansible Runner so that some string is always safely output at regular intervals.

@Parkhost Thanks for testing on 1.24. I am surprised, too, that it works with 1.24.

AWX works well on AWS and on-prem K3s, so I don't think we can say it is not suitable for production; there should be many AKS users, though. I have no idea, but does the same thing happen with Ansible Automation Controller?

how did you capture the 'describe' of the pod after it is stopped / terminated

Nothing special and no trick: simply invoke the kubectl describe command repeatedly at high frequency πŸ˜ƒ If the timing is right, you catch the pod in the Terminating state.

rudesome commented 2 years ago

I upgraded the 'production' AKS cluster from 1.23.8 to 1.24.0, but unlike with my newly created cluster, it didn't solve the issue (still an error after +/- 5 minutes) πŸ˜•

rudesome commented 2 years ago

Maybe it has something to do with the network profile settings?

My "timeout/failing" cluster is using: Type (plugin): Azure CNI; Network policy: Azure

While my "success" cluster is using: Type (plugin): Kubenet; Network policy: None

Going to try it out over the weekend.

cmasopust commented 2 years ago

@Parkhost I don't think it's related to the Kubernetes network. We first had a cluster with Azure CNI and Kubernetes version 1.21.x where no timeout happened. Then our Azure team redeployed the cluster with Kubenet and version 1.22.6 and we now have the timeout.

cmasopust commented 2 years ago

@kurokobo : thanks a lot for providing detailed instructions (for someone who's at the beginning of his Kubernetes journey πŸ˜„)

I've implemented your workaround and can confirm that my long-running job now succeeds. But when looking at the awx-task logs, I see the following errors each time the pod sends the keep-alive:

2022-07-23 08:49:53,444 DEBUG [-] awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(1)
2022-07-23 08:49:53,460 ERROR [-] awx.main.commands.run_callback_receiver Callback Task Processor Raised Exception: ZeroDivisionError('float division by zero')
2022-07-23 08:49:53,460 ERROR [-] awx.main.commands.run_callback_receiver Detail: Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/callback.py", line 258, in perform_work
    self.flush(force=flush)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/callback.py", line 200, in flush
    metrics_total_job_event_processing_seconds.total_seconds()
ZeroDivisionError: float division by zero

So I think we should try to find out what the receiver is expecting; maybe there's some no-op data we can send to it. I'll try to find out more over the weekend.

cmasopust commented 2 years ago

It looks like adding "event": "FLUSH" to the JSON data no longer produces the errors in awx-task, and the playbook still succeeds.

A quick look at the AWX code shows that an event of type FLUSH skips any further processing of the received data and therefore produces no exception πŸ˜„ (see awx/main/dispatch/worker/callback.py).

But whether that's really a good idea, or the right event to send as a keep-alive, can only be answered by the AWX devs.

cmasopust commented 2 years ago

I have some bad news: although the workaround now works perfectly well for playbook runs, it breaks inventory updates! So only half of my problems are solved πŸ˜’

kurokobo commented 2 years ago

@cmasopust Thanks for the update and good find on FLUSH. I confirmed that the echo breaks Inventory Sync, and found that appending "end_line": 0 makes it work.

echo '{"event": "FLUSH", "uuid": "keepalive", "counter": 0, "end_line": 0}' > /proc/1/fd/1

I've updated my entrypoint script in my comment above: https://github.com/ansible/awx/issues/12530#issuecomment-1192616101

Also, regarding the following concern from my previous comment: removing the sleep and allowing as many repeated echoes as possible did not corrupt the output JSON (I tried dozens of times).

However, I think this is an unreliable workaround. I've not tested enough but this is because if the original output from the Runner and output from echo are done at the same time, the two JSON may get mixed up then this causes job failure.

So, I believe it's not the smartest way, but it seems to be a workaround that works anyway.

Of course, the best way to handle this would be for AKS to make some changes on its side to support long-lived streams. I hope Parkhost finds something πŸ˜ƒ

cmasopust commented 2 years ago

@kurokobo you're my hero πŸ˜„ I can confirm that the inventory update is now working in my AWX too! Thanks a lot for your great help!!

ahatting commented 2 years ago

We also have this issue on our AKS test (running Kubernetes 1.24) and production (running Kubernetes 1.23.8) clusters. We applied the workaround supplied by @kurokobo, and I can confirm that our jobs now finish as expected. Great! We are running AWX 21.0.0 with ansible-core 2.13.1.

C0rn3j commented 2 years ago
AWX: 21.3.0
Kubernetes Client Version: v1.24.2
Kubernetes Server Version: v1.24.3
Kustomize Version: v4.5.4

Create new Container Group and put this YAML as custom pod spec.

I got very confused here, as I am new to both Kubernetes and AWX, and tried creating a hotfix.yaml and applying it via kubectl. That's completely wrong; what's actually needed is to create a new Instance Group (Container Group) under the AWX Administration tab and make sure the templates/syncs use that new instance group.

Unfortunately for me, that did not resolve my issue, probably because in my case the automation-job does not get killed after 5 minutes; it just breaks after 5-6.5 minutes and gets killed about 2 minutes after that.

(screenshot attached)

While in the 'broken' state it stops outputting to the AWX console (at a random point in the playbook each time), but the playbook still seems to run for a while when I observe the process list on the Kubernetes host.

kurokobo commented 2 years ago

@C0rn3j Hi, according to the image you attached earlier, your job seems to have a lot of output (over 4000 lines), so I guess it is a log size issue.

AWX has an issue with jobs aborting if Kubelet log rotation occurs while the job is running: https://github.com/ansible/awx/issues/10366 and https://github.com/ansible/awx/issues/11338

The kubelet in AKS has a default log size limit of 10 MB, so if the log of the automation-job pod exceeds this size, the job will be terminated. You can check the actual log size under /var/log/pods/<pod>/<container>/*.log on your AKS node by following this documentation, and you can increase containerLogMaxSizeMB by recreating the node pool with a kubeletconfig.json.
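
A rough sketch of that kubelet tweak (resource names and values are placeholders; check the current AKS custom node configuration docs for the exact supported fields):

```bash
# Hypothetical kubeletconfig.json raising the per-container log size limit.
cat > kubeletconfig.json <<'EOF'
{
  "containerLogMaxSizeMB": 50,
  "containerLogMaxFiles": 5
}
EOF

# Add a new node pool that uses the custom kubelet configuration.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name bigjoblogs \
  --kubelet-config ./kubeletconfig.json
```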

philipsd6 commented 2 years ago

In this (longish) comment I will address the 5-minute timeout issue in AKS only! (Not the log size issue or the 4-hour timeout issue.)

As Microsoft partners, we have escalated this issue and engaged engineers to identify the root cause and discuss options. The issue relates to the method by which the Azure managed control plane is made available to the customer managed vnet containing the worker nodes and pods. In previous versions they have relied on tunneld/openvpn to build the tunnel to the vnet, but upon recommendation from the k8s community, they are now using Konnectivity. This issue is AKS specific because they added some custom code to add a 5 minute timeout as a precaution to avoid leaking connections. As this Github issue shows, this issue is very difficult for end-user/developers to diagnose, as the problem appears to be with the API server, but it is actually with the tunnel to the API server, which is not visible to customers.

A Microsoft engineer has confirmed this diagnosis with this Pause playbook: noting that it fails after 5 minutes, then adjusting the Konnectivity timeout to 7 minutes and confirming that it then fails at 7 minutes.

The feedback we received was that adjusting this timeout would have performance impact across clusters using Konnectivity, and therefore will not be done. The AKS Konnectivity feature owners suggest that AWX should be sending keepalives and/or gracefully handling terminated http2 streams instead.

Alternatively, the API Server VNet Integration preview feature is available (albeit currently only in USA regions) which instead of setting up a tunnel:

...projects the API server endpoint directly into a delegated subnet in the VNet where AKS is deployed. This enables network communication between the API server and the cluster nodes without any required private link or tunnel."

The Microsoft engineer has confirmed with the same Pause playbook that the issue does not occur when AKS is deployed with this preview feature.

Additional links from our ticket:

ecurti commented 2 years ago

@kurokobo thank you for your workaround. You saved me! :)

I can confirm that it is working on AWX version 21.6.0 and AKS 1.23.12.

markstoel commented 1 year ago

@kurokobo I also thank you for your workaround. It was a lifesaver. But I have since upgraded to AKS 1.25.4 and AWX 21.11.0, and the problem now seems to be resolved.

kurokobo commented 1 year ago

A native periodic keep-alive message has been implemented in Ansible Runner 2.3.2, and AWX has been updated to use it in 21.14.0.

To activate the keep-alive message, upgrade AWX to 21.14.0 or later and set a non-zero number of seconds for the keep-alive interval ("K8S Ansible Runner Keep-Alive Message Interval") in the Job Settings.
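
If the UI is not handy, the same thing can be done through the settings API; a sketch, assuming the setting name in your AWX version matches and using placeholder host, token, and interval values:

```bash
# Set the keep-alive interval (in seconds) via the jobs settings endpoint.
AWX_HOST="https://awx.example.com"
TOKEN="<oauth2-token>"

curl -sk -X PATCH \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"AWX_RUNNER_KEEPALIVE_SECONDS": 240}' \
  "${AWX_HOST}/api/v2/settings/jobs/"
```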

@AlanCoding @nitzmahone Thanks for implementing that feature πŸ˜„ I believe this issue can be closed now.