ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

AWX stops gathering job output if kubernetes starts a new log #11338

Closed. dbanttari closed this issue 1 year ago.

dbanttari commented 2 years ago

Please confirm the following

Summary

If a very large job is run, Kubernetes will sometimes rotate the container log and start a new one, but AWX's reporting of job progress stalls. The job eventually ends in an Error state in AWX, even though the job itself continues to completion in the pod.

AWX version

19.4.0

Installation method

kubernetes

Modifications

yes

Ansible version

2.11.3.post0

Operating system

Azure Kubernetes Service 1.21.2

Web browser

Chrome

Steps to reproduce

We can tail the log of the pod in real time using kubectl

->k logs -f -n awx automation-job-29506-2qlhn
...
...
...
{"uuid": "021508bc-6424-4b63-a394-a42f0aaeab36", "counter": 42648, "stdout": "", "start_line": 23274, "end_line": 23274, "runner_ident": "29506", "event": "runner_on_start", "job_id": 29506, "pid": 19, "created": "2021-11-11T19:51:19.974410", "parent_uuid": "261db21c-7d3e-ca0c-65e3-000000042eae", "event_data": {"playbook": "site.yml", "playbook_uuid": "3cc1df7c-23bc-4641-8262-1eaf9a73ea2f", "play": "tag_role_edapp:tag_role_eda:tag_role_edweb:tag_role_edw:tag_role_rhb*:tag_role_rwb1*:tag_role_vhb*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa:tag_role_wvd1", "play_uuid": "261db21c-7d3e-ca0c-65e3-00000000031f", "play_pattern": "tag_role_edapp:tag_role_eda:tag_role_edweb:tag_role_edw:tag_role_rhb*:tag_role_rwb1*:tag_role_vhb*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa:tag_role_wvd1", "task": "Create component in udeploy", "task_uuid": "261db21c-7d3e-ca0c-65e3-000000042eae", "task_action": "uri", "task_args": "", "task_path": "/runner/project/roles/udeploy-agent/tasks/add_component.yml:26", "role": "udeploy-agent", "host": "**********", "uuid": "021508bc-6424-4b63-a394-a42f0aaeab36"}}

This window stalls until the job is complete and the pod exits.

Opening a new window, the log resumes:

->k logs -f -n awx automation-job-29506-2qlhn
{"uuid": "7193d6f6-1f7e-48c6-8f4d-b764817d31cb", "counter": 56862, "stdout": "\u001b[0;32mok: [dvuc-rhb1b-vm0]\u001b[0m", "start_line": 32869, "end_line": 32870, "runner_ident": "29506", "event": "runner_on_ok", "job_id": 29506, "pid": 19, "created": "2021-11-11T20:03:21.402116", "parent_uuid": "261db21c-7d3e-ca0c-65e3-0000000505be", "event_data": {"playbook": "site.yml", "playbook_uuid": "3cc1df7c-23bc-4641-8262-1eaf9a73ea2f", "play": "tag_role_rhb*:tag_role_rwb1*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa", "play_uuid": "261db21c-7d3e-ca0c-65e3-0000000003c3", "play_pattern": "tag_role_rhb*:tag_role_rwb1*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa", "task": "Make renamed executable for easier troubleshooting in multi-instance installations", "task_uuid": "261db21c-7d3e-ca0c-65e3-0000000505be", "task_action": "ansible.windows.win_copy", "task_args": "", "task_path": "/runner/project/roles/tomcat/tasks/install-tomcat.yml:20", "role": "tomcat", "host": "***********", "remote_addr": "**********", "res": {"src": "**********", "dest": "**********", "checksum": null, "size": 131072, "original_basename": "tomcat9.exe", "changed": false, "operation": "file_copy", "_ansible_no_log": false}, "start": "2021-11-11T20:03:20.256071", "end": "2021-11-11T20:03:21.401776", "duration": 1.145705, "event_loop": null, "uuid": "7193d6f6-1f7e-48c6-8f4d-b764817d31cb"}}
...
...
...
{"status": "failed", "runner_ident": "29506"}
{"zipfile": 85344}
*** zipfile contents here ***
{"eof": true}

Note that this example is discontiguous, which implies that this may happen multiple times during a pod's life
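
For anyone who wants to confirm that it is log rotation (not the job) interrupting the stream, here is a rough check on the node hosting the job pod, assuming typical kubelet paths and the default worker container name:

# More than one file here means kubelet has already rotated this container's log at least once;
# kubectl logs -f only follows the current 0.log, which is why the stream appears to stall.
sudo ls -lh /var/log/pods/awx_automation-job-29506-2qlhn_*/worker/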

Expected results

Expected AWX to continue reporting the progress of the job and to record its proper completion status (in this case, "Failed").

Actual results

The last item presented by AWX was the last entry in the log before it was rotated and started anew:

[Screenshot: Screen Shot 2021-11-11 at 2 10 28 PM]

Additional information

The Ansible pod image is based on awx-ee:0.6.0 but adds tools such as helm, openssl, the Azure CLI, zabbix-api, and other libraries used by various tasks.

Installation was done via AWX Operator 0.14.0

dbanttari commented 2 years ago

This seems similar/related? https://github.com/kubernetes/kubernetes/issues/28369

shanemcd commented 2 years ago

We recently saw this in an AKS deployment. I've verified a workaround: set the kubelet --container-log-max-size flag to a higher value, or use containerLogMaxSize in a kubelet config file.
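
For reference, a minimal sketch of the config-file variant, assuming a kubelet that reads a KubeletConfiguration file (the path and values below are examples; containerLogMaxSize defaults to 10Mi):

$ cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
...
containerLogMaxSize: 500Mi
containerLogMaxFiles: 5
$ sudo systemctl restart kubelet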

chrismeyersfsu commented 2 years ago

https://github.com/ansible/awx/issues/10366#issuecomment-951852069 @dbanttari this might fix your issue. Note that the modification below is more generic and doesn't rely on max-pods being set to 30. Can you try it and report back?

if ! grep -q container-log-max-size /etc/default/kubelet; then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi
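
A quick sanity check afterwards (just a suggestion; the flag should appear exactly once and kubelet should still be running):

grep -o -- '--container-log-max-size=[^ ]*' /etc/default/kubelet | uniq -c
systemctl is-active kubelet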

Tioborto commented 2 years ago

Hello, we are in the same situation with our AWX instance (same version). We are currently on a GKE cluster, so we don't have control over the kubelet configuration. Do you plan to fix this soon?

shanemcd commented 2 years ago

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

Tioborto commented 2 years ago

Hello @shanemcd, this is a real problem for us. We have users with jobs in "ERROR" status while we have the full logs in our external tool. Why is this not a top priority? This is a regression from previous versions.

bschaeffer commented 2 years ago

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

I feel like running this on Kubernetes 1.20+ should be called out as strictly unsupported and broken.

bschaeffer commented 2 years ago

From #11511:

FWIW, this does seem to be related to the container runtime. In GKE, we recently upgraded to 1.20, which defaults to containerd, at the same time we saw this error. When we rolled back to 1.19 it was also broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but that runtime will be EOL for us in about 5 months.

MalfuncEddie commented 2 years ago

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

I agree this is a containerd/k8s issue. But it is also best practice to limit logs from filling up filesystems.

Still, relying on the logging system of k8s does not seem reliable.

Are the following options valid? Move to OpenShift? Is it possible to move the execution nodes to podman (and does podman have this issue?)

oliverf1 commented 2 years ago

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

If this https://github.com/kubernetes/kubernetes/issues/59902 was implemented, would not that help?

I can also confirm it is an issue for us too. For some problematic playbooks with lots of output and multiple hosts, the only solution we found was to create a pod that runs the playbook directly on the k8s cluster, i.e. outside of AWX :-( In our case, increasing the log size is not an option: we have other workloads on the cluster, and increasing the log file size would drastically increase disk space usage.
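
Roughly what our escape hatch looks like, as a hypothetical sketch only: the namespace, image, inventory and playbook paths are placeholders, and it assumes the project is baked into the execution environment image.

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: manual-playbook-run
  namespace: awx
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: quay.io/ansible/awx-ee:latest
          command: ["ansible-playbook", "-i", "/runner/inventory/hosts", "/runner/project/site.yml"]
EOF
kubectl logs -f -n awx job/manual-playbook-run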

Thanks

chris93111 commented 2 years ago

Hi same problem in k3s, this is a real problem for us also

RHEL 8.4
Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: v1.5.8-k3s1
RuntimeApiVersion: v1alpha2
k3s 1.22

Fixed with kubelet --container-log-max-size

jogleasonjr commented 2 years ago

Using AKS and Terraform we were able to get enough headroom by specifying:

resource "azurerm_kubernetes_cluster_node_pool" "pool" { 
  ...

  kubelet_config = {
    container_log_max_line    = 5
    container_log_max_size_mb = 500
  }
  ...
}

The default, I believe, was 10MB.
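
For AKS clusters managed outside Terraform, the equivalent seems to be custom node configuration at node pool creation time; the resource names below are placeholders, and the JSON keys (containerLogMaxSizeMB, containerLogMaxFiles) should be double-checked against the AKS custom node configuration docs:

$ cat kubeletconfig.json
{
  "containerLogMaxSizeMB": 500,
  "containerLogMaxFiles": 5
}
$ az aks nodepool add --resource-group myRG --cluster-name myCluster \
    --name awxpool --kubelet-config ./kubeletconfig.json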

def1nity commented 2 years ago

#10366 (comment) @dbanttari this might fix your issue. Note that the modification below is more generic and doesn't rely on max-pods being set to 30. Can you try it and report back?

if ! grep -q container-log-max-size /etc/default/kubelet; then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi

Hey, I tried your suggestion but included --container-log-max-files=3 before I read that 5 is the default value... And even though the DaemonSet pods failed to start with CrashLoopBackOff and I deleted them a couple of minutes later, somehow the kubelet config file now looks like this:

KUBELET_FLAGS=--address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.0 --cluster-domain=cluster.local --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=RotateKubeletServerCertificate=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false --kube-reserved=cpu=100m,memory=1843Mi --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=35 --container-log-max-size=500Mi --container-log-max-files=3 [the pair --container-log-max-size=500Mi --container-log-max-files=3 repeats many more times here] --network-plugin=cni --node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.5 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --protect-kernel-defaults=true --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --rotate-certificates=false --streaming-connection-idle-timeout=4h --tls-cert-file=/etc/kubernetes/certs/kubeletserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/etc/kubernetes/certs/kubeletserver.key
KUBELET_REGISTER_SCHEDULABLE=true
NETWORK_POLICY=azure

KUBELET_NODE_LABELS=kubernetes.azure.com/role=agent,agentpool=nodepoold,kubernetes.azure.com/agentpool=nodepoold,storageprofile=managed,storagetier=Premium_LRS,kubernetes.azure.com/storageprofile=managed,kubernetes.azure.com/storagetier=Premium_LRS,kubernetes.azure.com/os-sku=Ubuntu,kubernetes.azure.com/cluster=MC_rg_cluster_westeurope,kubernetes.azure.com/mode=system,kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2containerd-2021.09.28,nodepool=defaultnodepoolnew

root@aks-nodepoolD:/etc/default#

Funnily enough, everything works exactly as before and, as far as I can tell, nothing was removed; the two parameters just got added a gazillion times... Anyway, I would like to clean up this mess and restore the default kubelet config file if possible.

Could anyone suggest how to fix this, please?

ThomasADavis commented 2 years ago

Luckily for us, I've also set up ARA, so when the log file rotates, ARA still has the complete job data.
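
In case it helps anyone, this is roughly how the ARA side is wired up, sketched from ARA's documented callback setup; the server URL is a placeholder, and in AWX these typically end up as environment variables on the job or baked into the execution environment:

export ANSIBLE_CALLBACK_PLUGINS="$(python3 -m ara.setup.callback_plugins)"
export ARA_API_CLIENT=http
export ARA_API_SERVER=http://ara.example.org
ansible-playbook site.yml    # every event is recorded in ARA, independent of the pod log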

Tioborto commented 2 years ago

Any update on the resolution of this problem, @shanemcd? Please.

AlanCoding commented 2 years ago

Also see some good quality research in the ansible-runner issue https://github.com/ansible/ansible-runner/issues/998

This is particularly a problem for inventory updates that produce large inventories, because the archive of the artifacts necessarily has to be large, and the way that is sent requires precise byte alignment or else it won't work.

Trying to think of where / how a patch might be developed... a change to awx-operator might just increase the size that it can handle but not fundamentally address the issue. If I could reproduce it, maybe I could get the syntax of what the controller node reads, and figure out how to identify these breaks in the ansible-runner code.

cerberek commented 2 years ago

For me the same behavior happens when the automation-job pod finishes and is destroyed. In the pod logs I could see that everything completed correctly, but the AWX job ends in Error with trimmed log output.

kdelee commented 2 years ago

@AlanCoding to force a reproducer, you could try setting kubelet --container-log-max-size to a very small number
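
Sketching that out, something like the following should force a rotation on almost any job, assuming the node's kubelet reads a KubeletConfiguration file (the path and the 1Mi value are assumptions):

KUBELET_CONFIG=/var/lib/kubelet/config.yaml
if grep -q '^containerLogMaxSize:' "$KUBELET_CONFIG"; then
  sudo sed -i 's/^containerLogMaxSize:.*/containerLogMaxSize: 1Mi/' "$KUBELET_CONFIG"
else
  echo 'containerLogMaxSize: 1Mi' | sudo tee -a "$KUBELET_CONFIG"
fi
sudo systemctl restart kubelet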

ethemcemozkan commented 2 years ago

Any update about the resolution?

andreasbourges commented 2 years ago

Hi same problem in k3s, this is a real problem for us also

RHEL 8.4
Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: v1.5.8-k3s1
RuntimeApiVersion: v1alpha2
k3s 1.22

Fixed with kubelet --container-log-max-size

Hi - trying to apply the mentioned workaround for k3s too, but I'm currently unsure where and how to apply the container-log-max-size configuration. Could you please give me a hint where and how you deployed it?

Thanks,

Andreas

kurokobo commented 2 years ago

@andreasbourges Modify /etc/systemd/system/k3s.service and restart K3s,

$ cat /etc/systemd/system/k3s.service
...
ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \     👈👈👈
        'container-log-max-files=4' \     👈👈👈
        '--kubelet-arg' \     👈👈👈
        'container-log-max-size=50Mi' \     👈👈👈

$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s

Or re-install K3s using script with specific arguments,

$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"

andreasbourges commented 2 years ago

@andreasbourges Modify /etc/systemd/system/k3s.service and restart K3s,

$ cat /etc/systemd/system/k3s.service
...
ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \     👈👈👈
        'container-log-max-files=4' \     👈👈👈
        '--kubelet-arg' \     👈👈👈
        'container-log-max-size=50Mi' \     👈👈👈

$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s

Or re-install K3s using script with specific arguments,

$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"

Thanks a lot - this saves me a lot of time!

Highly appreciated!

anasaizg commented 2 years ago

Hi! I have been reading all your valuable information and tests for the last few days, because we are facing the same issue. I just wanted to let you know that, in our case, changing the log configuration didn't solve it, so I'm afraid we have hit the wall of the "4h maximum connection time" discussed in this other issue:

https://github.com/ansible/awx/issues/11805#issuecomment-1249169747
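
If it is the idle-timeout wall, the limit should at least be visible on the node. A rough check (the flag also appears in the AKS config dump earlier in this thread; the config-file equivalent is streamingConnectionIdleTimeout):

pgrep -af kubelet | tr ' ' '\n' | grep -- '--streaming-connection-idle-timeout'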

andreasbourges commented 2 years ago

Hi,

thanks for sharing your experience - I adjusted the amount of data processed in the playbook (way too much information was gathered via an HTTP request to Autobot) and I adjusted the container log size.

What I can tell is that opening the output log of the aborted tasks still triggers uwsgi consuming GBs of memory. We're hoping that reducing the memory consumption will prevent the tasks from failing and thus avoid the uwsgi problem.

Well, let's see.

Thanks,

Andreas

Edit by Gundalow to tidy up formatting

bschaeffer commented 2 years ago

Just want to say that increasing the size before log rotation does not solve the problem. It just decreases the likelihood that your run will trigger a log rotation that causes the error. It's not a solution to this problem in any way.

andreasbourges commented 2 years ago

Hi Braden,

I totally agree - this is a non-deterministic workaround, but in combination with the reduction of the generated logs we're running fine so far - not a single error since the change (before, we had a 50:50 chance). And in the absence of any alternatives, we are happy to have this one.

Do you know of any other ways to solve this?

Thanks,

Andreas


TheRealHaoLiu commented 1 year ago

fixed in https://github.com/ansible/receptor/pull/683/

domq commented 1 year ago

fixed in ansible/receptor#683

Not for me — I seem to still get only the first chunk of logs. Running AWX receptor version 1.4.1 against OpenShift 3.11's (I know, I know) Kubernetes 1.11.0+d4cacc0.

It is worth noting that the /tmp/receptor/awx-0/*/stdout files do not end with (or contain) an { "eof": true } marker in this case, indicating that Receptor at least understands that it has work remaining.
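
For anyone checking their own instance, a quick way to spot runs that were cut short, using the same paths as above (run inside the AWX task container):

# grep -L lists the stdout files that never received the terminal {"eof": true} event
grep -L '"eof": true' /tmp/receptor/awx-0/*/stdout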

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system.

Indeed. I seem to remember that at some point, AWX used to rely on RabbitMQ for this task?

Klaas- commented 10 months ago

This issue finally seems to be fixed for me with Kubernetes 1.29.