This seems similar/related? https://github.com/kubernetes/kubernetes/issues/28369
We recently saw this in an AKS deployment. I've verified a workaround: set kubelet --container-log-max-size to a higher value, or use containerLogMaxSize in a kubelet config file.
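For the config-file route, the equivalent KubeletConfiguration fields are containerLogMaxSize and containerLogMaxFiles. A minimal sketch, assuming the kubelet reads its config from /var/lib/kubelet/config.yaml (the path varies by distribution, and 500Mi is simply the value discussed in this thread):
# Append the log-rotation fields to the kubelet's config file
# (edit in place instead if the keys already exist), then restart the kubelet
cat <<'EOF' >> /var/lib/kubelet/config.yaml
containerLogMaxSize: 500Mi   # kubelet default is 10Mi
containerLogMaxFiles: 5      # kubelet default is 5
EOF
systemctl restart kubelet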
https://github.com/ansible/awx/issues/10366#issuecomment-951852069 @dbanttari this might fix your issue. Note that the modification below is more generic and doesn't rely on max-pods being set to 30. Can you try it and report back?
if ! grep -q container-log-max-size /etc/default/kubelet; then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi
Hello, we are seeing the same issue with our AWX instance (same version). We run in a GKE cluster, so we don't have control over the kubelet configuration. Do you plan to fix this soon?
I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.
Hello @shanemcd, this is a real problem for us. We have users with jobs in "ERROR" status while we have the full logs in our external tool. Why is this not a top priority? This is a regression from previous versions.
I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.
Running this on Kubernetes 1.20+ should be called out as strictly unsupported and broken, I feel.
From #11511:
FWIW, this does seem to be related to container runtime. In GKE, we recently upgraded to 1.20 which defaults to containerd at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8.
When we switched the runtime to docker://19.3.15, it actually fixed our problem.
So things are now in a working state, but it will be EOL for us in about 5 months.
I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.
I agree this is a containerd/k8s issue. But it is also best practice to limit logs so they don't fill up filesystems.
Relying on the k8s logging system does not seem reliable, though.
Are the following options valid? Move to OpenShift? Is it possible to move the execution nodes to podman (and does podman have this issue?)
I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.
If this https://github.com/kubernetes/kubernetes/issues/59902 were implemented, wouldn't that help?
I can also confirm it is an issue for us too. For some problematic playbooks with lots of output and multiple hosts, the only solution we found was to create a playbook that creates a pod which runs the playbook directly on the k8s cluster, so outside of AWX :-( In our case, increasing the log size is not an option, because we have other workloads on the cluster and increasing the log file size would drastically increase disk usage.
Thanks
Hi, same problem in k3s; this is a real problem for us also.
RHEL 8.4, Version: 0.1.0, RuntimeName: containerd, RuntimeVersion: v1.5.8-k3s1, RuntimeApiVersion: v1alpha2, k3s 1.22
Fixed with kubelet --container-log-max-size
Using AKS and Terraform we were able to get enough headroom by specifying:
resource "azurerm_kubernetes_cluster_node_pool" "pool" {
...
kubelet_config = {
container_log_max_line = 5
container_log_max_size_mb = 500
}
...
}
The default, I believe, was 10MB.
#10366 (comment) @dbanttari this might fix your issue. Note that the modification below is more generic and doesn't rely on max-pods being set to 30. Can you try it and report back?
if ! grep -q container-log-max-size /etc/default/kubelet; then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi
Hey, I tried your suggestion but included --container-log-max-files=3 before I read that 5 is the default value... And even though the daemonset pods failed to start with CrashLoopBackOff and I deleted them a couple of minutes later, somehow the kubelet config file now looks like this:
KUBELET_FLAGS=--address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.0 --cluster-domain=cluster.local --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=RotateKubeletServerCertificate=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false --kube-reserved=cpu=100m,memory=1843Mi --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=35 --container-log-max-size=500Mi --container-log-max-files=3 [--container-log-max-size=500Mi --container-log-max-files=3 repeated many more times] --network-plugin=cni --node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.5 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --protect-kernel-defaults=true --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --rotate-certificates=false --streaming-connection-idle-timeout=4h --tls-cert-file=/etc/kubernetes/certs/kubeletserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/etc/kubernetes/certs/kubeletserver.key
KUBELET_REGISTER_SCHEDULABLE=true
NETWORK_POLICY=azure
KUBELET_NODE_LABELS=kubernetes.azure.com/role=agent,agentpool=nodepoold,kubernetes.azure.com/agentpool=nodepoold,storageprofile=managed,storagetier=Premium_LRS,kubernetes.azure.com/storageprofile=managed,kubernetes.azure.com/storagetier=Premium_LRS,kubernetes.azure.com/os-sku=Ubuntu,kubernetes.azure.com/cluster=MC_rg_cluster_westeurope,kubernetes.azure.com/mode=system,kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2containerd-2021.09.28,nodepool=defaultnodepoolnew
Funnily enough, everything works exactly as before and, as far as I can tell, nothing was removed; the two parameters just got added a gazillion times... Anyway, I would like to fix this mess and restore the default kubelet config file if possible.
Could anyone suggest how to fix this please?
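One possible cleanup, sketched here and untested against the exact file above (back up /etc/default/kubelet first), is to strip every duplicated copy of the two flags and then re-add a single one:
cp /etc/default/kubelet /etc/default/kubelet.bak
# Remove every occurrence of the duplicated flag pair...
sed -E -i 's/ --container-log-max-size=500Mi --container-log-max-files=3//g' /etc/default/kubelet
# ...then add exactly one copy back after max-pods, as in the original workaround
sed -E -i 's/max-pods=([0-9]+)/max-pods=\1 --container-log-max-size=500Mi --container-log-max-files=3/' /etc/default/kubelet
systemctl restart kubelet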
Lucky for us, I've also set up ARA, so when the log file rotates, ARA still has the complete job data.
Any update on the resolution of this problem, @shanemcd? Please.
Also see some good quality research in the ansible-runner issue https://github.com/ansible/ansible-runner/issues/998
This is particularly a problem for inventory updates that produce large inventories, because the archive of the artifacts necessarily has to be large, and the way that is sent requires precise byte alignment or else it won't work.
Trying to think of where / how a patch might be developed... a change to awx-operator might just increase the size that it can handle but not fundamentally address the issue. If I could reproduce it, maybe I could get the syntax of what the controller node reads, and figure out how to identify these breaks in the ansible-runner code.
For me the same behavior happens when the automation-job pod finishes and is destroyed. In the pod logs I could see that everything completed correctly, but the AWX job ends in Error with truncated log output.
@AlanCoding to force a reproducer, you could try setting kubelet --container-log-max-size to a very small number.
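For example, reusing the /etc/default/kubelet approach from earlier in the thread (a sketch; 1Mi is an arbitrary small value chosen so that any chatty job triggers a rotation quickly):
# Force frequent log rotation on a test node, then run a job with lots of output
sed -E -i 's/max-pods=([0-9]+)/max-pods=\1 --container-log-max-size=1Mi/' /etc/default/kubelet
systemctl restart kubelet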
Any update on the resolution?
Hi, same problem in k3s; this is a real problem for us also.
RHEL 8.4, Version: 0.1.0, RuntimeName: containerd, RuntimeVersion: v1.5.8-k3s1, RuntimeApiVersion: v1alpha2, k3s 1.22
Fixed with kubelet --container-log-max-size
Hi - trying to apply the mentioned workaround for k3s, too, but currently I'm unsure where and how to apply the container-log-max-size configuration. Could you please give me a hint about where and how you deployed it?
Thanks,
Andreas
@andreasbourges Modify /etc/systemd/system/k3s.service and restart K3s:
$ cat /etc/systemd/system/k3s.service
...
ExecStart=/usr/local/bin/k3s \
server \
'--write-kubeconfig-mode' \
'644' \
'--kubelet-arg' \ 👈👈👈
'container-log-max-files=4' \ 👈👈👈
'--kubelet-arg' \ 👈👈👈
'container-log-max-size=50Mi' \ 👈👈👈
$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s
Or re-install K3s using script with specific arguments,
$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"
Thanks a lot - this saves me a lot of time!
Highly appreciated!
Hi! I have been reading all your valuable information and tests for the last few days, because we are facing the same issue. I just wanted to let you know that, in our case, changing the log configuration didn't solve it, so I'm afraid we have hit the wall of the "4h maximum connection time" discussed in this other issue:
https://github.com/ansible/awx/issues/11805#issuecomment-1249169747
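For reference, 4h matches the kubelet's streaming-connection-idle-timeout default, which is also visible in the AKS flag dump earlier in this thread. If that is indeed the limit being hit (an assumption, not something confirmed here), it can be raised the same way as the log-size flags, e.g. on k3s:
# Sketch: raise the kubelet streaming idle timeout on a k3s node (8h is illustrative)
curl -sfL https://get.k3s.io | sh -s - --kubelet-arg "streaming-connection-idle-timeout=8h"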
Hi,
thanks for sharing your experience - I adjusted the amount of data processed in the playbook (way too much information was gathered via an http request to Autobot) and I adjusted the container-log size.
What I can tell is that opening the output log from the aborted tasks still triggers uwsgi consuming GBs of memory. We're hoping that adjusting memory consumption will prevent the tasks from failing and thus avoid the uwsgi problem.
Well, let's see.
Thanks,
Andreas
Just want to say that increasing the size allowed before log rotation does not solve the problem. It just decreases the likelihood that your run will coincide with or trigger a log rotation that causes the error. It's not a solution to this problem in any way.
Hi Braden,
I totally agree - this is a non-deterministic workaround, but in combination with the reduction of the generated logs we're running fine so far - not a single error since the change (before, we had a 50:50 chance). And in the absence of any alternatives, we are happy to have this one.
Do you know of any other ways to solve this?
Thanks,
Andreas
fixed in ansible/receptor#683
Not for me — I seem to still get only the first chunk of logs. Running AWX receptor version 1.4.1 against OpenShift 3.11's (I know, I know) Kubernetes 1.11.0+d4cacc0.
It is worth noting that the /tmp/receptor/awx-0/*/stdout files do not end with (or contain) an { "eof": true } marker in this case, indicating that Receptor at least understands that it has work remaining.
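A quick way to spot the affected work units (a sketch, assuming the /tmp/receptor/awx-0 directory layout mentioned above and that the marker text matches literally):
# Print stdout files that never received the EOF marker
for f in /tmp/receptor/awx-0/*/stdout; do
  grep -q '"eof": true' "$f" || echo "no EOF marker: $f"
done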
I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system.
Indeed. I seem to remember that at some point, AWX used to rely on RabbitMQ for this task?
This issue finally seems to be fixed for me with Kubernetes 1.29.
Summary
If a very large job is run, sometimes Kubernetes will start a new log, but AWX's reporting of job progress stalls. Eventually the job ends as an Error, even though the job itself continues to completion in the pod.
AWX version
19.4.0
Installation method
kubernetes
Modifications
yes
Ansible version
2.11.3.post0
Operating system
Azure Kubernetes Service 1.21.2
Web browser
Chrome
Steps to reproduce
We can tail the log of the pod in real time using kubectl (example command below); this window stalls until the job is complete and the pod exits.
Opening a new window resumes the output.
Note that this example is discontiguous, which implies that this may happen multiple times during a pod's life.
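For example (the namespace and pod name here are placeholders, not values from the original report):
kubectl -n awx logs -f automation-job-1234-abcde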
Expected results
Expected AWX to continue reporting the progress of the job, and noting its proper completion status (in this case, "Failed")
Actual results
The last item presented by AWX was the last entry in the log before it started anew.
Additional information
Ansible pod image is based on awx-ee:0.6.0 but adds things like helm, openssl, azure cli, zabbix-api, and other libraries that are used by various tasks
Installation was done via AWX Operator 0.14.0