Hi,
We have exactly the same problem here when updating to 4.7.0-0.okd-2021-08-22-163618. Bare-metal nodes suddenly freeze when starting pods, with messages like the following (they do not appear every time):
[Mon Sep 13 23:16:13 2021] ------------[ cut here ]------------
[Mon Sep 13 23:16:13 2021] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
[Mon Sep 13 23:16:13 2021] WARNING: CPU: 112 PID: 0 at kernel/sched/fair.c:401 enqueue_task_fair+0x26f/0x6a0
hey all - I've created some FCOS artifacts with a dev kernel build that reverts the commit we think is the problem, and posted them over in the other kernel issue. Not sure how easy it is with OKD to switch out the base media, but maybe you can try those artifacts, or just use rpm-ostree to override replace the kernel with something like:
```
sudo systemctl stop zincati
sudo rpm-ostree override replace https://kojipkgs.fedoraproject.org//work/tasks/2324/75662324/kernel{,-core,-modules}-5.13.16-200.fc34.dusty.x86_64.rpm --reboot
```
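If you do try the override, it can be sanity-checked and later undone roughly like this (a sketch; `override reset` is the usual way back to the kernel shipped in the OS image):

```bash
# Confirm the replaced kernel packages are listed as overrides in the booted deployment
rpm-ostree status

# Later, to return to the kernel shipped with the OS image
sudo rpm-ostree override reset kernel kernel-core kernel-modules --reboot
```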
> or just use rpm-ostree to override replace the kernel with something like:
Right, that seems like the much easier path.
Hello,
Indeed, I was able to install the dev kernel using rpm-ostree override (the zincati service is not registered on the node).
Unfortunately, I'm still seeing the same issue, without much information in the logs. Here are the last lines:
[ 364.755096] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[ 409.135556] hyperkube[1539]: E0915 13:18:41.027242 1539 cadvisor_stats_provider.go:401] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod02e0ad00_d8e1_42da_9753_59ec5acb8871.slice/crio-f3c934d3dca3f597dffc49d04dd3dde32a8e5b07aff8892d2812b66e86e56cdd.scope/kubepods-burstable-pod02e0ad00_d8e1_42da_9753_59ec5acb8871.slice": RecentStats: unable to find data in memory cache]
[ 439.850553] ------------[ cut here ]------------
hmm - would definitely be nice to get more output after the `[ cut here ]` bits. Too bad that's not showing up in the console.
Yes, definitely! That's quite strange.
We are also experiencing this issue on OKD-4.7.0-0.okd-2021-08-22-163618. Is there any possible workaround?
We have downgraded the kernel to the one shipped with the previous OKD version (4.7.0-0.okd-2021-08-07-063045):
```
rpm-ostree override replace https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel{,-core,-modules}-5.12.19-300.fc34.x86_64.rpm
```
@depouill Thanks, did you experience any issues with that downgrade so far and have you taken any steps to disable automatic updates (pausing machineconfigpools etc.)?
Nodes have been stable since yesterday, and with the rpm-ostree override the machineconfig stays OK (no need to pause the MCPs). The cluster is green.
@depouill Is 5.12.19-300 stable for you? Did you face this issue again since then?
No, since we downgraded to 5.12.19-300 the cluster has been working fine (one week so far). Notes:
* we downgraded the worker nodes only (not the master nodes); if I remember correctly, 5.12.19-300 was the kernel of 4.7.0-0.okd-2021-08-07-063045
* we are facing the same problem on bare-metal, KVM and OpenStack nodes
Thanks for the info; so we should downgrade too.
We are also facing this issue on OpenStack nodes after upgrading to 4.7.0-0.okd-2021-08-22-163618 (kernel 5.13.4-200). Even after upgrading to the latest patch, 4.7.0-0.okd-2021-09-19-013247 (kernel 5.13.12-200), the problem still persists.
Updated to latest 4.8 (4.8.0-0.okd-2021-10-10-030117) with kernel 5.13.13-200.fc34.x86_64 and the issue is still present (unfortunately, I still don't get any messages on the console).
Hey,
I just wanted to chime in and say we are seeing the exact same problems (kernel freeze, nodes need to be rebooted by the hypervisor) after upgrading to 4.7.0-0.okd-2021-09-19-013247. Our web server workloads were especially heavily affected, but we also had some infra nodes (logging, monitoring etc.) exhibit the same behavior. We are running OpenStack VMs.
Thanks to the instructions from depouill, we were able to temporarily mitigate the issue with kernel 5.12.19-300.
For a more permanent fix, we investigated how we could build our own OKD node images. Unfortunately, this was quite complicated, so I documented the required steps here: https://blog.cubieserver.de/2021/building-a-custom-okd-machine-os-image/
We upgraded a few days ago from 4.7.0-0.okd-2021-07-03-190901 to 4.8.0-0.okd-2021-10-10-030117 (with a temporary quick upgrade to 4.7.0-0.okd-2021-09-19-013247 before upgrading to 4.8.0), and we are now experiencing kernel bugs and node freezes requiring a node reboot, or a hardware reset if the node is completely unresponsive.
Sometimes it shows BUG messages, stuck tasks, RCU stalls, etc. Sometimes it just stops.
This is bare metal, on AMD EPYC 7502P.
I am attaching some logs, including kernel output, from a few machines that experienced the issue:
okd-4.8.0_linux-5.13.13-200_issues.tar.gz
We will downgrade to kernel 5.12.19-300 and see if it helps, but it will be hard to confirm definitively, because the hangs/freezes are sporadic and not easily reproducible on demand.
As @baryluk mentioned, we've downgraded the kernel (to 5.12.7-300.fc34.x86_64) and everything is now stable for us. We went for the version that had no issues before the upgrade (4.7.0-0.okd-2021-07-03-190901).
I left one node on kernel 5.13.13-200.fc34.x86_64 (running as a VM on top of Proxmox, same underlying hardware) and I was not able to reproduce the issue while stress-testing that node. This is interesting since @depouill mentioned the issue also appeared on KVM instances, which is what Proxmox uses.
Updated OKD 4 to the version released yesterday (4.8.0-0.okd-2021-10-24-061736). Nodes were updated to kernel 5.14.9-200.fc34.x86_64. The freeze issue is still present.
Still have this issue on OKD version 4.8.0-0.okd-2021-10-24-061736, kernel 5.14.9-200.fc34.x86_64 and Fedora CoreOS 34.20211004.3.1.
@depouill's fix doesn't work for me anymore, as rpm-ostree fails with `Multiple subdirectories found in: usr/lib/modules` on the override.
I tried to downgrade the kernel with
```
rpm-ostree override replace --remove=kernel --remove=kernel-core --remove=kernel-modules --install=https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-5.12.19-300.fc34.x86_64.rpm --install=https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-core-5.12.19-300.fc34.x86_64.rpm --install=https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-modules-5.12.19-300.fc34.x86_64.rpm https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-5.12.19-300.fc34.x86_64.rpm
```
which fails with `Base packages not marked to be removed: kernel kernel-core kernel-modules`.
> @depouill's fix doesn't work for me anymore, as rpm-ostree fails with `Multiple subdirectories found in: usr/lib/modules` on the override.

This is an rpm-ostree bug fixed in v2021.12. The `testing` release we're currently working on will have the fix. To try it out before then, you could do `rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2021-b66a24701a`.
> I tried to downgrade the kernel with …
Yeah, rpm-ostree is really strict about this. Doing a base package replacement is not the same as removing a base package and overlaying another.
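In other words, a plain `override replace` (the same form as in the earlier downgrade comment) is the path to take here; a sketch, assuming an rpm-ostree with the v2021.12 fix mentioned above:

```bash
# Replace the base kernel packages in one step; no --remove/--install needed
# for packages that are already part of the base image.
# Requires the rpm-ostree fix from v2021.12 to avoid the
# "Multiple subdirectories found in: usr/lib/modules" error quoted above.
sudo rpm-ostree override replace \
  https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel{,-core,-modules}-5.12.19-300.fc34.x86_64.rpm
```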
For what it's worth, this issue is not limited to OKD/OpenShift. We're having the exact same problem with upstream Kubernetes (v1.21.6). We deploy the cluster with kubespray and every 1-2 days the server just crashes. We've put absolutely no pods on the server (except for the DaemonSet pods that all nodes must host, like Calico). There is also no log output when this happens. The server just "stops".
Switching to the testing channel and upgrading to the latest version allowed me to install kernel 5.12.19-300, which was recommended above, and it seems to have fixed our problems as well. There have been no crashes for 3 days now. If that changes, I'll post an update.
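For anyone else wanting to do the same, switching an FCOS node to the testing stream is normally a rebase along these lines (a sketch; the ref spelling follows the FCOS docs and may differ on your setup):

```bash
# Rebase the node to the FCOS testing stream and reboot into it
sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/testing --reboot
```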
Hey @Scrayos (or anyone else). We would be overjoyed if someone could give us a reproducer for this (step-by-step instructions would be great). It sounds like you're saying you're not even deploying any applications, just running Kubernetes, and it's crashing for you?
@dustymabe Exactly. I only included the node into the cluster and it kept crashing every 1-2 days. These were the only pods on the node:
So only networking and the prometheus node exporter. There was absolutely nothing else deployed on the node. The node was set up with kubespray.
So essentially I did this:
1. Install FCOS with `coreos-installer install /dev/nvme0n1 -s stable -I <url-to-ignition>`.
2. The ignition only performs minor changes (see the spot-check sketch after this list):
   * a custom `sshd_config`
   * `CRYPTO_POLICY` set to empty, as it collided with the `sshd_config`
   * `net.ipv4.conf.all.rp_filter=1` in sysctl, as per kubespray's FCOS requirements
   * a NetworkManager connection profile:
```ini
[connection]
id={{ interface_name }}
uuid={{ interface_name | to_uuid }}
type=ethernet
interface-name={{ interface_name }}

[ipv4]
method=auto
{% for subnet in ipv4_subnets %}
address{{ loop.index }}={{ subnet | ansible.netcommon.ipsubnet }}
{% endfor %}
gateway={{ ipv4_gateway }}

[ipv6]
method=auto
{% for subnet in ipv6_subnets %}
address{{ loop.index }}={{ subnet | ansible.netcommon.ipsubnet }}
{% endfor %}
gateway={{ ipv6_gateway }}
```
3. Then I run the `cluster.yml` playbook of kubespray
4. Because of the recent changes to fedora-modular, `cluster.yml` fails halfway through because we need cri-o, so I then run `sudo rpm-ostree ex module install cri-o:1.20/default`.
5. After that, the `cluster.yml` playbook is executed again (successfully this time).
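For reference, the spot-checks mentioned in step 2 would look roughly like this (paths are Fedora defaults, assumed rather than taken from the actual Ignition config):

```bash
# Rough spot-checks on a freshly ignited node for the changes listed in step 2
grep CRYPTO_POLICY /etc/sysconfig/sshd   # expect an empty value: "CRYPTO_POLICY="
sysctl net.ipv4.conf.all.rp_filter       # expect "= 1" per kubespray's requirement
nmcli connection show                    # the templated ethernet profile should be listed
```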
And that's about it. Then I just left the server idle, and it crashed three times in a row, each time after 1-2 days. The logs always end abruptly (sorry, I only made screenshots):
![image](https://user-images.githubusercontent.com/2124642/140520304-85a3bd49-c51b-4c23-92d3-07c36f0cc210.png)
![image](https://user-images.githubusercontent.com/2124642/140520316-0a52ef72-1d00-4317-85e4-22ccb93ca0cc.png)
To summarize:
* we use kubespray for deployment
* we use cri-o as the container engine
* we use Calico for the networking
* we "use" MetalLB for load balancing (we've only deployed it for now, but it's not actively used, because the nodes were so unstable)
* our Kubernetes cluster is on version 1.21.6
* both errors occurred on FCOS version `34.20211004.3.1`
I hope any of this helps.
@Scrayos can you please provide full HW specs? Or at least CPU, motherboard, RAM and perhaps disks. For us, we would see about one node go down per day (AMD EPYC 7502P, Asus KRPA-U16, 512 GB RAM, 2 x SAMSUNG MZQLW960HMJP-00003 960 GB NVMe disks). The workload is mixed (Java, Python, Spark, to name a few). On a test node (VM) we were not able to reproduce this, but I'm trying to push some Java-based benchmark there soon in the hope of getting it to crash.
@aneagoe Sure! It's this Hetzner server with upgraded ECC RAM:
* AMD Ryzen 5 3600 6-Core Processor
* ASUS Pro WS 565-ACE
* Samsung M391A4G43AB1-CVF (DDR4 ECC, 64 GB in total)
* SAMSUNG MZVL2512HCJQ-00B00 (NVMe, 1024 GB in total)

Looking at @baryluk's logs, this may be some race related to a side-effect of accessing `/proc/cpuinfo`, which `node_exporter` reads quite frequently (kubelet and other tools do too, but possibly far less frequently).
If that's the case:
* preventing the `node_exporter` pod from being deployed on a node may result in less frequent freezes
* reading `/proc/cpuinfo` in a loop may be able to trigger the same kind of freezes, outside of k8s/okd (see the sketch below)

It seems to affect mostly AMD CPUs. It could be because of a vendor-specific path in the kernel, or just because those CPUs usually have a large number of cores.
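Based on that hypothesis, an untested stand-alone reproducer might look like this (purely a sketch, not a confirmed trigger):

```bash
# Untested sketch: hammer /proc/cpuinfo with one reader per CPU, outside of k8s,
# to check whether concurrent reads alone can trigger the freeze.
for i in $(seq "$(nproc)"); do
  ( while true; do cat /proc/cpuinfo > /dev/null; done ) &
done
wait
```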
I'm now running mixed Java workloads and have also left `while true; do cat /proc/cpuinfo > /dev/null; done` running on the node. This is a VM with 8 cores/32 GB running on top of AMD EPYC 7502P, Asus KRPA-U16, 512 GB RAM, 2 x SAMSUNG MZQLW960HMJP-00003. KVM is set to pass through all CPU options (i.e. passthrough/host mode). I have not been able to reproduce a single crash yet... to me it looks stable on a VM. Unfortunately, I can't do this on a bare-metal node because they're all used in production.
@Scrayos - unfortunately I don't have access to Hetzner. Do you think there's any chance this would reproduce on the bare metal instances from AWS?
Also, you've given a lot of detail about your Ignition config (Thanks!). Any chance you could share it (or preferably the Butane version of it) with anything redacted that you didn't want to share?
@dustymabe - Sure! I've actually got a `base.bu` and multiple extensions for the different server setups that reference the ignition of `base.bu` with the `ignition.config.merge` directive. To keep it simple, I've manually merged their values:
I can't say anything about the bare metal instances from AWS though, as I've never used AWS before. But it's certainly possible: I doubt everyone here uses Hetzner yet we all have the same problem, so it's unlikely that this is related to Hetzner's hardware or setup.
@Scrayos The issue seems to have been fixed, at least tentatively; see this comment: https://github.com/coreos/fedora-coreos-tracker/issues/940#issuecomment-966921015. It would be great if you could also test this and confirm. ATM I don't have any spare bare metal to try it on :(
I've now re-ignited the node with the newest kernel (5.14.14-200.fc34.x86_64) and FCOS version (34.20211031.3.0). We'll know in a few days whether the server is stable now. :laughing:
I updated OKD to version 4.8.0-0.okd-2021-11-14-052418, which ships with kernel 5.14.14-200.fc34.x86_64. I'm not able to reproduce any freeze with the workload that was causing issues. Seems to be quite stable 🥳. If it's also stable for others, I guess we can close this.
The node is running for roughly 3 days now and there was no crash so far. Seems like it's fixed for me as well! :tada:
Thanks for the feedback. I'll close it then.
Thanks all for collaborating and helping us find when this issue was fixed. I wish we could narrow it down to the particular kernel commit that fixed the problem, but the fact that it's fixed in 34.20211031.3.0 and later should suffice.
The issue is still present in the 5.14.9-200.fc34.x86_64 kernel for OKD 4.8.
@gialloguitar that's expected, see https://github.com/coreos/fedora-coreos-tracker/issues/957#issuecomment-950878770. Kernel version 5.14.14-200.fc34.x86_64 from OKD 4.8 version 4.8.0-0.okd-2021-11-14-052418 works just fine though.
Indeed, I can confirm that 5.14.14-200.fc34.x86_64 from 4.8.0-0.okd-2021-11-14-052418 works fine on several of our clusters.
Describe the bug
When running OKD, which uses Fedora CoreOS 34 on the nodes, the kernel is sometimes freezing.
Original report on OKD bug tracker: https://github.com/openshift/okd/issues/864
Reproduction steps
Steps to reproduce the behavior:
Expected behavior
System doesn't freeze
Actual behavior
The node VM consumes 100% CPU and doesn't respond to ping or to input on the console.
Unfortunately, the console doesn't show the full kernel panic message, it stops after the line:
------------[ cut here ]------------
I tried to retrieve logs using the netconsole kernel module, hoping I could get more information, but the result is the same.
Do you have any suggestions on how to get more data from the panic, if possible?
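For reference, a netconsole setup generally looks something like the following; the interface, addresses, ports, and MAC below are placeholders, not the values actually used here:

```bash
# Stream kernel messages over UDP to a remote host (placeholder values);
# parameter format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-mac]
sudo modprobe netconsole netconsole=6665@10.0.0.2/ens18,6666@10.0.0.1/aa:bb:cc:dd:ee:ff

# On the receiving host, listen for the UDP stream:
nc -u -l 6666
```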
System details
Kernel 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Ignition config
Since it's handled by OKD / machine operator, it's massive and might be difficult to sanitize.