coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Kernel freeze when running workloads on OKD #957

Closed spasche closed 2 years ago

spasche commented 3 years ago

Describe the bug

When running OKD, which uses Fedora CoreOS 34 on the nodes, the kernel sometimes freezes.

Original report on OKD bug tracker: https://github.com/openshift/okd/issues/864

Reproduction steps Steps to reproduce the behavior:

  1. Deploy OKD, version 4.7.0-0.okd-2021-08-22-163618
  2. Run workloads on the nodes, such as BuildConfig Pods

Expected behavior

System doesn't freeze

Actual behavior

The node VM consumes 100% CPU and doesn't respond to ping or to input from the console.

Unfortunately, the console doesn't show the full kernel panic message, it stops after the line: ------------[ cut here ]------------

I tried to retrieve logs using the netconsole kernel module, hoping I could get more information, but the result is the same.

Do you have any suggestions on how to get more data from the panic, if possible?
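(For readers following along, this is roughly what the netconsole setup mentioned above looks like, plus raising the console log level so more of the panic output has a chance of reaching it. The interface, IPs and MAC below are placeholders, not values from this report:)

```sh
# On the crashing node: stream kernel messages over UDP to another machine
# (placeholder device/addresses; adjust to your network).
sudo modprobe netconsole netconsole=6665@10.0.0.2/eth0,6666@10.0.0.1/aa:bb:cc:dd:ee:ff

# Don't let the console filter out lower-priority messages.
sudo sysctl -w kernel.printk="7 4 1 7"

# On the receiving machine: capture the UDP stream.
nc -u -l 6666    # or 'nc -u -l -p 6666', depending on the netcat variant
```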

System details

Ignition config

Since it's handled by OKD's machine-config operator, it's massive and might be difficult to sanitize.

depouill commented 3 years ago

Hi,

We have exactly the same problem here when updating to 4.7.0-0.okd-2021-08-22-163618. Bare-metal nodes suddenly freeze when starting pods, with messages like the following (they don't appear every time):

[Mon Sep 13 23:16:13 2021] ------------[ cut here ]------------
[Mon Sep 13 23:16:13 2021] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
[Mon Sep 13 23:16:13 2021] WARNING: CPU: 112 PID: 0 at kernel/sched/fair.c:401 enqueue_task_fair+0x26f/0x6a0
dustymabe commented 3 years ago

hey all - I've created some FCOS artifacts with a dev kernel build that reverts the kernel commit we think is the problem, and posted them over in the other kernel issue. Not sure how easy it is with OKD to switch out the base media, but maybe you can try with those artifacts or just use rpm-ostree to override replace the kernel with something like:

sudo systemctl stop zincati
sudo rpm-ostree override replace https://kojipkgs.fedoraproject.org//work/tasks/2324/75662324/kernel{,-core,-modules}-5.13.16-200.fc34.dusty.x86_64.rpm --reboot
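(A sketch of how such an override could later be undone once a fixed kernel ships, assuming the default package names:)

```sh
# Drop the kernel override and return to the packages from the FCOS base image;
# zincati (stopped above, not disabled) will start again on the next boot.
sudo rpm-ostree override reset kernel kernel-core kernel-modules
sudo systemctl reboot
```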
cgwalters commented 3 years ago

or just use rpm-ostree to override replace the kernel with something like:

Right, that seems like the much easier path.

spasche commented 3 years ago

Hello! Indeed, I was able to install the dev kernel using rpm-ostree override (the zincati service is not registered on the node). Unfortunately, I'm still seeing the same issue, without much information in the logs. Here are the last lines:

[  364.755096] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to linux-mm@kvack.org if you depend on this functionality.
[  409.135556] hyperkube[1539]: E0915 13:18:41.027242    1539 cadvisor_stats_provider.go:401] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod02e0ad00_d8e1_42da_9753_59ec5acb8871.slice/crio-f3c934d3dca3f597dffc49d04dd3dde32a8e5b07aff8892d2812b66e86e56cdd.scope/kubepods-burstable-pod02e0ad00_d8e1_42da_9753_59ec5acb8871.slice": RecentStats: unable to find data in memory cache]
[  439.850553] ------------[ cut here ]------------
dustymabe commented 3 years ago

hmm - would definitely be nice to get more output after the [ cut here ] bits. Too bad that's not showing up in the console.

spasche commented 3 years ago

Yes, definitely! That's quite strange.

bo0ts commented 3 years ago

We are also experiencing this issue on OKD-4.7.0-0.okd-2021-08-22-163618. Is there any possible workaround?

depouill commented 3 years ago

We are also experiencing this issue on OKD-4.7.0-0.okd-2021-08-22-163618. Is there any possible workaround?

We have downgraded the kernel to the one shipped with the previous OKD version (4.7.0-0.okd-2021-08-07-063045): rpm-ostree override replace https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel{,-core,-modules}-5.12.19-300.fc34.x86_64.rpm

bo0ts commented 3 years ago

@depouill Thanks, did you experience any issues with that downgrade so far and have you taken any steps to disable automatic updates (pausing machineconfigpools etc.)?

depouill commented 3 years ago

@depouill Thanks, did you experience any issues with that downgrade so far and have you taken any steps to disable automatic updates (pausing machineconfigpools etc.)?

Nodes have been stable since yesterday, and with the rpm-ostree override the machineconfig is OK (no need to pause the MCP). The cluster is green.
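(For anyone who does want to pause rollouts while pinning the kernel, a sketch of what that would look like, assuming the standard worker MachineConfigPool:)

```sh
# Pause the worker pool so the Machine Config Operator stops rolling out new configs.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# Unpause once the kernel issue is resolved.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
```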

m-yosefpor commented 3 years ago

@depouill Is the 5.12.19-300 version stable? Did you face this issue again since then?

depouill commented 3 years ago

@depouill Is the 5.12.19-300 version stable? Did you face this issue again since then?

No, since we downgraded to 5.12.19-300 the cluster has been working fine (for one week now). Note:

* we downgraded the worker nodes (not the master nodes) and, if I remember correctly, 5.12.19-300 was the kernel of 4.7.0-0.okd-2021-08-07-063045

* we are facing the same problem on bare-metal, KVM and OpenStack nodes

m-yosefpor commented 3 years ago

@depouill Is the 5.12.19-300 version stable? Did you face this issue again since then?

No, since we downgraded to 5.12.19-300 the cluster has been working fine (for one week now). Note:

* we downgraded the worker nodes (not the master nodes) and, if I remember correctly, 5.12.19-300 was the kernel of 4.7.0-0.okd-2021-08-07-063045

* we are facing the same problem on bare-metal, KVM and OpenStack nodes

Thanks for the info; so we should downgrade too.

We are also facing this issue on OpenStack nodes. After upgrading to 4.7.0-0.okd-2021-08-22-163618 (kernel 5.13.4-200), and even after upgrading to the latest patch, 4.7.0-0.okd-2021-09-19-013247 (kernel 5.13.12-200), the problem still persists.

spasche commented 3 years ago

Updated to latest 4.8 (4.8.0-0.okd-2021-10-10-030117) with kernel 5.13.13-200.fc34.x86_64 and the issue is still present (unfortunately, I still don't get any messages on the console).

jacksgt commented 3 years ago

Hey, I just wanted to chime in and say we are seeing the exact same problems (kernel freeze, nodes need to be rebooted by hypervisor) after upgrading to 4.7.0-0.okd-2021-09-19-013247. Our web server workloads were especially heavily affected, but we also had some infra nodes (logging, monitoring etc.) exhibit the same behavior. We are running OpenStack VMs.

Thanks to the instructions from depouill, we were able to temporarily mitigate the issue with kernel 5.12.19-300.

For a more permanent fix, we investigated how we could build our own OKD node images. Unfortunately, this was quite complicated and I documented the required steps here: https://blog.cubieserver.de/2021/building-a-custom-okd-machine-os-image/

baryluk commented 3 years ago

We upgraded a few days ago from 4.7.0-0.okd-2021-07-03-190901 to 4.8.0-0.okd-2021-10-10-030117 (with a temporary quick upgrade to 4.7.0-0.okd-2021-09-19-013247 before upgrading to 4.8.0), and we are now experiencing kernel bugs and node freezes that require a node reboot, or a hardware reboot if the node is completely unresponsive.

Sometimes it shows BUG messages, stuck tasks, RCU stalls, etc. Sometimes it just stops.

This is bare metal, on AMD EPYC 7502P.

I am attaching some logs, including kernel messages, from a few machines that experienced the issue.

okd-4.8.0_linux-5.13.13-200_issues.tar.gz

We will downgrade to kernel 5.12.19-300 and see if it helps, but it will be hard to confirm definitively, because the hangs/freezes are sporadic and not easily reproducible on demand.

aneagoe commented 3 years ago

As @baryluk mentioned, we've downgraded the kernel (to 5.12.7-300.fc34.x86_64) and everything is now stable for us. We went for the version that had no issues before the upgrade (4.7.0-0.okd-2021-07-03-190901). I left one node on kernel 5.13.13-200.fc34.x86_64 (running as a VM on top of Proxmox, same underlying hardware) and I was not able to reproduce the issue while stress-testing that node. This is interesting, since @depouill mentioned the issue also appeared on KVM instances, which is what Proxmox uses.

spasche commented 3 years ago

Updated OKD 4 to the version released yesterday (4.8.0-0.okd-2021-10-24-061736). Nodes were updated to kernel 5.14.9-200.fc34.x86_64. The freeze issue is still present.

hugoatease commented 2 years ago

Still have this issue on OKD version 4.8.0-0.okd-2021-10-24-061736, kernel 5.14.9-200.fc34.x86_64 and Fedora CoreOS 34.20211004.3.1.

@depouill's fix doesn't work for me anymore, as rpm-ostree fails with Multiple subdirectories found in: usr/lib/modules on override.

I tried to downgrade the kernel with

rpm-ostree override replace --remove=kernel --remove=kernel-core --remove=kernel-modules \
  --install=https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-5.12.19-300.fc34.x86_64.rpm \
  --install=https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-core-5.12.19-300.fc34.x86_64.rpm \
  --install=https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-modules-5.12.19-300.fc34.x86_64.rpm \
  https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel-5.12.19-300.fc34.x86_64.rpm

which fails with Base packages not marked to be removed: kernel kernel-core kernel-modules.

jlebon commented 2 years ago

@depouill's fix doesn't work for me anymore, as rpm-ostree fails with Multiple subdirectories found in: usr/lib/modules on override.

This is an rpm-ostree bug fixed in v2021.12. The testing release we're currently working on will have the fix. To try it out before then, you could do rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2021-b66a24701a.

I tried to downgrade the kernel with

Yeah, rpm-ostree is really strict about this. Doing a base package replacement is not the same as removing a base package and overlaying another.
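(To illustrate the difference, a sketch reusing the koji URLs from depouill's earlier comment: the base packages are replaced in place rather than removed and overlaid. On rpm-ostree older than v2021.12 this may still hit the usr/lib/modules bug mentioned above:)

```sh
# Replace the base kernel packages directly instead of --remove + --install.
sudo rpm-ostree override replace \
  https://kojipkgs.fedoraproject.org/packages/kernel/5.12.19/300.fc34/x86_64/kernel{,-core,-modules}-5.12.19-300.fc34.x86_64.rpm
```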

scrayos commented 2 years ago

For what it's worth, this issue is not limited to OKD/OpenShift. We're having the exact same problem with upstream Kubernetes (v1.21.6). We deploy the cluster with kubespray, and every 1-2 days the server just crashes. We've put absolutely no pods on the server (except for the DaemonSet pods that all nodes must host, like Calico). There is also no log output when this happens. The server just "stops".

Switching to the testing channel and upgrading to the latest version allowed me to install the kernel 5.12.19-300 that was recommended above, and it seems to have fixed our problems as well. There have been no crashes for 3 days now. If that changes, I'll post an update.

dustymabe commented 2 years ago

Hey @Scrayos (or anyone else). We would be overjoyed if someone could give us a reproducer for this (step-by-step instructions would be great). It sounds like you are saying you're not even deploying any applications, just running Kubernetes, and it's crashing for you?

scrayos commented 2 years ago

Hey @Scrayos (or anyone else). We would be overjoyed if someone could give us a reproducer for this (step-by-step instructions would be great). It sounds like you are saying you're not even deploying any applications, just running Kubernetes, and it's crashing for you?

@dustymabe Exactly. I only added the node to the cluster and it kept crashing every 1-2 days. These were the only pods on the node: [screenshot of the pod list]

So only networking and the prometheus node exporter. There was absolutely nothing else deployed on the node. The node was set up with kubespray.

So essentially I did this:

  1. I set up the node with a minimal Ignition config (`coreos-installer install /dev/nvme0n1 -s stable -I <url-to-ignition>`). The Ignition config only performs minor changes:
    1. adding my public key for the core user
    2. strengthening sshd_config
    3. setting the CRYPTO_POLICY to empty, as it collided with the sshd_config
    4. enabling periodic update windows for zincati
    5. disabling SysRq
    6. setting net.ipv4.conf.all.rp_filter=1 in sysctl as per kubespray's FCOS requirements
    7. setting up raid and mirrored boot disk (like in this official example)
    8. The biggest change is probably that we're disabling docker, like you recommended here.
  2. I run a very small Ansible playbook against this node that deploys a NetworkManager configuration for IPv6:
    
    [connection]
    id={{ interface_name }}
    uuid={{ interface_name | to_uuid }}
    type=ethernet
    interface-name={{ interface_name }}

    [ipv4]
    method=auto
    {% for subnet in ipv4_subnets %}
    address{{ loop.index }}={{ subnet | ansible.netcommon.ipsubnet }}
    {% endfor %}
    gateway={{ ipv4_gateway }}

    [ipv6]
    method=auto
    {% for subnet in ipv6_subnets %}
    address{{ loop.index }}={{ subnet | ansible.netcommon.ipsubnet }}
    {% endfor %}
    gateway={{ ipv6_gateway }}


  3. Then I run the `cluster.yml` playbook of kubespray.
  4. Because of the recent changes to fedora-modular, `cluster.yml` fails halfway through because we need cri-o, so I then run `sudo rpm-ostree ex module install cri-o:1.20/default`.
  5. After that, the `cluster.yml` playbook is executed again (successfully this time).

And that's about it. Then I just leave the server idle, and after 1-2 days it crashed three times in a row, always with abruptly ending logs (sorry, I only made screenshots):
![image](https://user-images.githubusercontent.com/2124642/140520304-85a3bd49-c51b-4c23-92d3-07c36f0cc210.png)
![image](https://user-images.githubusercontent.com/2124642/140520316-0a52ef72-1d00-4317-85e4-22ccb93ca0cc.png)

To summarize:
* we use kubespray for deployment
* we use cri-o as the container engine
* we use Calico for the networking
* we "use" MetalLB for load balancing (we've only deployed it for now, but not actively used, because the nodes were so unstable)
* our kubernetes cluster has version 1.21.6
* both errors occurred on FCOS version `34.20211004.3.1`

I hope any of this helps.
aneagoe commented 2 years ago

@Scrayos can you please provide full HW specs? Or at least CPU, MB, RAM and perhaps disks. For us, about one node would go down per day (AMD EPYC 7502P, Asus KRPA-U16, 512GB RAM, 2 x SAMSUNG MZQLW960HMJP-00003 960GB NVMe disks). Workload is mixed (Java, Python, Spark, to name a few). On a test node (VM) we were not able to reproduce this, but I'm trying to push some Java-based benchmark there soon in the hope of getting it to crash.

scrayos commented 2 years ago

@Scrayos can you please provide full HW specs? Or at least CPU, MB, RAM and perhaps disks. For us, about one node would go down per day (AMD EPYC 7502P, Asus KRPA-U16, 512GB RAM, 2 x SAMSUNG MZQLW960HMJP-00003 960GB NVMe disks). Workload is mixed (Java, Python, Spark, to name a few). On a test node (VM) we were not able to reproduce this, but I'm trying to push some Java-based benchmark there soon in the hope of getting it to crash.

@aneagoe Sure! It's this hetzner server with upgraded ECC RAM.

lucab commented 2 years ago

Looking at @baryluk's logs, this may be some race related to a side effect of accessing /proc/cpuinfo, which node_exporter reads quite frequently (kubelet and other tools do too, but possibly far less frequently). If that's the case:

* It seems to affect mostly AMD CPUs. That could be because of a vendor-specific path in the kernel, or just because those CPUs usually have a large number of cores.

aneagoe commented 2 years ago

I'm now running mixed Java workloads and I've also left `while true; do cat /proc/cpuinfo > /dev/null; done` running on the node. This is a VM with 8 cores/32GB running on top of an AMD EPYC 7502P, Asus KRPA-U16, 512GB RAM, 2 x SAMSUNG MZQLW960HMJP-00003. KVM is set to pass all CPU options (i.e. passthrough/host mode). I have not been able to reproduce a single crash yet... to me it looks stable on a VM. Unfortunately, I can't do this on a bare-metal node because they're all used in production.
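(If the /proc/cpuinfo theory holds, a heavier variant of that loop — one reader per core — might hit the suspected race more often on the big EPYC machines. Just a sketch, not a confirmed reproducer:)

```sh
# Hammer /proc/cpuinfo from every CPU in parallel.
for i in $(seq "$(nproc)"); do
  ( while true; do cat /proc/cpuinfo > /dev/null; done ) &
done
wait   # Ctrl-C to stop all readers
```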

dustymabe commented 2 years ago

@Scrayos - unfortunately I don't have access to Hetzner. Do you think there's any chance this would reproduce with the bare-metal instances from AWS?

Also, you've given a lot of detail about your Ignition config (Thanks!). Any chance you could share it (or preferably the Butane version of it) with anything redacted that you didn't want to share?

scrayos commented 2 years ago

@dustymabe - Sure! I've actually got a base.bu and multiple extensions for the different server setups that reference the ignition of base.bu with the ignition.config.merge directive. To keep it simple, I've manually merged their values:

The butane file ```yml variant: 'fcos' version: '1.4.0' boot_device: # configure boot device mirroring for additional fault tolerance and robustness mirror: devices: - '/dev/nvme0n1' - '/dev/nvme1n1' storage: disks: # create and partition both of the drives identically - device: '/dev/nvme0n1' partitions: - label: 'root-1' # set the size to twice the recommended minimum size_mib: 16384 start_mib: 0 - label: 'var-1' - device: '/dev/nvme1n1' partitions: - label: 'root-2' # set the size to twice the recommended minimum size_mib: 16384 start_mib: 0 - label: 'var-2' raid: # add both of the var drives to a common raid for additional fault tolerance and robustness - name: 'md-var' level: 'raid1' devices: - '/dev/disk/by-partlabel/var-1' - '/dev/disk/by-partlabel/var-2' filesystems: # mount /var with the raid instead of individual hard drives - path: '/var' device: '/dev/md/md-var' format: 'xfs' wipe_filesystem: true with_mount_unit: true files: # configure strict defaults for ssh connections and negotiation algorithms - path: '/etc/ssh/sshd_config' mode: 0600 overwrite: true contents: inline: | # chroot sftp into its area and perform additional logging Subsystem sftp internal-sftp -f AUTHPRIV -l INFO # keep connections active ClientAliveInterval 30 ClientAliveCountMax 2 # disable unecessary rsh-support UseDNS no # do not let root in - core is much more uncommon PermitRootLogin no AllowUsers core # log key fingerprint on login, so we know who did what LogLevel VERBOSE # set log facility to authpriv so log access needs elevated permissions SysLogFacility AUTHPRIV # re-negotiate session key after either 500mb or one hour ReKeyLimit 500M 1h # only allow public-keys PubKeyAuthentication yes PasswordAuthentication no ChallengeResponseAuthentication no AuthenticationMethods publickey # set stricter login limits LoginGraceTime 30 MaxAuthTries 2 MaxSessions 5 MaxStartups 10:30:100 # adjust algorithmus Ciphers chacha20-poly1305@openssh.com,aes128-gcm@openssh.com,aes256-gcm@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr HostKeyAlgorithms sk-ssh-ed25519@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519,rsa-sha2-512,rsa-sha2-256 KexAlgorithms curve25519-sha256,diffie-hellman-group18-sha512,diffie-hellman-group16-sha512,diffie-hellman-group14-sha256,curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256 MACs umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com # adjust pluggable authentication modules # pam sends last login + coreos message UsePAM yes PrintLastLog no PrintMotd no # add ignition and afterburn keys to the allowed directories AuthorizedKeysFile .ssh/authorized_keys .ssh/authorized_keys.d/ignition .ssh/authorized_keys.d/afterburn # include the drop-in configurations Include /etc/ssh/sshd_config.d/*.conf # clear default crypto policy as we define it in ssh config manually - path: '/etc/sysconfig/sshd' mode: 0640 overwrite: true contents: inline: | CRYPTO_POLICY= # perform updates only in allowed time frames, so we don't have surprise downtimes - path: '/etc/zincati/config.d/55-updates-strategy.toml' mode: 0644 contents: inline: | [updates] strategy = "periodic" [[updates.periodic.window]] days = [ "Fri" ] start_time = "02:00" length_minutes = 60 # disable SysRq keys, so they won't be accidentally pressed (and we cannot use them anyways) - path: '/etc/sysctl.d/90-sysrq.conf' contents: inline: | kernel.sysrq = 0 # enable reverse path 
filtering for ipv4. necessary for calico (kubespray) - path: '/etc/sysctl.d/reverse-path-filter.conf' contents: inline: | net.ipv4.conf.all.rp_filter=1 directories: # delete all contents of the default sshd drop-ins and overwrite folder - path: '/etc/ssh/sshd_config.d' overwrite: true mode: 0700 user: name: 'root' group: name: 'root' systemd: units: # disable docker to use cri-o (see https://github.com/coreos/fedora-coreos-tracker/issues/229) - name: 'docker.service' mask: true passwd: users: # configure authentication - name: 'core' ssh_authorized_keys: - '{myPublicKey}' ```

I can't say anything regarding the bare-metal instances from AWS though, as I've never used AWS before. But it's certainly possible, because I doubt that everyone here uses Hetzner and yet we all have the same problem, so it's unlikely that this is related to Hetzner's hardware or setup.

aneagoe commented 2 years ago

@Scrayos The issue seems to have been fixed; see this comment: https://github.com/coreos/fedora-coreos-tracker/issues/940#issuecomment-966921015. It would be great if you could also test this and confirm the same. ATM I don't have any spare bare-metal to try it on :(

scrayos commented 2 years ago

@Scrayos The issue seems to have been fixed; see this comment: #940 (comment). It would be great if you could also test this and confirm the same. ATM I don't have any spare bare-metal to try it on :(

I've now re-ignited the node with the newest kernel (5.14.14-200.fc34.x86_64) and FCOS version (34.20211031.3.0). We'll know in a few days whether the server is stable now. :laughing:

spasche commented 2 years ago

I updated OKD to version 4.8.0-0.okd-2021-11-14-052418, which ships with kernel 5.14.14-200.fc34.x86_64. I'm not able to reproduce any freeze with the workload that was causing issues. Seems to be quite stable 🥳. If it's also stable for others, I guess we can close this.

scrayos commented 2 years ago

The node is running for roughly 3 days now and there was no crash so far. Seems like it's fixed for me as well! :tada:

spasche commented 2 years ago

Thanks for the feedback. I'll close it then.

dustymabe commented 2 years ago

Thanks all for collaborating and helping us to find when this issue was fixed. I wish we could narrow it down to a particular kernel commit that fixed the problem, but the fact that it's fixed in 34.20211031.3.0 and later should suffice.

gialloguitar commented 2 years ago

The issue is still present with the 5.14.9-200.fc34.x86_64 kernel on OKD 4.8.

aneagoe commented 2 years ago

@gialloguitar that's expected, see https://github.com/coreos/fedora-coreos-tracker/issues/957#issuecomment-950878770. Kernel version 5.14.14-200.fc34.x86_64 from OKD 4.8 version 4.8.0-0.okd-2021-11-14-052418 works just fine though.
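(A quick way to verify which kernel each node actually ended up on after the update, assuming oc access to the cluster:)

```sh
# The wide node listing includes a KERNEL-VERSION column.
oc get nodes -o wide
```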

jacksgt commented 2 years ago

Indeed, I can confirm that 5.14.14-200.fc34.x86_64 from 4.8.0-0.okd-2021-11-14-052418 works fine on several of our clusters.