Thank you for reporting, @seh, we'll have a look. Do you think you could try to find a reliable repro case for this issue?
That is going to be very difficult, as so far it amounts to, "Run this Kubernetes cluster with this workload."
We have confirmed that downgrading to the Flatcar Container Linux beta version 2705.1.2 alleviates the problem. Again, versions 2765.1.0 and 2801.1.0 both suffer this same kernel bug.
Looking at the workload that runs on all the machines on which we've seen this occur, we identified only three in common:
We disabled Vector and proved that that was not the culprit. It wasn't feasible to disable "calico-node" and still have a functional Kubernetes cluster. (Swapping a CNI implementation in a production-grade cluster is a delicate operation.) We did not get as far as disabling Prometheus node exporter, though we're running it on every machine in several other Kubernetes clusters—that just happen to be less busy—so it's not likely it's at fault.
I neglected to mention earlier that in our clusters where Flatcar Container Linux's locksmithd service is enabled, we don't see this bug arise. In our clusters where update-engine is enabled but locksmithd is disabled, the bug occurs on 10-15 out of 200 machines every day.
Interesting, thank you for sharing. While we're still looking for a solid repro the information you've provided will help with narrowing down the issue.
I mentioned that on our machines where both update-engine and locksmithd are enabled, we don't see this kernel bug arise. However, I did notice something odd in the system logs on those machines.
I've been polling our machines regularly via SSH, running a command like journalctl --grep=skbuff and collecting the output, in order to see how often and on which machines the kernel bug has been occurring. On some of the machines with locksmithd enabled, I see output from that command like this:
Notice how it keeps flapping between two different IDs. What does that indicate?
Are these machines in-place-upgraded from CoreOS or were they freshly installed with Flatcar?
These are fresh "installations" on EC2 instances by way of the published AMIs.
Hello. We've noticed the same behavior here in our environment on VMware-provisioned VMs.
A few times a day, the box that happens to be the busiest in terms of network load crashes with the following:
May 19 18:30:45 rnqkbm401 kernel: kernel BUG at net/core/skbuff.c:4008!
-- Boot 757516b861de4db8a139aa895db71803 --
May 19 18:31:02 localhost kernel: Linux version 5.10.37-flatcar (build@pony-truck.infra.kinvolk.io) (x86_64-cros-linux-gnu-gcc (Gentoo Hardened 9.3.0-r1 p3) 9.3.0, GNU ld (Gentoo 2.35 p1) 2.35.0) #1 SMP Mon May 17 22:08:55 -00 2021
Happens on 5.10.32-flatcar as well. pstore is empty too.
Less loaded VMs (in terms of network I/O), even running the same kernel, don't crash. We also run Kubernetes on them with Calico as our CNI (with eBPF mode enabled).
We have another Kubernetes cluster with the same setup, but it's still running an older kernel (e.g., Flatcar Container Linux by Kinvolk 2605.10.0 (Oklo), 5.4.83-flatcar), and there are no crashes at all.
Just a quick update: since disabling GSO and GRO on the box, it hasn't crashed in four days. We're still monitoring it, but that's already a great sign; it used to crash every day or every other day.
Almost 7 days now without crashing since GSO/GRO got disabled.
How did you disable those, Igor?
ethtool -K <iface name> gso off
ethtool -K <iface name> gro off
where <iface name> is your network interface, e.g. eth0.
We have a systemd unit now to disable it on boot.
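Roughly, such a boot-time unit can look like the sketch below; the unit name, the eth0 interface, and the ethtool path are placeholders to adapt, not our exact unit.

# /etc/systemd/system/disable-offload.service (illustrative sketch)
[Unit]
Description=Disable GSO and GRO offload on eth0
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
# ethtool's path may differ per distribution.
ExecStart=/usr/sbin/ethtool -K eth0 gso off
ExecStart=/usr/sbin/ethtool -K eth0 gro off

[Install]
WantedBy=multi-user.target

Enable it once with sudo systemctl enable --now disable-offload.service so it runs on every boot.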
Is this still occurring with the most recent releases? Could you also test Alpha, which has kernel 5.15 and might no longer trigger this?
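For reference, switching an existing machine's update channel to Alpha should roughly amount to the following; this is from memory of the Flatcar docs, and on older installations the file may live at /etc/coreos/update.conf instead.

# Overwrites the file; merge with any settings you already have there.
echo 'GROUP=alpha' | sudo tee /etc/flatcar/update.conf
sudo systemctl restart update-engine
sudo update_engine_client -check_for_update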
We've seen it most recently two weeks ago with kernel version 5.10.84, which we received by way of an upgrade when rebooting one of our machines that started life with Flatcar version 2705.1.2.
It may be another couple of weeks before I can offer any testing outcome. What changed recently that you think may alleviate this problem?
Nothing, @seh; I was just hoping it might have resolved itself.
Would you be able to capture the full splat on the serial console, including the stacktrace?
Next time I see it, I will grab all I can from journalctl. Is there another source you’re recommending that I collect as well?
In case your system has a pstore backend, you may find dmesg traces in /var/lib/systemd/pstore/ on the next boot. The files get moved there for persistent storage instead of staying in /sys/fs/pstore. I'll update the docs (Edit: done here https://github.com/flatcar-linux/flatcar-docs/pull/206).
Note that I mentioned in my initial description that our pstore directory winds up empty after these reboots, perhaps for lack of hardware support.
Maybe, but it could be that systemd-pstore.service ran and moved them to /var/lib/systemd/pstore; that's what I wanted to hint at.
Edit: check whether you have pstore support by looking at whether /sys/module/pstore/parameters/backend contains something other than (null).
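In other words, a quick check using the paths above:

cat /sys/module/pstore/parameters/backend   # prints (null) when no pstore backend is available
ls /sys/fs/pstore                           # traces appear here right after the crash
ls /var/lib/systemd/pstore                  # systemd-pstore.service moves them here on the next boot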
@igcherkaev
ethtool -K <iface name> gso off
ethtool -K <iface name> gro off
where <iface name> is your network interface, e.g. eth0.
We have a systemd unit now to disable it on boot.
Is this still working for you?
This is still happening to us with Flatcar Container Linux version 3227.1.1.
This still seems to be happening in 3033.3.5
Worse, I don't seem to be able to apply the workaround:
$ sudo ethtool -K eth0 gso off
Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot get device udp-fragmentation-offload settings: Operation not supported
We suffered through this bug overnight and into this morning, and we've found that the workaround suggested by @igcherkaev in https://github.com/flatcar/Flatcar/issues/378#issuecomment-847189739 works acceptably, so long as we're using Flatcar Container Linux with a kernel at version 5.15 or so. A kernel that new wasn't enough without disabling generic receive and segmentation offload, and disabling that offload wasn't enough without a new enough kernel. In particular, kernel version 5.10.137, as offered by the LTS 3033.3.5 release, wasn't new enough.
Here are the two systemd units I wrote to ensure that we toggle the offload off.
Using the stable Flatcar Container Linux version 3227.2.2 atop kernel version 5.15.63, we see this kernel bug occur in file net/core/skbuff.c on line 4219, where the check on list_skb->head_frag fails because that (bit)field is false.
If we disable GRO and GSO (we're not yet sure whether it's crucial to disable both), we skirt this kernel bug, but network performance suffers so drastically that we can't afford to run our workload like that.
Can we engage the upstream maintainers despite not having a clear trace, and has anyone started that discussion? The source code referenced by the BUG at net/core/skbuff.c:123! messages, together with the workarounds, may already give some hints.
We noticed that when running with GRO and GSO enabled again, with the MTU on the eth0 interface ratcheted down from the default 9,001 to 1,500, this time using Flatcar Container Linux beta version 3346.1.0 and kernel version 5.15.70 atop the "m5.4xlarge" EC2 instance type, a different problem arises: instead of the kernel reporting through the BUG_ON macro and rebooting, it reports a hardware checksum failure and keeps going, albeit with degraded network performance afterward.
Please see this log fragment for an example.
Here are CPU details via lscpu:
On this machine, the pstore facility remains unavailable to us.
Running with GRO and GSO enabled and the MTU for the eth0 interface back up at 9,000, this time using Flatcar Container Linux version 3227.2.2 and kernel version 5.15.63 atop the "z1d.12xlarge" EC2 instance type, the kernel bug does arise, but now we're getting more diagnostic output, per the following log fragment.
Thanks @seh, these kinds of logs are enough to start a discussion on lkml. I'll start a thread. Just to be sure I have all the facts straight: this is using ENA?
If by ENA you mean Elastic Network Adapter, then I think the answer is yes. We didn't do anything deliberate to choose that, but running modinfo ena shows that the module is installed.
My colleague @nbourikas disabled panic upon softlockup and was able to capture a more detailed failure trace atop kernel version 5.15.70 and Calico version 3.21.5.
@tomastigera and @fasaxc, note this frame in that call stack, in between tcp_gso_segment and inet_gso_segment:
bpf_prog_2e6f5613f50238c5_calico_to_host_ep+0xa40/0x2cc8
@seh, would you be able to test with calico 3.23? This PR https://github.com/projectcalico/calico/pull/5753/files makes calico stop changing gso_size on vxlan decapsulation, which lkml suggests might be the cause (https://lore.kernel.org/netdev/194f6b02-8ee7-b5d7-58f3-6a83b5ff275d@gmail.com/).
Thank you for the suggestion. Yes, we've been testing Calico version 3.23.3 over the last couple of days together with Flatcar's beta version 3346.1.0. So far, we haven't been hitting this kernel bug. I'll have more confidence after another day or two of testing.
Apparently our testing did not tell the full story. It was a late one last night.
We've now seen this same kernel failure occur using Calico 3.23.2 with Flatcar Container Linux 3346.1.0 (kernel version 5.15.70) and Ubuntu 22.04.1 ("Jammy Jellyfish") (kernel version 5.15.0). The line number in file skbuff.c moves by one from 4218 to 4217 in the Ubuntu image. Disabling GRO and GSO again alleviates the rebooting problem for the moment, still at great cost for network performance.
That confirms for us that the problem is not specific to Flatcar Container Linux, but it does seem to be related to Calico's eBPF data plane.
Same issue (kernel BUG at net/core/skbuff.c:4082) on Red Hat Enterprise Linux release 8.6 (Ootpa) with kernel 4.18.0-372.26.1.el8_6.x86_64 and Calico's eBPF data plane enabled. I agree with you, @seh.
Just to make sure we're following along on this side, did you all see Jiri's candidate patch that he mentioned in https://github.com/projectcalico/calico/issues/6865#issuecomment-1286936333?
With Jiri's candidate patch (kernel 4.18.0-372.26.1.el8_6.BZ_2136229_test_V1.x86_64) I'm not able to reproduce the issue anymore.
The fix has been proposed in the upstream kernel: https://patchwork.kernel.org/project/netdevbpf/patch/559cea869928e169240d74c386735f3f95beca32.1666858629.git.jbenc@redhat.com/
@jepio has built patched images (https://bincache.flatcar-linux.net/images/amd64/3346.1.99+issue-378-fix/) and @seh is testing them; this may also be interesting for others following along.
So far, after five hours running with both GRO and GSO enabled, the machine (EC2 instance of type "z1d.12xlarge") has not crashed yet. Another machine running Flatcar Container Linux beta version 3346.1.0 and the same configuration otherwise (same EC2 instance type, same AZ, same workload) fails at least twice every hour.
The patch is queued up in netdev/next; as soon as it lands in Linus' tree it can be submitted to stable. https://lore.kernel.org/netdev/166753501670.4086.1819802414418539212.git-patchwork-notify@kernel.org/#t
I see that the patch is present along Linux's "master" branch and is tagged with "v6.1-rc5" as of three days ago.
This patch is in 5.15.79, which is in beta as of yesterday (3417.1.0).
@seh, want to verify and then we'll close this issue at last?
We've been using this fix for about six weeks now without noticing any of these failures. I consider this problem fixed. Thank you for all of your help with this one. It was quite a journey.
Description
On AWS EC2 instances using Flatcar Container Linux versions 2765.1.0 and 2801.1.0 from the Beta channel, with a kOps-provisioned Kubernetes installation on top, we encounter a kernel bug that causes the machines to stop and reboot immediately.
The log entries in journalctl appear as follows:
Sometimes the line number in file net/core/skbuff.c is 3996 instead of 4008. Usually we'll see 3996 cited; after the machine reboots, we see 4008 thereafter, suggesting that the reboot swapped some updated files into place.
Note that we have locksmithd disabled, but update-engine is enabled, so we're downloading updates but not putting them into use eagerly.
Impact
Our fleet of Kubernetes cluster machines reboot periodically, causing the containers running on them to exit without warning and be replaced (in most cases) by the kubelet after a short delay.
Environment and steps to reproduce
a. Launch an EC2 instance using Flatcar Container Linux, perhaps via a supervising ASG.
b. Allow various Kubernetes components to start (e.g. kubelet, CNI daemons).
c. Periodically check the machine's last boot time.
d. Inspect system logs with a command like journalctl --grep=skbuff (a sketch of steps c and d follows below).
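For steps c and d, a minimal per-machine sketch (run locally or over SSH; these particular commands are just one way to do it):

who -b                                  # report the last boot time
journalctl --grep=skbuff --no-pager     # list any skbuff-related kernel BUG entries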
The machine will hum along normally, downloading updates occasionally, and running containers for Kubernetes workload. With no warning, the machine will reboot. Subsequent inspection of the log via journalctl shows a message like this:
One variation:
After the machine boots, the /sys/fs/pstore directory mentioned here exists, but is empty. The "pstore" mount entry is as follows:
Perhaps our hardware does not support pstore, per the following uname -a output:
Expected behavior
The machine should continue running normally without encountering errors that cause it to reboot without warning.
Additional information
We run similar Kubernetes clusters in several other AWS regions:
We have not seen this failure occur in those regions. We see it predominantly in "eu-west-1" and occasionally in "us-east-2." That could be due to more intense workload in the clusters in the former region.