Thank you for reporting, @seh, we'll have a look. Do you think you could try to find a reliable repro case for this issue?
That is going to be very difficult, as so far it amounts to, "Run this Kubernetes cluster with this workload."
We have confirmed that downgrading to the Flatcar Container Linux beta version 2705.1.2 alleviates the problem. Again, versions 2765.1.0 and 2801.1.0 both suffer this same kernel bug.
Looking at the workload that runs on all the machines on which we've seen this occur, we identified only three in common:
We disabled Vector and proved that that was not the culprit. It wasn't feasible to disable "calico-node" and still have a functional Kubernetes cluster. (Swapping a CNI implementation in a production-grade cluster is a delicate operation.) We did not get as far as disabling Prometheus node exporter, though we're running it on every machine in several other Kubernetes clusters—that just happen to be less busy—so it's not likely it's at fault.
I neglected to mention earlier that in our clusters where Flatcar Container Linux's locksmithd service is enabled, we don't see this bug arise. In our clusters where update-engine is enabled but locksmithd is disabled, the bug occurs on 10-15 out of 200 machines every day.
Interesting, thank you for sharing. While we're still looking for a solid repro the information you've provided will help with narrowing down the issue.
I mentioned that on our machines where both update-engine and locksmithd are enabled, we don't see this kernel bug arise. However, I did notice something odd in the system logs on those machines.
I've been polling our machines regularly via SSH, running a command like journalctl --grep=skbuff and collecting the output, in order to see how often and on which machines the kernel bug has been occurring. On some of the machines with locksmithd enabled, I see output from that command like this:
Notice how it keeps flapping between two different IDs. What does that indicate?
Are these machines in-place-upgraded from CoreOS or were they freshly installed with Flatcar?
These are fresh "installations" on EC2 instances by way of the published AMIs.
Hello. We've noticed the same behavior here in our environment on VMware-provisioned VMs.
A few times a day, the box that happens to be the busiest in terms of network load crashes with the following:
May 19 18:30:45 rnqkbm401 kernel: kernel BUG at net/core/skbuff.c:4008!
-- Boot 757516b861de4db8a139aa895db71803 --
May 19 18:31:02 localhost kernel: Linux version 5.10.37-flatcar (build@pony-truck.infra.kinvolk.io) (x86_64-cros-linux-gnu-gcc (Gentoo Hardened 9.3.0-r1 p3) 9.3.0, GNU ld (Gentoo 2.35 p1) 2.35.0) #1 SMP Mon May 17 22:08:55 -00 2021
Happens on 5.10.32-flatcar as well. pstore is empty too.
Less loaded VMs (in terms of network I/O), even running the same kernel, don't crash. We also run Kubernetes on them with Calico as our CNI (with eBPF mode enabled).
We have another Kubernetes cluster with the same setup, but it's still running an older kernel (e.g., Flatcar Container Linux by Kinvolk 2605.10.0 (Oklo), 5.4.83-flatcar), and there are no crashes at all.
Just a quick update: since disabling GSO and GRO on the box, it hasn't crashed in four days. We're still monitoring it, but that's already a great sign; it used to crash every day or every other day.
Almost 7 days now without crashing since GSO/GRO got disabled.
How did you disable those, Igor?
ethtool -K <iface name> gso off
ethtool -K <iface name> gro off
where <iface name> is your network interface, e.g. eth0.
We have a systemd unit now to disable it on boot.
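Roughly, such a boot-time unit can look like the sketch below; the unit name, the eth0 interface, and the ethtool path are placeholders to adapt, not our exact unit.

# /etc/systemd/system/disable-offload.service (illustrative sketch)
[Unit]
Description=Disable GSO and GRO offload on eth0
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
# ethtool's path may differ per distribution.
ExecStart=/usr/sbin/ethtool -K eth0 gso off
ExecStart=/usr/sbin/ethtool -K eth0 gro off

[Install]
WantedBy=multi-user.target

Enable it once with sudo systemctl enable --now disable-offload.service so it runs on every boot.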
Is this still occurring with the most recent releases? Could you also test Alpha, which has kernel 5.15 and might no longer trigger this?
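For reference, switching an existing machine's update channel to Alpha should roughly amount to the following; this is from memory of the Flatcar docs, and on older installations the file may live at /etc/coreos/update.conf instead.

# Overwrites the file; merge with any settings you already have there.
echo 'GROUP=alpha' | sudo tee /etc/flatcar/update.conf
sudo systemctl restart update-engine
sudo update_engine_client -check_for_update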
We've seen it most recently two weeks ago with kernel version 5.10.84, which we received by way of an upgrade when rebooting one of our machines that started life with Flatcar version 2705.1.2.
It may be another couple of weeks before I can offer any testing outcome. What changed recently that you think may alleviate this problem?
Nothing, @seh; I was just hoping it might have resolved itself.
Would you be able to capture the full splat on the serial console, including the stacktrace?
Next time I see it, I will grab all I can from journalctl. Is there another source you’re recommending that I collect as well?
In case your system has a pstore backend, you may find dmesg traces in /var/lib/systemd/pstore/ on the next boot. The files get moved there for persistent storage instead of staying in /sys/fs/pstore. I'll update the docs (Edit: done here https://github.com/flatcar-linux/flatcar-docs/pull/206).
Note that I mentioned in my initial description that our pstore directory winds up empty after these reboots, perhaps for lack of hardware support.
Maybe, but it could be that systemd-pstore.service ran and moved them to /var/lib/systemd/pstore; that's what I wanted to hint at.
Edit: check whether you have pstore support by looking at whether /sys/module/pstore/parameters/backend contains something other than (null).
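In other words, a quick check using the paths above:

cat /sys/module/pstore/parameters/backend   # prints (null) when no pstore backend is available
ls /sys/fs/pstore                           # traces appear here right after the crash
ls /var/lib/systemd/pstore                  # systemd-pstore.service moves them here on the next boot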
@igcherkaev
ethtool -K <iface name> gso off
ethtool -K <iface name> gro off
where <iface name> is your network interface, e.g. eth0.
We have a systemd unit now to disable it on boot.
Is this still working for you?
This is still happening to us with Flatcar Container Linux version 3227.1.1.
This still seems to be happening in 3033.3.5
Worse, I don't seem to be able to apply the workaround:
$ sudo ethtool -K eth0 gso off
Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot get device udp-fragmentation-offload settings: Operation not supported
We suffered through this bug overnight and into this morning, and we've found that the workaround suggested by @igcherkaev in https://github.com/flatcar/Flatcar/issues/378#issuecomment-847189739 works acceptably, so long as we're using Flatcar Container Linux with a kernel at version 5.15 or so. A kernel that new wasn't enough without disabling generic receive and segmentation offload, and disabling that offload wasn't enough without a new enough kernel. In particular, kernel version 5.10.137, as offered by the LTS 3033.3.5 release, wasn't new enough.
Here are the two systemd units I wrote to ensure that we toggle the offload off.
Using the stable Flatcar Container Linux version 3227.2.2 atop kernel version 5.15.63, we see this kernel bug occur in file net/core/skbuff.c on line 4219, where the check on list_skb->head_frag fails because that (bit)field is false.
If we disable GRO and GSO (we're not yet sure whether it's crucial to disable both), we skirt this kernel bug, but network performance suffers so drastically that we can't afford to run our workload like that.
Can we engage the upstream maintainers despite not having a clear trace, and has anyone started that discussion? The source code referenced by the BUG at net/core/skbuff.c:123! messages, together with the workarounds, may already give some hints.
We noticed that when running with GRO and GSO enabled again, with the MTU on the eth0 interface ratcheted down from the default 9,001 to 1,500, this time using Flatcar Container Linux beta version 3346.1.0 and kernel version 5.15.70 atop the "m5.4xlarge" EC2 instance type, a different problem arises: instead of the kernel reporting through the BUG_ON macro and rebooting, it reports a hardware checksum failure and keeps going, albeit with degraded network performance afterward.
Please see this log fragment for an example.
Here are CPU details via lscpu:
On this machine, the pstore facility remains unavailable to us.
Running with GRO and GSO enabled and the MTU for the eth0 interface back up at 9,000, this time using Flatcar Container Linux version 3227.2.2 and kernel version 5.15.63 atop the "z1d.12xlarge" EC2 instance type, the kernel bug does arise, but now we're getting more diagnostic output, per the following log fragment.
Thanks @seh, these kinds of logs are enough to start a discussion on lkml. I'll start a thread. Just to be sure I have all the facts straight: this is using ENA?
If by ENA you mean Elastic Network Adapter, then I think the answer is yes. We didn't do anything deliberate to choose that, but running modinfo ena shows that the module is installed.
My colleague @nbourikas disabled panic upon softlockup and was able to capture a more detailed failure trace atop kernel version 5.15.70 and Calico version 3.21.5.
@tomastigera and @fasaxc, note this frame in that call stack, in between tcp_gso_segment and inet_gso_segment:
bpf_prog_2e6f5613f50238c5_calico_to_host_ep+0xa40/0x2cc8
@seh, would you be able to test with calico 3.23? This PR https://github.com/projectcalico/calico/pull/5753/files makes calico stop changing gso_size on vxlan decapsulation, which lkml suggests might be the cause (https://lore.kernel.org/netdev/194f6b02-8ee7-b5d7-58f3-6a83b5ff275d@gmail.com/).
Thank you for the suggestion. Yes, we've been testing Calico version 3.23.3 over the last couple of days together with Flatcar's beta version 3346.1.0. So far, we haven't been hitting this kernel bug. I'll have more confidence after another day or two of testing.
Apparently our testing did not tell the full story. It was a late one last night.
We've now seen this same kernel failure occur using Calico 3.23.2 with Flatcar Container Linux 3346.1.0 (kernel version 5.15.70) and Ubuntu 22.04.1 ("Jammy Jellyfish") (kernel version 5.15.0). The line number in file skbuff.c moves by one from 4218 to 4217 in the Ubuntu image. Disabling GRO and GSO again alleviates the rebooting problem for the moment, still at great cost for network performance.
That confirms for us that the problem is not specific to Flatcar Container Linux, but it does seem to be related to Calico's eBPF data plane.
Same issue (kernel BUG at net/core/skbuff.c:4082) on Red Hat Enterprise Linux release 8.6 (Ootpa) with kernel 4.18.0-372.26.1.el8_6.x86_64 and Calico's eBPF data plane enabled. I agree with you, @seh.
Just to make sure we're following along on this side, did you all see Jiri's candidate patch that he mentioned in https://github.com/projectcalico/calico/issues/6865#issuecomment-1286936333?
With Jiri's candidate patch (kernel 4.18.0-372.26.1.el8_6.BZ_2136229_test_V1.x86_64) I'm not able to reproduce the issue anymore.
The fix has been proposed in the upstream kernel: https://patchwork.kernel.org/project/netdevbpf/patch/559cea869928e169240d74c386735f3f95beca32.1666858629.git.jbenc@redhat.com/
@jepio has built patched images (https://bincache.flatcar-linux.net/images/amd64/3346.1.99+issue-378-fix/) and @seh is testing them; this may also be interesting for others following along.
So far, after five hours running with both GRO and GSO enabled, the machine (EC2 instance of type "z1d.12xlarge") has not crashed yet. Another machine running Flatcar Container Linux beta version 3346.1.0 and the same configuration otherwise (same EC2 instance type, same AZ, same workload) fails at least twice every hour.
The patch is queued up in netdev/next; as soon as it lands in Linus' tree it can be submitted to stable. https://lore.kernel.org/netdev/166753501670.4086.1819802414418539212.git-patchwork-notify@kernel.org/#t
I see that the patch is present along Linux's "master" branch and is tagged with "v6.1-rc5" as of three days ago.
This patch is in 5.15.79, which is in beta as of yesterday (3417.1.0).
@seh, want to verify and then we'll close this issue at last?
We've been using this fix for about six weeks now without noticing any of these failures. I consider this problem fixed. Thank you for all of your help with this one. It was quite a journey.
Description
On AWS EC2 instances using Flatcar Container Linux versions 2765.1.0 and 2801.1.0 from the Beta channel, with a kOps-provisioned Kubernetes installation on top, we encounter a kernel bug that causes the machines to stop and reboot immediately.
The log entries in journalctl appear as follows:
Sometimes the line number in file net/core/skbuff.c is 3996 instead of 4008. Usually we'll see 3996 cited; after the machine reboots, we see 4008 thereafter, suggesting that the reboot swapped some updated files into place.
Note that we have locksmithd disabled, but update-engine is enabled, so we're downloading updates but not putting them into use eagerly.
Impact
Our fleet of Kubernetes cluster machines reboot periodically, causing the containers running on them to exit without warning and be replaced (in most cases) by the kubelet after a short delay.
Environment and steps to reproduce
a. Launch an EC2 instance using Flatcar Container Linux, perhaps via a supervising ASG.
b. Allow various Kubernetes components to start (e.g. kubelet, CNI daemons).
c. Periodically check the machine's last boot time.
d. Inspect system logs with a command like journalctl --grep=skbuff (a sketch of steps c and d follows below).
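For steps c and d, a minimal per-machine sketch (run locally or over SSH; these particular commands are just one way to do it):

who -b                                  # report the last boot time
journalctl --grep=skbuff --no-pager     # list any skbuff-related kernel BUG entries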
The machine will hum along normally, downloading updates occasionally, and running containers for Kubernetes workload. With no warning, the machine will reboot. Subsequent inspection of the log via journalctl shows a message like this:
One variation:
After the machine boots, the /sys/fs/pstore directory mentioned here exists, but is empty. The "pstore" mount entry is as follows:
Perhaps our hardware does not support pstore, per the following uname -a output:
Expected behavior
The machine should continue running normally without encountering errors that cause it to reboot without warning.
Additional information
We run similar Kubernetes clusters in several other AWS regions:
We have not seen this failure occur in those regions. We see it predominantly in "eu-west-1" and occasionally in "us-east-2." That could be due to more intense workload in the clusters in the former region.