bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Unresponsive/unreachable Bottlerocket EKS nodes #4075

Open Pluies opened 3 days ago

Pluies commented 3 days ago

Hey folks,

I'm coming to you with an odd issue: Bottlerocket EKS nodes becoming unresponsive and unreachable, while still showing as happily running from EC2's point of view.

Image I'm using:

ami-089e696e7c541c61b (amazon/bottlerocket-aws-k8s-1.29-aarch64-v1.20.2-536d69d0)

System Info:
  Kernel Version:             6.1.90
  OS Image:                   Bottlerocket OS 1.20.2 (aws-k8s-1.29)
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.6.31+bottlerocket
  Kubelet Version:            v1.29.1-eks-61c0bbb
  Kube-Proxy Version:         v1.29.1-eks-61c0bbb

We use Karpenter as a node provisioner.

What I expected to happen:

Node runs smoothly.

What actually happened:

A large node (m7g.8xlarge) is running a memory-heavy JVM workload from a StatefulSet that takes up most of the machine (request=115Gi out of 128Gi).
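
For reference, the fit is tight: the pod requests nearly all of the node's allocatable memory. A quick way to see the numbers (node and pod names below are placeholders):

# Node memory: capacity vs allocatable (what the kubelet will actually hand out)
kubectl get node <node-name> -o jsonpath='{.status.capacity.memory}{"\n"}{.status.allocatable.memory}{"\n"}'
# Memory requested by the workload's containers
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources.requests.memory}{"\n"}'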

Node runs happily for a while (~ 24h to several days), until it suddenly stops reporting status to Kubernetes:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.

Kubernetes marks that node as NotReady and taints it with node.kubernetes.io/unreachable, then tries to delete the pod running our large workload. That doesn't work (the node is fully unresponsive), so that pod is stuck in status Terminating.

The node is unreachable via kubectl debug node ... (which times out) and via AWS SSM (which reports "SSM Agent is not online", "Ping status: Connection lost"). The EC2 Serial Console is empty (all black). We do not have SSH enabled on these machines.
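
For reference, those checks look roughly like this (node name and instance ID are placeholders):

# Try to get a debug shell on the node (times out in our case)
kubectl debug node/<node-name> -it --image=busybox
# Check SSM agent connectivity for the instance ("Ping status: Connection lost" in our case)
aws ssm describe-instance-information --filters "Key=InstanceIds,Values=<instance-id>"
# Grab whatever console output EC2 has captured (empty in our case)
aws ec2 get-console-output --instance-id <instance-id> --latest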

However, the node still appears as running from the EC2 console, reachability metrics are green, and we can see CPU / network metrics flowing in.

I've managed to get system logs by taking an EBS snapshot of the Bottlerocket data volume and restoring it to a separate volume for investigation. Unfortunately this was not helpful: logs appear normal until the time the node dies, then suddenly stop. There is no indication that anything in particular (kubelet, containerd, etc.) crashed and brought the instance down, but also, suspiciously, no logs at all from the moment the node went unresponsive.
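
In case it helps anyone else, the snapshot-and-inspect workflow is roughly the following; IDs, the availability zone and the NVMe device name are placeholders and may differ on your setup, and the journal path is where we found it on the data volume:

# Snapshot the dead node's data volume and restore it as a fresh volume
aws ec2 create-snapshot --volume-id <data-volume-id> --description "bottlerocket debug"
aws ec2 create-volume --snapshot-id <snapshot-id> --availability-zone <az>
aws ec2 attach-volume --volume-id <new-volume-id> --instance-id <helper-instance-id> --device /dev/sdf
# On the helper instance: mount the data partition and read the persisted journal
sudo mkdir -p /mnt/br-data && sudo mount /dev/nvme1n1p1 /mnt/br-data
sudo journalctl -D /mnt/br-data/var/log/journal --no-pager > journal.log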

How to reproduce the problem:

No clear way to reproduce unfortunately; we've seen this happen sporadically on maybe half a dozen instances over the past few weeks, out of several hundred.

This issue is very annoying: I don't mind pods crashing and/or getting evicted sometimes, or even kubelet/containerd crashing, but I'd expect the node to self-heal eventually. Instead, our workloads get stuck, and we have to manually delete the pod and/or the node to get things back to normal. Even worse, I can't see a way to debug this properly or get to the bottom of it.
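
For completeness, the manual cleanup boils down to something like this (names are placeholders; the old pod may technically still be running on the zombie node, which is why we get rid of the instance as well):

# Force-remove the stuck Terminating pod so the StatefulSet controller can reschedule it
kubectl delete pod <pod-name> --grace-period=0 --force
# Remove the dead node object, and terminate the instance if your provisioner doesn't do it for you
kubectl delete node <node-name>
aws ec2 terminate-instances --instance-ids <instance-id>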

Would you have any idea of a better way to debug this?

Thank you!

(Note: I also opened an AWS support ticket to see if there's any AWS-level issue at play here, but this seems to happen only on this specific workload on Bottlerocket nodes, so I suspect something is off here.)

yeazelm commented 2 days ago

Thanks for opening this issue @Pluies. This does look concerning and we would like to get to the bottom of this! I'll look internally for that support ticket to see if we can find out what is happening.

Pluies commented 2 days ago

Okay, I've got good news: after 13 days in this state, one node came back to life! Well, kind of. We received a NodeClockNotSynchronising alert and realised the Prometheus node exporter had started responding to scrapes again (???).

Apart from that, no change: the node still shows as NotReady and remains unreachable via kubectl debug, AWS SSM, etc.

I've taken a new snapshot of the Bottlerocket data volume, and this time I'm seeing new logs in the journal; attaching it here: journal-after-rebirth.log

I think this has the meat of it (kernel messages + OOM kills).

There are some more messages later showing kubelet errors; I assume the node is so far gone at this stage that it's unable to refresh its credentials and rejoin the cluster.

yeazelm commented 2 days ago

Those logs do help. It looks like this could be related to EBS availability, since it seems to come down to hung tasks in the storage driver:

INFO: task jbd2/nvme2n1-8:6604 blocked for more than 122 seconds.
kernel: fbcon: Taking over console
kernel:       Not tainted 6.1.90 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:jbd2/nvme2n1-8  state:D stack:0     pid:6604  ppid:2      flags:0x00000008
kernel: Call trace:
kernel:  __switch_to+0xbc/0xf0
kernel:  __schedule+0x258/0x6e0
kernel:  schedule+0x58/0xf0
kernel:  jbd2_journal_wait_updates+0x88/0x130
kernel:  jbd2_journal_commit_transaction+0x264/0x1bc0
kernel:  kjournald2+0x15c/0x470
kernel:  kthread+0xd0/0xe0

Can you confirm how many instances this happened to and what storage type you are using for your volumes? I'll keep digging in as best I can as well!
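
In the meantime, the per-volume CloudWatch metrics around the time of the hang might show whether the EBS volume itself stalled; roughly something like this (volume ID and time window are placeholders):

# Queue length for one of the attached volumes around the time the node went quiet
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=<volume-id> \
  --start-time 2024-06-25T13:00:00Z --end-time 2024-06-25T14:00:00Z \
  --period 300 --statistics Average Maximum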

Pluies commented 2 days ago

Sure, storage-wise the machine has 4 EBS volumes attached. For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume).

To keep everyone in sync: AWS Support seems to think this is due to memory pressure. The large workload has no memory limit, so this is definitely a possibility. We've been working off the assumption that even if it does use too much memory, we would still be protected by the OOM killer and other safeguards such as kube-reserved and system-reserved memory, but I realise this assumption might be wrong.

The node's metrics also point to memory pressure:

# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 683567.871622
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 3823.9046439999997
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 18942.701305
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 1.141417592027e+06
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 1.141678870877e+06

The memory-related stall, 1.141678870877e+06 seconds, works out to just over 13 days (1,141,679 s ÷ 86,400 s/day ≈ 13.2 days), which matches pretty well how long the node was dead for. This might just be a coincidence if this counter is shared across all processes, but it's interesting nonetheless.
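
For background, these node_pressure_* series are the kernel's PSI (pressure stall information) counters: node exporter reads them from /proc/pressure and converts the totals from microseconds to seconds. On a live node (e.g. from the admin container) they can be read directly:

# Raw PSI counters; "total" is cumulative stall time in microseconds
cat /proc/pressure/memory
cat /proc/pressure/io
cat /proc/pressure/cpu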

On another node (not dead 😄 ) with the same workload, these metrics come up as:

# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 10647.531140000001
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 5830.272114
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 10842.729424000001
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 4560.770407999999
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 4941.830446999999

Anyway, I'm now adding a memory limit to the large workload, so if this is memory-related, that should fix the issue for us. 👍
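
Concretely that's just a limit on the JVM container, something like this (StatefulSet and container names are placeholders, and the value still needs tuning):

# Cap the container so it gets OOM-killed at its own limit instead of starving the whole node
kubectl set resources statefulset/<name> -c <jvm-container> --limits=memory=115Gi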

If this is the case though, is it still potentially a Bottlerocket issue? I'd have expected either a full crash or eventual self-healing, but we ended up in this half-broken state instead. On the other hand, it would also be reasonable for Bottlerocket not to support unbounded workloads.

yeazelm commented 19 hours ago

> For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)

One thing I want to call out before it gets lost in the rest: I don't think you are benefiting from making the root volume 50GB. By default, Bottlerocket won't expand to use anything more than the default size (typically 2GB) of the root volume; everything that might need variable storage space should end up on the second (data) volume. I'd like to double-check whether you could save on some EBS costs by avoiding the additional 48GB of unused space.

> AWS Support seems to think this is due to memory pressure.

I do think the memory pressure is causing some of this problem, but I don't think it should result in the node ending up in this state. The fact that the rest of the host is starved, even the storage layer, feels off at first glance. I'd like to see if the memory limit helps you avoid this, but I still think there is something odd going on here. The OOM killer should be able to recover and protect the rest of the OS from situations like this. Do you have any special configuration for this workload, like setting additional privileges or capabilities?
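
Something like this would show any special settings on the pod template (StatefulSet name is a placeholder):

# Pod-level and container-level securityContext (privileged mode, added capabilities, etc.)
kubectl get statefulset <name> -o jsonpath='{.spec.template.spec.securityContext}{"\n"}'
kubectl get statefulset <name> -o jsonpath='{.spec.template.spec.containers[*].securityContext}{"\n"}'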