awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution
2.42k stars 1.14k forks

PLEG errors with v20221027, kernel 5.4.217-126.408.amzn2 #1071

Closed ravisinha0506 closed 1 year ago

ravisinha0506 commented 1 year ago

What happened: We noticed an increased rate of PLEG errors while running our workload with v20221027 AMI release. The same issue doesn't appear with v20220926.

What you expected to happen: PLEG errors shouldn't appear with the latest AMI

How to reproduce it (as minimally and precisely as possible): This issue can be easily reproduced with high CPU/memory workloads

Anything else we need to know?: So far it looks like an issue associated with the latest kernel version 5.4.217-126.408.amzn2.

Environment:

rrajeevan commented 1 year ago

@ravisinha0506 since you're tracking this, we're seeing the same issues on AMIs built on kernel-5.4.214-120.368.amzn2 as well

https://alas.aws.amazon.com/AL2/ALASKERNEL-5.4-2022-036.html

mschenk42 commented 1 year ago

Prior to this latest release we attempted to upgrade the kernel on top of the AMI Release v20220926. We experienced instability when doing this. So we decided to wait for this latest release with the kernel upgrades that patch the latest CVEs. Unfortunately we experienced the same issues. We are reasonably confident this is due to the kernel upgrade.

The issues that we are seeing now are mostly related to readiness probes using exec commands. They are failing at a much higher rate. It seems to be either a latency issue or some sort of race condition causing the failures. It is also noteworthy that we see these failures on high-volume pods/nodes.

Kernel version: 5.4.217-126.408.amzn2.x86_64
OS Image: Amazon Linux 2
Container runtime version: containerd://1.6.6
kubelet version: v1.22.12-eks-ba74326
kube-proxy version: v1.22.12-eks-ba74326
Operating system: linux
Architecture: amd64
umeat commented 1 year ago

@mschenk42 Seconding this - we experienced the same instability with v20220926 and the new kernel. We had to roll back.

rongsheng-fang commented 1 year ago

We are also experiencing the same issue with v20221027.

kevinsimons-wf commented 1 year ago

We have been looking into this issue as well in our clusters and have seen individual containers on some nodes enter a state where no docker command that interacts with the container ever completes. As mentioned, this was first seen in the kernel-5.4.214-120.368.amzn2 version. I'll post some of our investigation here in case it is useful.

Node events show flapping between NodeReady and NodeNotReady

Events:
  Type    Reason        Age                     From     Message
  ----    ------        ----                    ----     -------
  Normal  NodeReady     3m11s (x15 over 4d20h)  kubelet  Node ip-10-32-145-43.ec2.internal status is now: NodeReady
  Normal  NodeNotReady  10s (x15 over 56m)      kubelet  Node ip-10-32-145-43.ec2.internal status is now: NodeNotReady

That NodeNotReady status seems to come from the K8s PLEG failing

KubeletNotReady              PLEG is not healthy: pleg was last seen active 3m7.341793557s ago; threshold is 3m0s
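For anyone else triaging this, the same signal can be grepped out of the kubelet logs. A minimal sketch, run here against a made-up log sample (on a live node you'd feed it `journalctl -u kubelet` output instead; the sample lines are not real logs):

```shell
# Hypothetical sample of kubelet log lines; on a live node, replace the
# heredoc with: journalctl -u kubelet --no-pager
cat <<'EOF' > /tmp/kubelet-sample.log
kubelet: skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m7.341793557s ago; threshold is 3m0s
kubelet: some unrelated log line
EOF

# Count PLEG health failures recorded in the log
grep -c 'PLEG is not healthy' /tmp/kubelet-sample.log
```

A steadily growing count here lines up with the NodeReady/NodeNotReady flapping in the node events.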

All docker commands against the affected container hang, but commands against other containers complete just fine

.:/home/kevin_simons$ docker inspect de3577202e88
^C
.:/home/kevin_simons$ docker kill de3577202e88
^C

.:/home/kevin_simons$ docker inspect 4c53f702cd55
[
    {
        "Id": "4c53f702cd556d79c4787c96dc8f133c80a5791909822b50e2f6470517985b85",
        "Created": "2022-11-01T15:36:41.844751335Z",
        "Path": "/init/init.sh",
...
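Since docker inspect never returns for an affected container, bounding the call with timeout gives a quick responsive-vs-hung check per container. A sketch (the container ID is a placeholder and the 5-second threshold is arbitrary):

```shell
# Prints "<id> responsive" if docker inspect completes within 5 seconds,
# otherwise "<id> hung-or-missing" (which also covers a nonexistent ID
# or a machine without docker installed).
check_container() {
  if timeout 5 docker inspect "$1" > /dev/null 2>&1; then
    echo "$1 responsive"
  else
    echo "$1 hung-or-missing"
  fi
}

check_container de3577202e88
```

Treat this as a triage hint rather than a diagnosis: it can't distinguish a wedged container from one that simply doesn't exist.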

The affected container seems to have an additional runc process compared to other working containers.

.:/home/kevin_simons$ ps -aux | grep de3577202e88
root      8856  0.0  0.0 1174840 10412 ?       Sl   11:24   0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/log.json --log-format json exec --process /tmp/runc-process1560347957 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8
root     19055  0.0  0.0 712716 10500 ?        Sl   Nov01   1:04 /usr/bin/containerd-shim-runc-v2 -namespace moby -id de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8 -address /run/containerd/containerd.sock
root     19747  0.0  0.0 119424   916 pts/0    S+   19:11   0:00 grep --color=auto de3577202e88

.:/home/kevin_simons$ ps -aux | grep 6adcc2ab7a4d
root     19850  0.0  0.0 119424   960 pts/0    S+   19:11   0:00 grep --color=auto 6adcc2ab7a4d
root     27816  0.0  0.0 712204 10424 ?        Sl   Nov01   0:05 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 6adcc2ab7a4dd25bfc93f96c440c5be97cd4afe82ffbddedd9217641a7d96a66 -address /run/containerd/containerd.sock
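To inventory which containers on a node have one of these lingering runc exec processes, the process table can be scanned for them. A sketch over a saved ps capture with made-up but representative lines (on a live node, pipe `ps aux` straight into the awk):

```shell
# Saved ps aux output: the stuck container has both a shim and a
# lingering `runc ... exec`; the healthy one has only a shim.
cat <<'EOF' > /tmp/ps-sample.txt
root  8856 0.0 0.0 1174840 10412 ? Sl 11:24 0:00 runc --root /var/run/docker/runtime-runc/moby --log-format json exec --process /tmp/runc-process1560347957 --detach de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8
root 19055 0.0 0.0 712716 10500 ? Sl Nov01 1:04 /usr/bin/containerd-shim-runc-v2 -namespace moby -id de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8 -address /run/containerd/containerd.sock
root 27816 0.0 0.0 712204 10424 ? Sl Nov01 0:05 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 6adcc2ab7a4dd25bfc93f96c440c5be97cd4afe82ffbddedd9217641a7d96a66 -address /run/containerd/containerd.sock
EOF

# In `ps aux` output the command starts at field 11; a `runc ... exec`
# process names the container ID as its last argument.
awk '$11 == "runc" && / exec / {print $NF}' /tmp/ps-sample.txt
```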

The strace of that runc process

:/home/kevin_simons$ strace -s 800 -T -y -yy -f -p 8856
strace: Process 8856 attached with 7 threads
[pid  8868] futex(0xc000280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  8864] futex(0x55ea78bd4558, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  8863] futex(0xc000080148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  8862] futex(0x55ea78ba5db8, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000009>
[pid  8861] futex(0x55ea78bd46a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  8862] read(9<pipe:[43673545]>, "", 4096) = 0 <0.000009>
[pid  8860] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid  8862] epoll_ctl(4<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 9<pipe:[43673545]>, 0xc00003c654 <unfinished ...>
[pid  8860] <... restart_syscall resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
[pid  8862] <... epoll_ctl resumed> )   = 0 <0.000073>
[pid  8860] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid  8862] close(9<pipe:[43673545]>)   = 0 <0.000017>
[pid  8856] futex(0x55ea78ba47a8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  8862] futex(0xc000280148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  8868] <... futex resumed> )       = 0 <0.000335>
[pid  8862] <... futex resumed> )       = 1 <0.000028>
[pid  8868] epoll_pwait(4<anon_inode:[eventpoll]>,  <unfinished ...>
[pid  8860] <... nanosleep resumed> NULL) = 0 <0.000120>
[pid  8868] <... epoll_pwait resumed> [], 128, 0, NULL, 0) = 0 <0.000020>
[pid  8862] openat(AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid", O_RDWR|O_CREAT|O_EXCL|O_SYNC|O_CLOEXEC, 0666 <unfinished ...>
[pid  8868] epoll_pwait(4<anon_inode:[eventpoll]>,  <unfinished ...>
[pid  8862] <... openat resumed> )      = 7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid> <0.000020>
[pid  8860] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid  8862] epoll_ctl(4<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=678758216, u64=139793274308424}}) = -1 EPERM (Operation not permitted) <0.000011>
[pid  8862] write(7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid>, "8874", 4) = 4 <0.000015>
[pid  8860] <... nanosleep resumed> NULL) = 0 <0.000094>
[pid  8862] close(7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid> <unfinished ...>
[pid  8860] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid  8862] <... close resumed> )       = 0 <0.000018>
[pid  8862] newfstatat(AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid", 0xc0000a6928, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory) <0.000012>
[pid  8862] renameat(AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid", AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid" <unfinished ...>
[pid  8860] <... nanosleep resumed> NULL) = 0 <0.000082>
[pid  8862] <... renameat resumed> )    = 0 <0.000020>
[pid  8860] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid  8862] exit_group(0 <unfinished ...>
[pid  8868] <... epoll_pwait resumed> [{0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, 
{u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|EPOLLEXCLUSIVE|0xb5fd800, {u32=32548, u64=8653847918727429924}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=180224, u64=824633901056}}, {EPOLLERR|EPOLLRDNORM|EPOLLWRNORM|0x80000, {u32=192, u64=554050781376}}, {EPOLLIN, {u32=147488, u64=824633868320}}, {0, {u32=0, u64=0}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=1, u64=94463510708225}}, {EPOLLRDBAND, {u32=16777216, u64=4311744512}}, {0, {u32=1, u64=1}}, {0, {u32=0, u64=0}}, {0, {u32=524616, u64=824634245448}}, {0, {u32=0, u64=4294967296}}, {0, {u32=1, u64=1}}, {EPOLLOUT, {u32=0, u64=17179869184}}, {0, {u32=2014802745, u64=94465525510969}}, {EPOLLNVAL|0x24000, {u32=192, u64=633438956683456}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=2014796893, u64=94465525505117}}, {EPOLLERR|EPOLLHUP|EPOLLNVAL|EPOLLRDNORM|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0x1905000, {u32=32548, u64=774056185986852}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|0x280000, {u32=192, u64=5877328870998278336}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=188416, u64=824633909248}}, 
{EPOLLERR|EPOLLNVAL|EPOLLWRBAND|EPOLLMSG|EPOLLEXCLUSIVE|0xb5fd800, {u32=32548, u64=8653867727116599076}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=16384, u64=16384}}, {EPOLLERR|EPOLLHUP|EPOLLNVAL|EPOLLRDNORM|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0x1905000, {u32=32548, u64=2915158371545939748}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|0x280000, {u32=192, u64=8652919497121857728}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=459267656, u64=139793054817864}}, {EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLWRBAND|EPOLLMSG|EPOLLEXCLUSIVE|0xb5fd800, {u32=32548, u64=8653873585451990820}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=8608, u64=824633729440}}, {EPOLLPRI, {u32=4, u64=11295970146910212}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=8608, u64=824633729440}}, {EPOLLNVAL|EPOLLWRBAND|EPOLLMSG|0x198000, {u32=192, u64=8654521438318887104}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=8608, u64=824633729440}}, {0, {u32=0, u64=45035996273704960}}, {0, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|0x280000, {u32=192, u64=11295970146910400}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=2015037669, u64=94465525745893}}, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLWRNORM|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLWAKEUP|EPOLLONESHOT|0x853c800, {u32=21994, u64=2017589886915204586}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=2043835376, u64=94465554543600}}, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLWRBAND|EPOLLRDHUP|EPOLLEXCLUSIVE|0xbffc800, {u32=32548, u64=2017588847533129508}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=2630048, u64=824636350880}}, 
{EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLEXCLUSIVE|EPOLLWAKEUP|EPOLLONESHOT|0x81b0800, {u32=21994, u64=8670523657535116778}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=1365869643, u64=139793961419851}}, {0, {u32=0, u64=1972549148997582848}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=459269888, u64=139793054820096}}, {EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0x6124800, {u32=765780331, u64=2017588848298877291}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=469756511, u64=139793065306719}}, {EPOLLHUP|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLWRBAND|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLWAKEUP|EPOLLONESHOT|0x9d25000, {u32=21994, u64=2017589886915204586}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=3893522394, u64=15270817745933070298}}, {EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0xe364800, {u32=3555476408, u64=3555476408}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=9331877341841850368}}, {EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|EPOLLET|0x7f49800, {u32=0, u64=0}}, {EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|0xb5fc000, {u32=32548, u64=8778206098327895844}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=1360582031, u64=139793956132239}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, 
{0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}], 128, -1, NULL, 0) = 231 <0.000324>
[pid  8864] <... futex resumed>)        = ?
[pid  8868] +++ exited with 0 +++
[pid  8863] <... futex resumed>)        = ?
[pid  8864] +++ exited with 0 +++
[pid  8862] <... exit_group resumed>)   = ?
[pid  8863] +++ exited with 0 +++
[pid  8861] <... futex resumed>)        = ?
[pid  8862] +++ exited with 0 +++
[pid  8861] +++ exited with 0 +++
[pid  8860] <... nanosleep resumed> <unfinished ...>) = ?
[pid  8856] <... futex resumed>)        = ?
[pid  8860] +++ exited with 0 +++
+++ exited with 0 +++

We then see that, after running strace on that runc process, the container is automatically cleaned up and the node goes back to a Ready state.

.:/home/kevin_simons$ docker inspect de3577202e88
[]
Error: No such object: de3577202e88

At this point I hopped on another broken node since that one was cleaned up.

Running strace on the containerd shim process

.:/home/kevin_simons$ ps -aux | grep 3a2178477417
root     11587  0.0  0.0 712460 11564 ?        Sl   Nov01   0:44 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9 -address /run/containerd/containerd.sock
root     21193  0.0  0.0 1174584 10388 ?       Sl   05:28   0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9/log.json --log-format json exec --process /tmp/runc-process119335193 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9/3a7ac34beeea2c994a83dd8677b15b9c90045bf1d193f484b7ccf8d6106066e1.pid 3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9
root     32682  0.0  0.0 119424   924 pts/0    S+   19:18   0:00 grep --color=auto 3a2178477417

:/home/kevin_simons$ strace -s 800 -T -y -yy -f -p 11587
strace: Process 11587 attached with 12 threads
[pid 21281] epoll_pwait(5<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 11668] futex(0xc000080d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 21281] <... epoll_pwait resumed> [], 128, 0, NULL, 194962698) = 0 <0.000032>
[pid 21281] epoll_pwait(5<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 11597] futex(0xc000080948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11596] futex(0xd26e60, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11595] futex(0xcf8658, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000012>
[pid 11594] futex(0xc000080548, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11595] epoll_wait(9<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 11593] futex(0xd26d18, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11592] futex(0xc000080148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11591] futex(0xc000042948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11590] epoll_wait(10<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 11589] restart_syscall(<... resuming interrupted epoll_wait ...> <unfinished ...>
[pid 11587] futex(0xcf7a68, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11589] <... restart_syscall resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000021>
[pid 11589] futex(0xc000080d48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 11668] <... futex resumed> )       = 0 <0.000443>
[pid 11589] <... futex resumed> )       = 1 <0.000036>
[pid 11668] sched_yield( <unfinished ...>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 11668] <... sched_yield resumed> ) = 0 <0.000027>
[pid 11668] futex(0xcf6dc0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000014>
[pid 11668] futex(0xc000080d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11589] <... nanosleep resumed> NULL) = 0 <0.000093>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000090>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000095>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
...

That then loops through the FUTEX_WAIT_PRIVATE calls indefinitely.

Generating a stack trace of dockerd, which produces an extremely large file

.:/home/kevin_simons$ kill -SIGUSR1 $(pidof dockerd)

.:/home/kevin_simons$ journalctl -u docker.service -r
Nov 02 19:22:15 ip-10-32-134-101.ec2.internal dockerd[7078]: time="2022-11-02T19:22:15.973566811Z" level=info msg="goroutine stacks written to /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log"
...

.:/home/kevin_simons$ ls -al /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log
-rw-r--r-- 1 root root 10306382 Nov  2 19:22 /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log

Inside the stack trace we see what seems like thousands of goroutines similar to

goroutine 9403905 [semacquire, 528 minutes]:
sync.runtime_SemacquireMutex(0xc00103f204, 0x0, 0x1)
        /usr/lib/golang/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc00103f200)
        /usr/lib/golang/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/lib/golang/src/sync/mutex.go:81
github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent(0xc00000c1e0, 0xc012072016, 0x40, 0x0, 0x1, 0xc019970018, 0xc0096c22e0)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/daemon/inspect.go:40 +0xa30
github.com/docker/docker/daemon.(*Daemon).ContainerInspect(0xc00000c1e0, 0xc012072016, 0x40, 0x0, 0xc012072006, 0x4, 0xc014d321b0, 0x562e8d0ca91d, 0x7f20140ff498, 0x8)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/daemon/inspect.go:29 +0x11b
github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName(0xc000c2f480, 0x562e8f7ab640, 0xc0127a6f30, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0xc00bb4c801, 0xc01227f4a0)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/router/container/inspect.go:15 +0x116
github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1(0x562e8f7ab640, 0xc0127a6f30, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0x562e8f7ab640, 0xc0127a6f30)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/middleware/experimental.go:26 +0x17d
github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1(0x562e8f7ab640, 0xc0127a6f00, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0x203002, 0x203002)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/middleware/version.go:62 +0x617
github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1(0x562e8f7ab640, 0xc0127a6f00, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0x562e8f7ab640, 0xc0127a6f00)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/pkg/authorization/middleware.go:59 +0x822
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x562e8f79b680, 0xc00d82c2a0, 0xc01a318600)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/server.go:141 +0x258
net/http.HandlerFunc.ServeHTTP(0xc00092eda0, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318600)
        /usr/lib/golang/src/net/http/server.go:2042 +0x46
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000bcea80, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318400)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:210 +0xd3
net/http.serverHandler.ServeHTTP(0xc0001fa000, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318400)
        /usr/lib/golang/src/net/http/server.go:2843 +0xa5
net/http.(*conn).serve(0xc014baa140, 0x562e8f7ab580, 0xc00cb5e980)
        /usr/lib/golang/src/net/http/server.go:1925 +0x8ad
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:2969 +0x36c

The oldest of these semacquire goroutines was at 834 minutes, so we grabbed the other goroutines in the dump that had been blocked for that same time

.:/home/kevin_simons$ cat /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log | grep goroutine | grep -v semacquire | grep 834
goroutine 9233430 [sync.Cond.Wait, 834 minutes]:
goroutine 9233295 [sync.Cond.Wait, 834 minutes]:
goroutine 9233431 [sync.Cond.Wait, 834 minutes]:
goroutine 9233297 [select, 834 minutes]:
goroutine 9233465 [select, 834 minutes]:
goroutine 9233462 [select, 834 minutes]:
goroutine 9233464 [select, 834 minutes]:
goroutine 9233427 [select, 834 minutes]:
goroutine 9233432 [select, 834 minutes]:
goroutine 9233461 [syscall, 834 minutes]:
goroutine 9233463 [syscall, 834 minutes]:
goroutine 9233296 [sync.Cond.Wait, 834 minutes]:
goroutine 9233460 [select, 834 minutes]:
goroutine 9233408 [select, 834 minutes]:
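With a dump this size, sorting the goroutine headers by how long they've been blocked surfaces the interesting ones quickly. A sketch over a tiny made-up stand-in for the dump (point the grep at the real /var/run/docker/goroutine-stacks-*.log on a node):

```shell
# Minimal stand-in for a dockerd goroutine stack dump.
cat <<'EOF' > /tmp/goroutine-sample.log
goroutine 9403905 [semacquire, 528 minutes]:
sync.runtime_SemacquireMutex(0xc00103f204, 0x0, 0x1)
goroutine 9233461 [syscall, 834 minutes]:
syscall.Syscall6(0x101, 0xffffffffffffff9c, 0xc00277e140, 0x80000, 0x0, 0x0, 0x0, 0x2eb4037, 0x1, 0x11c0)
goroutine 9233430 [sync.Cond.Wait, 834 minutes]:
EOF

# Emit "<minutes> <wait reason> <goroutine id>", longest-blocked first.
grep '^goroutine .* minutes\]:' /tmp/goroutine-sample.log \
  | sed 's/^goroutine \([0-9]*\) \[\(.*\), \([0-9]*\) minutes\]:$/\3 \2 \1/' \
  | sort -rn
```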

The two goroutines in syscall show this

.:/home/kevin_simons$ cat /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log | grep -A 20 "syscall, 834"
goroutine 9233461 [syscall, 834 minutes]:
syscall.Syscall6(0x101, 0xffffffffffffff9c, 0xc00277e140, 0x80000, 0x0, 0x0, 0x0, 0x2eb4037, 0x1, 0x11c0)
        /usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
syscall.openat(0xffffffffffffff9c, 0xc00277e120, 0x11, 0x80000, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:68 +0xc8
syscall.Open(...)
        /usr/lib/golang/src/syscall/syscall_linux.go:152
os.openFileNolog(0xc00277e120, 0x11, 0x0, 0x0, 0x1723ad5a3241959c, 0xc002fc8fa0, 0x562e8d52c24a)
        /usr/lib/golang/src/os/file_unix.go:200 +0x91
os.OpenFile(0xc00277e120, 0x11, 0x0, 0x0, 0x0, 0xc00562fdf8, 0xc00562fce0)
        /usr/lib/golang/src/os/file.go:327 +0x65
github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo.func2(0x0, 0xc002300740, 0x0, 0xc003dc13e0, 0x562e8f7ab580, 0xc002300700)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:130 +0x371
created by github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:123 +0x265

--
goroutine 9233463 [syscall, 834 minutes]:
syscall.Syscall6(0x101, 0xffffffffffffff9c, 0xc000b4c6a0, 0x80000, 0x0, 0x0, 0x0, 0x2eb4038, 0x1, 0x11c0)
        /usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
syscall.openat(0xffffffffffffff9c, 0xc000b4c360, 0x11, 0x80000, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:68 +0xc8
syscall.Open(...)
        /usr/lib/golang/src/syscall/syscall_linux.go:152
os.openFileNolog(0xc000b4c360, 0x11, 0x0, 0xc000000000, 0xc002011410, 0xc00275ed80, 0xc00499bfb0)
        /usr/lib/golang/src/os/file_unix.go:200 +0x91
os.OpenFile(0xc000b4c360, 0x11, 0x0, 0x0, 0x0, 0x2000100000000, 0xc003043aa0)
        /usr/lib/golang/src/os/file.go:327 +0x65
github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo.func2(0x0, 0xc002300780, 0x0, 0xc003dc14a0, 0x562e8f7ab580, 0xc002300700)
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:130 +0x371
created by github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo
        /builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:123 +0x265

As far as I could tell, there was no way to display which file was actually being opened by that syscall.Open. This is where we discovered the kernel bump in our AMI, which eventually led us to this issue and the corresponding kernel version pin in the AWS repo: https://github.com/awslabs/amazon-eks-ami/blob/1e89a4483c5bd2c34fcdf89d625fbd56085d83cf/scripts/upgrade_kernel.sh#L16

In order to trigger this issue we've been executing a kubectl rollout restart ... of one of our daemonsets to tear down a pod on every node.

Hopefully some of this may be useful in tracking down the issue.

dcodix commented 1 year ago

@kevinsimons-wf I have been seeing similar issues and followed a similar debug process, but I have also seen the issue happen without any container in a bad state.

In all cases, though (maybe you can check this on your side), before we start seeing the PLEG errors we see a bunch of log lines containing the string fsHandler.go:119] failed to collect filesystem stats - rootDiskErr: could not stat . Right after those, the PLEG errors appear.

Finally (I have not been able to verify that this is the cause, but it would make sense), this error seems to happen within 10 minutes of a new network interface being added; see dmesg -T for lines such as:

[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: [1d0f:ec20] type 00 class 0x020000
[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: reg 0x10: [mem 0x00000000-0x00001fff]
[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: reg 0x14: [mem 0x00000000-0x00001fff]
[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: reg 0x18: [mem 0x00000000-0x000fffff pref]
[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: enabling Extended Tags
[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: BAR 2: assigned [mem 0xc0300000-0xc03fffff pref]
[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: BAR 0: assigned [mem 0xc0108000-0xc0109fff]
[Tue Nov  1 17:55:12 2022] pci 0000:00:08.0: BAR 1: assigned [mem 0xc010a000-0xc010bfff]
[Tue Nov  1 17:55:12 2022] ena 0000:00:08.0: enabling device (0000 -> 0002)
[Tue Nov  1 17:55:12 2022] ena 0000:00:08.0: ENA device version: 0.10
[Tue Nov  1 17:55:12 2022] ena 0000:00:08.0: ENA controller version: 0.0.1 implementation version 1
[Tue Nov  1 17:55:13 2022] ena 0000:00:08.0: Elastic Network Adapter (ENA) found at mem c0108000, mac addr 0a:09:cd:c5:0d:f5
[Tue Nov  1 17:55:16 2022] ena 0000:00:08.0 eth3: Local page cache is disabled for less than 16 channels

It could be that a network blip caused kubelet to fail reading the disk and enter this bad state. The same blip could also be causing other processes to enter a bad state (I actually had a bunch of aws-node subprocesses in Z state).
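Zombies like those aws-node subprocesses can be spotted with a small filter over `ps` output. A sketch, not specific to this incident:

```shell
#!/bin/sh
# List zombie (Z state) processes. Reads `ps -eo stat,pid,comm` output
# on stdin and prints the rows whose state column starts with Z.
list_zombies() {
  # NR > 1 skips the STAT/PID/COMMAND header row.
  awk 'NR > 1 && $1 ~ /^Z/'
}
```

Usage on a live node would be `ps -eo stat,pid,comm | list_zombies`; a steadily growing list of zombies parented to the same process is a hint that the parent is stuck.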

Same as you, hopefully this might help solve the issue...

mmerkes commented 1 year ago

Amazon Linux released kernel 5.4.219-126.411.amzn2 yesterday, which we expect to solve the issues described here. We've tested it against our reproduction of the issue, and it seems to work. To resolve this issue, you can do one of the following:

  1. If you build a custom AMI, build a new AMI with the latest commit (or at least exclude the commit pinning the kernel version) and you should get the latest kernel.
  2. Release v20221101 does not have the problematic kernel patch, so it should not have the same issue reported here. It's just running an older kernel version, so switching to that AMI is another option.
    • EKS is working on publishing new AMIs, which will be release version v20221103 or later. We will update this repo as soon as those AMIs are released. At the moment, it's looking like 11/7/22 is when they'll be available.
  3. Lastly, you can manually upgrade the kernel on your nodes. If you're using release v20221101, you'll need to unpin the kernel. I don't recommend this approach unless it's very important that you get on the latest kernel.
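Option 3 might look roughly like the sketch below. It is hedged and dry-run by default: the versionlock pin and package names are assumptions based on Amazon Linux 2 conventions and scripts/upgrade_kernel.sh, so verify them on your own nodes and set DRY_RUN=0 only when you mean it.

```shell
#!/bin/sh
# Hedged sketch: unpin and upgrade the kernel on a running node, then reboot.
FIXED_KERNEL="5.4.219-126.411.amzn2"

run() {
  # Echo each command instead of executing it unless DRY_RUN=0.
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run yum versionlock delete 'kernel*'          # drop the pin, if the AMI set one
run yum install -y "kernel-${FIXED_KERNEL}"   # pull the fixed kernel package
run reboot                                    # new kernel takes effect on boot
```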
mmerkes commented 1 year ago

Keeping open until we release v20221103 or later.

617m4rc commented 1 year ago

The problem seems so widespread that AWS should issue an official statement. Can we expect something like that? I learned about this issue only from a support ticket and after some production clusters went down already.

mmerkes commented 1 year ago

FYI, updated mitigation recommendation above to call out that v20221101 is running on an older kernel version that doesn't have the problem described here.

pie-r commented 1 year ago

The problem seems so widespread that AWS should issue an official statement. Can we expect something like that? I learned about this issue only from a support ticket and after some production clusters went down already.

Same here.....

zekena2 commented 1 year ago

Hi, I see that the patched kernel mentioned is available. @mmerkes When should the new ami release come out?

pie-r commented 1 year ago

Hi, I see that the patched kernel mentioned is available. @mmerkes When should the new ami release come out?

The new image is already out. Look at this comment: https://github.com/awslabs/amazon-eks-ami/issues/1071#issuecomment-1303776788

zekena2 commented 1 year ago

The latest AMI doesn't have this kernel 5.4.219-126.411.amzn2.

scoof commented 1 year ago

Are there any release notes for kernel 5.4.219-126.411.amzn2?

anish commented 1 year ago

5.4.219-126.411.amzn2 seems to be available with release v20221104

zekena2 commented 1 year ago

Is it written anywhere what the root cause was and what the fix was? I'm interested.

617m4rc commented 1 year ago

Doing a proper post-incident review should be standard procedure. I am negatively surprised that it is not taking place.

mmerkes commented 1 year ago

Sorry, I was OOTO, so wasn't able to update here. v20221104 was released with the latest kernel, 5.4.219-126.411.amzn2, which resolves this issue.

When issues like this happen, it is our standard practice to root-cause the issue, mitigate it, and review our processes to avoid similar issues in the future. In this case, that process is ongoing. We have developed a test that reproduces this particular issue and have added it to our AMI release process. As for more details on the root cause, I'll post here if Amazon Linux releases anything further. In a nutshell, a commit to the kernel was causing some workloads to hang, and it was reverted in the latest kernel.

chadlwilson commented 1 year ago

With a bit of guesswork, I suspect the issue referred to above was https://github.com/torvalds/linux/commit/72d9ce5b085a1e7374f4aedc0892bd3b909f25c9, introduced in 5.4.211 and reverted in 5.4.219?

alexkuretz commented 1 year ago

I updated our clusters Tuesday night and things seemed fine, but today I'm seeing nodes flapping into NotReady state again with PLEG errors, and pods left in a Terminating state: PLEG is not healthy: pleg was last seen active 8m16.90989123s ago; threshold is 3m0s

Kernel Version: 5.4.219-126.411.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.17
Kubelet Version: v1.22.12-eks-ba74326
Kube-Proxy Version: v1.22.12-eks-ba74326

tzneal commented 1 year ago

You could diagnose the previous issue by running ps a few times and seeing a runc init process that was hung. Do you see that on your node?
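That check could be scripted as a small filter over `ps` output. A sketch; the `[r]unc` bracket trick just keeps grep from matching its own command line when run against live `ps` output, and the idea of sampling several times to confirm the process is hung comes from the comment above.

```shell
#!/bin/sh
# Print any `runc init` processes found in ps output read from stdin.
# Feed it e.g. `ps -eo pid,stat,args`; a process that shows up across
# repeated samples is likely the hung one.
find_runc_init() {
  grep '[r]unc init' || true
}
```

On a live node: `for i in 1 2 3; do ps -eo pid,stat,args | find_runc_init; sleep 10; done` and compare the PIDs across samples.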

alexkuretz commented 1 year ago

Thanks for the quick reply. I didn't see those processes on my nodes. Earlier today I installed an efs-csi driver update that was quickly superseded by another release the same day; perhaps that was causing my issues. I spun up all new nodes in my clusters and things seem stable now.