Closed ravisinha0506 closed 1 year ago
@ravisinha0506 since you're tracking this, we're seeing the same issues on AMIs built on kernel-5.4.214-120.368.amzn2 as well.
https://alas.aws.amazon.com/AL2/ALASKERNEL-5.4-2022-036.html
Prior to this latest release we attempted to upgrade the kernel on top of the AMI Release v20220926. We experienced instability when doing this. So we decided to wait for this latest release with the kernel upgrades that patch the latest CVEs. Unfortunately we experienced the same issues. We are reasonably confident this is due to the kernel upgrade.
The issues that we are seeing now are mostly related to readiness probes using exec commands, which are failing at a much higher rate. It seems to be either a latency issue or some sort of race condition causing the failures. It is also noteworthy that we see these failures on high-volume pods/nodes.
Kernel version: 5.4.217-126.408.amzn2.x86_64
OS Image: Amazon Linux 2
Container runtime version: containerd://1.6.6
kubelet version: v1.22.12-eks-ba74326
kube-proxy version: v1.22.12-eks-ba74326
Operating system: linux
Architecture: amd64
@mschenk42 Seconding this - we experienced the same instability with v20220926 and the new kernel. We had to roll back.
We are also experiencing the same issue with v20221027.
We have been looking into this issue as well in our clusters and have been seeing individual containers on some nodes enter a state where no docker commands interacting with the container complete. As mentioned, this was first seen in the kernel-5.4.214-120.368.amzn2 version. I'll post some of our investigation here in case it is useful.
Node events show flapping between NodeReady and NodeNotReady
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeReady 3m11s (x15 over 4d20h) kubelet Node ip-10-32-145-43.ec2.internal status is now: NodeReady
Normal NodeNotReady 10s (x15 over 56m) kubelet Node ip-10-32-145-43.ec2.internal status is now: NodeNotReady
That NodeNotReady status seems to come from the Kubernetes PLEG (Pod Lifecycle Event Generator) health check failing:
KubeletNotReady PLEG is not healthy: pleg was last seen active 3m7.341793557s ago; threshold is 3m0s
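If you want to watch for this state on a node, one approach (a sketch; it assumes kubelet runs as a systemd unit named kubelet, as it does on the EKS AMIs) is to grep the kubelet journal for the PLEG health messages:

```shell
# Surface PLEG relapse/health messages from the kubelet journal.
journalctl -u kubelet --since "1 hour ago" \
  | grep -E 'PLEG is not healthy|pleg was last seen active'
```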
All docker commands interacting with the affected container hang, while commands against other containers complete just fine:
.:/home/kevin_simons$ docker inspect de3577202e88
^C
.:/home/kevin_simons$ docker kill de3577202e88
^C
.:/home/kevin_simons$ docker inspect 4c53f702cd55
[
{
"Id": "4c53f702cd556d79c4787c96dc8f133c80a5791909822b50e2f6470517985b85",
"Created": "2022-11-01T15:36:41.844751335Z",
"Path": "/init/init.sh",
...
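To find the hung container without probing each one by hand, a quick sketch is to bound every docker inspect with coreutils timeout; timeout exits with status 124 when it has to kill the command:

```shell
# Probe every running container with a bounded "docker inspect".
# IDs that time out (exit 124) are the hung ones.
for id in $(docker ps -q); do
  timeout 5 docker inspect "$id" > /dev/null 2>&1
  [ $? -eq 124 ] && echo "inspect hung for container $id"
done
```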
The affected container seems to have an additional runc process compared to other working containers:
.:/home/kevin_simons$ ps -aux | grep de3577202e88
root 8856 0.0 0.0 1174840 10412 ? Sl 11:24 0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/log.json --log-format json exec --process /tmp/runc-process1560347957 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8
root 19055 0.0 0.0 712716 10500 ? Sl Nov01 1:04 /usr/bin/containerd-shim-runc-v2 -namespace moby -id de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8 -address /run/containerd/containerd.sock
root 19747 0.0 0.0 119424 916 pts/0 S+ 19:11 0:00 grep --color=auto de3577202e88
.:/home/kevin_simons$ ps -aux | grep 6adcc2ab7a4d
root 19850 0.0 0.0 119424 960 pts/0 S+ 19:11 0:00 grep --color=auto 6adcc2ab7a4d
root 27816 0.0 0.0 712204 10424 ? Sl Nov01 0:05 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 6adcc2ab7a4dd25bfc93f96c440c5be97cd4afe82ffbddedd9217641a7d96a66 -address /run/containerd/containerd.sock
The strace of that runc process:
:/home/kevin_simons$ strace -s 800 -T -y -yy -f -p 8856
strace: Process 8856 attached with 7 threads
[pid 8868] futex(0xc000280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8864] futex(0x55ea78bd4558, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8863] futex(0xc000080148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8862] futex(0x55ea78ba5db8, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000009>
[pid 8861] futex(0x55ea78bd46a0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8862] read(9<pipe:[43673545]>, "", 4096) = 0 <0.000009>
[pid 8860] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 8862] epoll_ctl(4<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 9<pipe:[43673545]>, 0xc00003c654 <unfinished ...>
[pid 8860] <... restart_syscall resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
[pid 8862] <... epoll_ctl resumed> ) = 0 <0.000073>
[pid 8860] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 8862] close(9<pipe:[43673545]>) = 0 <0.000017>
[pid 8856] futex(0x55ea78ba47a8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8862] futex(0xc000280148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 8868] <... futex resumed> ) = 0 <0.000335>
[pid 8862] <... futex resumed> ) = 1 <0.000028>
[pid 8868] epoll_pwait(4<anon_inode:[eventpoll]>, <unfinished ...>
[pid 8860] <... nanosleep resumed> NULL) = 0 <0.000120>
[pid 8868] <... epoll_pwait resumed> [], 128, 0, NULL, 0) = 0 <0.000020>
[pid 8862] openat(AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid", O_RDWR|O_CREAT|O_EXCL|O_SYNC|O_CLOEXEC, 0666 <unfinished ...>
[pid 8868] epoll_pwait(4<anon_inode:[eventpoll]>, <unfinished ...>
[pid 8862] <... openat resumed> ) = 7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid> <0.000020>
[pid 8860] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 8862] epoll_ctl(4<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=678758216, u64=139793274308424}}) = -1 EPERM (Operation not permitted) <0.000011>
[pid 8862] write(7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid>, "8874", 4) = 4 <0.000015>
[pid 8860] <... nanosleep resumed> NULL) = 0 <0.000094>
[pid 8862] close(7</run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid> <unfinished ...>
[pid 8860] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 8862] <... close resumed> ) = 0 <0.000018>
[pid 8862] newfstatat(AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid", 0xc0000a6928, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory) <0.000012>
[pid 8862] renameat(AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/.4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid", AT_FDCWD, "/run/containerd/io.containerd.runtime.v2.task/moby/de3577202e889e348167c45b691e8b8401670f0307d6e947f44d2325752f7ed8/4c05038461d26b103f30fbff51626f9ab0a44e128d4575a23172318776df772e.pid" <unfinished ...>
[pid 8860] <... nanosleep resumed> NULL) = 0 <0.000082>
[pid 8862] <... renameat resumed> ) = 0 <0.000020>
[pid 8860] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 8862] exit_group(0 <unfinished ...>
[pid 8868] <... epoll_pwait resumed> [{0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, 
{u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|EPOLLEXCLUSIVE|0xb5fd800, {u32=32548, u64=8653847918727429924}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=180224, u64=824633901056}}, {EPOLLERR|EPOLLRDNORM|EPOLLWRNORM|0x80000, {u32=192, u64=554050781376}}, {EPOLLIN, {u32=147488, u64=824633868320}}, {0, {u32=0, u64=0}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=1, u64=94463510708225}}, {EPOLLRDBAND, {u32=16777216, u64=4311744512}}, {0, {u32=1, u64=1}}, {0, {u32=0, u64=0}}, {0, {u32=524616, u64=824634245448}}, {0, {u32=0, u64=4294967296}}, {0, {u32=1, u64=1}}, {EPOLLOUT, {u32=0, u64=17179869184}}, {0, {u32=2014802745, u64=94465525510969}}, {EPOLLNVAL|0x24000, {u32=192, u64=633438956683456}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=2014796893, u64=94465525505117}}, {EPOLLERR|EPOLLHUP|EPOLLNVAL|EPOLLRDNORM|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0x1905000, {u32=32548, u64=774056185986852}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|0x280000, {u32=192, u64=5877328870998278336}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=188416, u64=824633909248}}, 
{EPOLLERR|EPOLLNVAL|EPOLLWRBAND|EPOLLMSG|EPOLLEXCLUSIVE|0xb5fd800, {u32=32548, u64=8653867727116599076}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=16384, u64=16384}}, {EPOLLERR|EPOLLHUP|EPOLLNVAL|EPOLLRDNORM|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0x1905000, {u32=32548, u64=2915158371545939748}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|0x280000, {u32=192, u64=8652919497121857728}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=459267656, u64=139793054817864}}, {EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLWRBAND|EPOLLMSG|EPOLLEXCLUSIVE|0xb5fd800, {u32=32548, u64=8653873585451990820}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=8608, u64=824633729440}}, {EPOLLPRI, {u32=4, u64=11295970146910212}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=8608, u64=824633729440}}, {EPOLLNVAL|EPOLLWRBAND|EPOLLMSG|0x198000, {u32=192, u64=8654521438318887104}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=8608, u64=824633729440}}, {0, {u32=0, u64=45035996273704960}}, {0, {u32=0, u64=0}}, {EPOLLNVAL|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|0x280000, {u32=192, u64=11295970146910400}}, {EPOLLRDNORM|EPOLLRDBAND, {u32=2015037669, u64=94465525745893}}, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLWRNORM|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLWAKEUP|EPOLLONESHOT|0x853c800, {u32=21994, u64=2017589886915204586}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=2043835376, u64=94465554543600}}, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLWRBAND|EPOLLRDHUP|EPOLLEXCLUSIVE|0xbffc800, {u32=32548, u64=2017588847533129508}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=2630048, u64=824636350880}}, 
{EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLEXCLUSIVE|EPOLLWAKEUP|EPOLLONESHOT|0x81b0800, {u32=21994, u64=8670523657535116778}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=1365869643, u64=139793961419851}}, {0, {u32=0, u64=1972549148997582848}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=459269888, u64=139793054820096}}, {EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0x6124800, {u32=765780331, u64=2017588848298877291}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=469756511, u64=139793065306719}}, {EPOLLHUP|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLWRBAND|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLWAKEUP|EPOLLONESHOT|0x9d25000, {u32=21994, u64=2017589886915204586}}, {EPOLLOUT|EPOLLNVAL|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|0x5800, {u32=3893522394, u64=15270817745933070298}}, {EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|EPOLLONESHOT|0xe364800, {u32=3555476408, u64=3555476408}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=9331877341841850368}}, {EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLRDHUP|EPOLLET|0x7f49800, {u32=0, u64=0}}, {EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLRDHUP|EPOLLEXCLUSIVE|0xb5fc000, {u32=32548, u64=8778206098327895844}}, {EPOLLPRI|EPOLLERR|EPOLLNVAL|EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLMSG|0x5000, {u32=1360582031, u64=139793956132239}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, 
{0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}, {0, {u32=0, u64=0}}], 128, -1, NULL, 0) = 231 <0.000324>
[pid 8864] <... futex resumed>) = ?
[pid 8868] +++ exited with 0 +++
[pid 8863] <... futex resumed>) = ?
[pid 8864] +++ exited with 0 +++
[pid 8862] <... exit_group resumed>) = ?
[pid 8863] +++ exited with 0 +++
[pid 8861] <... futex resumed>) = ?
[pid 8862] +++ exited with 0 +++
[pid 8861] +++ exited with 0 +++
[pid 8860] <... nanosleep resumed> <unfinished ...>) = ?
[pid 8856] <... futex resumed>) = ?
[pid 8860] +++ exited with 0 +++
+++ exited with 0 +++
We then see that after running strace on that runc process, the container is automatically cleaned up and the node seems to go back to a Ready state:
.:/home/kevin_simons$ docker inspect de3577202e88
[]
Error: No such object: de3577202e88
At this point I hopped on another broken node since that one was cleaned up.
Running strace on the containerd shim process
.:/home/kevin_simons$ ps -aux | grep 3a2178477417
root 11587 0.0 0.0 712460 11564 ? Sl Nov01 0:44 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9 -address /run/containerd/containerd.sock
root 21193 0.0 0.0 1174584 10388 ? Sl 05:28 0:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9/log.json --log-format json exec --process /tmp/runc-process119335193 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9/3a7ac34beeea2c994a83dd8677b15b9c90045bf1d193f484b7ccf8d6106066e1.pid 3a2178477417e2346548c23dd78a085b16ce0b333dad1943ef87ed581f2c6af9
root 32682 0.0 0.0 119424 924 pts/0 S+ 19:18 0:00 grep --color=auto 3a2178477417
:/home/kevin_simons$ strace -s 800 -T -y -yy -f -p 11587
strace: Process 11587 attached with 12 threads
[pid 21281] epoll_pwait(5<anon_inode:[eventpoll]>, <unfinished ...>
[pid 11668] futex(0xc000080d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 21281] <... epoll_pwait resumed> [], 128, 0, NULL, 194962698) = 0 <0.000032>
[pid 21281] epoll_pwait(5<anon_inode:[eventpoll]>, <unfinished ...>
[pid 11597] futex(0xc000080948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11596] futex(0xd26e60, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11595] futex(0xcf8658, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000012>
[pid 11594] futex(0xc000080548, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11595] epoll_wait(9<anon_inode:[eventpoll]>, <unfinished ...>
[pid 11593] futex(0xd26d18, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11592] futex(0xc000080148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11591] futex(0xc000042948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11590] epoll_wait(10<anon_inode:[eventpoll]>, <unfinished ...>
[pid 11589] restart_syscall(<... resuming interrupted epoll_wait ...> <unfinished ...>
[pid 11587] futex(0xcf7a68, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11589] <... restart_syscall resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000021>
[pid 11589] futex(0xc000080d48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 11668] <... futex resumed> ) = 0 <0.000443>
[pid 11589] <... futex resumed> ) = 1 <0.000036>
[pid 11668] sched_yield( <unfinished ...>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 11668] <... sched_yield resumed> ) = 0 <0.000027>
[pid 11668] futex(0xcf6dc0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000014>
[pid 11668] futex(0xc000080d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 11589] <... nanosleep resumed> NULL) = 0 <0.000093>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000090>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000092>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000095>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
[pid 11589] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 <0.000091>
...
The trace then loops through those FUTEX_WAIT_PRIVATE and nanosleep calls indefinitely.
Generating a stack trace of dockerd, which produces an extremely large file:
.:/home/kevin_simons$ kill -SIGUSR1 $(pidof dockerd)
.:/home/kevin_simons$ journalctl -u docker.service -r
Nov 02 19:22:15 ip-10-32-134-101.ec2.internal dockerd[7078]: time="2022-11-02T19:22:15.973566811Z" level=info msg="goroutine stacks written to /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log"
...
.:/home/kevin_simons$ ls -al /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log
-rw-r--r-- 1 root root 10306382 Nov 2 19:22 /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log
Inside the stack trace we see what seem like thousands of goroutines similar to this one:
goroutine 9403905 [semacquire, 528 minutes]:
sync.runtime_SemacquireMutex(0xc00103f204, 0x0, 0x1)
/usr/lib/golang/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc00103f200)
/usr/lib/golang/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
/usr/lib/golang/src/sync/mutex.go:81
github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent(0xc00000c1e0, 0xc012072016, 0x40, 0x0, 0x1, 0xc019970018, 0xc0096c22e0)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/daemon/inspect.go:40 +0xa30
github.com/docker/docker/daemon.(*Daemon).ContainerInspect(0xc00000c1e0, 0xc012072016, 0x40, 0x0, 0xc012072006, 0x4, 0xc014d321b0, 0x562e8d0ca91d, 0x7f20140ff498, 0x8)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/daemon/inspect.go:29 +0x11b
github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName(0xc000c2f480, 0x562e8f7ab640, 0xc0127a6f30, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0xc00bb4c801, 0xc01227f4a0)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/router/container/inspect.go:15 +0x116
github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1(0x562e8f7ab640, 0xc0127a6f30, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0x562e8f7ab640, 0xc0127a6f30)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/middleware/experimental.go:26 +0x17d
github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1(0x562e8f7ab640, 0xc0127a6f00, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0x203002, 0x203002)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/middleware/version.go:62 +0x617
github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1(0x562e8f7ab640, 0xc0127a6f00, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318700, 0xc0127a6e70, 0x562e8f7ab640, 0xc0127a6f00)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/pkg/authorization/middleware.go:59 +0x822
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x562e8f79b680, 0xc00d82c2a0, 0xc01a318600)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/api/server/server.go:141 +0x258
net/http.HandlerFunc.ServeHTTP(0xc00092eda0, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318600)
/usr/lib/golang/src/net/http/server.go:2042 +0x46
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000bcea80, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318400)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:210 +0xd3
net/http.serverHandler.ServeHTTP(0xc0001fa000, 0x562e8f79b680, 0xc00d82c2a0, 0xc01a318400)
/usr/lib/golang/src/net/http/server.go:2843 +0xa5
net/http.(*conn).serve(0xc014baa140, 0x562e8f7ab580, 0xc00cb5e980)
/usr/lib/golang/src/net/http/server.go:1925 +0x8ad
created by net/http.(*Server).Serve
/usr/lib/golang/src/net/http/server.go:2969 +0x36c
The oldest of these semacquire goroutines was at 834 minutes, so we grabbed the other goroutines in that dump that were at that age:
.:/home/kevin_simons$ cat /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log | grep goroutine | grep -v semacquire | grep 834
goroutine 9233430 [sync.Cond.Wait, 834 minutes]:
goroutine 9233295 [sync.Cond.Wait, 834 minutes]:
goroutine 9233431 [sync.Cond.Wait, 834 minutes]:
goroutine 9233297 [select, 834 minutes]:
goroutine 9233465 [select, 834 minutes]:
goroutine 9233462 [select, 834 minutes]:
goroutine 9233464 [select, 834 minutes]:
goroutine 9233427 [select, 834 minutes]:
goroutine 9233432 [select, 834 minutes]:
goroutine 9233461 [syscall, 834 minutes]:
goroutine 9233463 [syscall, 834 minutes]:
goroutine 9233296 [sync.Cond.Wait, 834 minutes]:
goroutine 9233460 [select, 834 minutes]:
goroutine 9233408 [select, 834 minutes]:
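A quick way to summarize a dump like this is to count goroutines per blocking state. This is a generic sketch that assumes the "goroutine N [state, M minutes]:" header format shown above; the dump path is the one from the journal message earlier:

```shell
# Count goroutines in the dockerd stack dump by state, most common first.
grep '^goroutine ' /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log \
  | sed -E 's/^goroutine [0-9]+ \[([^],]+).*/\1/' \
  | sort | uniq -c | sort -rn
```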
The two syscall goroutines there show this:
.:/home/kevin_simons$ cat /var/run/docker/goroutine-stacks-2022-11-02T192215Z.log | grep -A 20 "syscall, 834"
goroutine 9233461 [syscall, 834 minutes]:
syscall.Syscall6(0x101, 0xffffffffffffff9c, 0xc00277e140, 0x80000, 0x0, 0x0, 0x0, 0x2eb4037, 0x1, 0x11c0)
/usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
syscall.openat(0xffffffffffffff9c, 0xc00277e120, 0x11, 0x80000, 0x0, 0x0, 0x0, 0x0)
/usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:68 +0xc8
syscall.Open(...)
/usr/lib/golang/src/syscall/syscall_linux.go:152
os.openFileNolog(0xc00277e120, 0x11, 0x0, 0x0, 0x1723ad5a3241959c, 0xc002fc8fa0, 0x562e8d52c24a)
/usr/lib/golang/src/os/file_unix.go:200 +0x91
os.OpenFile(0xc00277e120, 0x11, 0x0, 0x0, 0x0, 0xc00562fdf8, 0xc00562fce0)
/usr/lib/golang/src/os/file.go:327 +0x65
github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo.func2(0x0, 0xc002300740, 0x0, 0xc003dc13e0, 0x562e8f7ab580, 0xc002300700)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:130 +0x371
created by github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:123 +0x265
--
goroutine 9233463 [syscall, 834 minutes]:
syscall.Syscall6(0x101, 0xffffffffffffff9c, 0xc000b4c6a0, 0x80000, 0x0, 0x0, 0x0, 0x2eb4038, 0x1, 0x11c0)
/usr/lib/golang/src/syscall/asm_linux_amd64.s:41 +0x5
syscall.openat(0xffffffffffffff9c, 0xc000b4c360, 0x11, 0x80000, 0x0, 0x0, 0x0, 0x0)
/usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:68 +0xc8
syscall.Open(...)
/usr/lib/golang/src/syscall/syscall_linux.go:152
os.openFileNolog(0xc000b4c360, 0x11, 0x0, 0xc000000000, 0xc002011410, 0xc00275ed80, 0xc00499bfb0)
/usr/lib/golang/src/os/file_unix.go:200 +0x91
os.OpenFile(0xc000b4c360, 0x11, 0x0, 0x0, 0x0, 0x2000100000000, 0xc003043aa0)
/usr/lib/golang/src/os/file.go:327 +0x65
github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo.func2(0x0, 0xc002300780, 0x0, 0xc003dc14a0, 0x562e8f7ab580, 0xc002300700)
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:130 +0x371
created by github.com/docker/docker/vendor/github.com/containerd/fifo.openFifo
/builddir/build/BUILD/docker-20.10.7-5.amzn2/src/github.com/docker/docker/vendor/github.com/containerd/fifo/fifo.go:123 +0x265
As far as I could tell, there was no way to display which file was actually being opened by that syscall.Open. This is where we discovered the kernel bump in our AMI, which eventually led us to this issue and the corresponding line in the aws repo pinning the kernel version: https://github.com/awslabs/amazon-eks-ami/blob/1e89a4483c5bd2c34fcdf89d625fbd56085d83cf/scripts/upgrade_kernel.sh#L16 .
In order to trigger this issue we've been executing a kubectl rollout restart ... of one of our daemonsets to tear down a pod on every node.
Hopefully some of this may be useful in tracking down the issue.
@kevinsimons-wf I have been seeing similar issues and followed a similar debug process, but I have also seen the issue happen without any container in a bad state.
In all cases though (maybe you can check this), before we start seeing the PLEG error we see a bunch of log lines containing the string fsHandler.go:119] failed to collect filesystem stats - rootDiskErr: could not stat. Right after those, the PLEG error appears.
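To check that ordering on a node, one sketch is to pull both signals out of the kubelet journal with timestamps and eyeball whether the fsHandler stat failures consistently precede the PLEG error (assumes a systemd-managed kubelet):

```shell
# Pull both signals with ISO timestamps so the ordering is visible.
journalctl -u kubelet -o short-iso \
  | grep -E 'failed to collect filesystem stats|PLEG is not healthy'
```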
Finally (I have not been able to verify it is the cause, but it could make sense), this error seems to happen within 10 minutes of a new network interface being added; see dmesg -T for lines such as:
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: [1d0f:ec20] type 00 class 0x020000
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: reg 0x10: [mem 0x00000000-0x00001fff]
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: reg 0x14: [mem 0x00000000-0x00001fff]
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: reg 0x18: [mem 0x00000000-0x000fffff pref]
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: enabling Extended Tags
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: BAR 2: assigned [mem 0xc0300000-0xc03fffff pref]
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: BAR 0: assigned [mem 0xc0108000-0xc0109fff]
[Tue Nov 1 17:55:12 2022] pci 0000:00:08.0: BAR 1: assigned [mem 0xc010a000-0xc010bfff]
[Tue Nov 1 17:55:12 2022] ena 0000:00:08.0: enabling device (0000 -> 0002)
[Tue Nov 1 17:55:12 2022] ena 0000:00:08.0: ENA device version: 0.10
[Tue Nov 1 17:55:12 2022] ena 0000:00:08.0: ENA controller version: 0.0.1 implementation version 1
[Tue Nov 1 17:55:13 2022] ena 0000:00:08.0: Elastic Network Adapter (ENA) found at mem c0108000, mac addr 0a:09:cd:c5:0d:f5
[Tue Nov 1 17:55:16 2022] ena 0000:00:08.0 eth3: Local page cache is disabled for less than 16 channels
It could be that there was a blip on the network that caused kubelet to fail reading the disk and enter this bad state. At the same time, the same blip could be causing other processes to enter a bad state (I actually had a bunch of aws-node subprocesses in the Z state).
Same as you, hopefully this might help solve the issue...
AmazonLinux released kernel 5.4.219-126.411.amzn2 yesterday, which we expect to solve the issues described here. We've tested it with our reproduction of the issue, and it seems to work. To solve this issue, you can do one of the following:
- Switch to v20221101, which does not have the problematic kernel patch, so it should not have the same issue reported here; it's just running an older kernel version.
- Wait for v20221103 or later. We will update this repo as soon as those AMIs are released. At the moment, it's looking like 11/7/22 is when they'll be available.
- If you want the new kernel on v20221101, you'll need to unpin the kernel. I don't recommend this approach unless it's very important that you get on the latest kernel.

Keeping this issue open until we release v20221103 or later.
The problem seems so widespread that AWS should issue an official statement. Can we expect something like that? I learned about this issue only from a support ticket and after some production clusters went down already.
FYI, updated the mitigation recommendation above to call out that v20221101 is running on an older kernel version that doesn't have the problem described here.
> The problem seems so widespread that AWS should issue an official statement. Can we expect something like that? I learned about this issue only from a support ticket and after some production clusters went down already.

Same here.....
Hi, I see that the patched kernel mentioned is available. @mmerkes When should the new ami release come out?
> Hi, I see that the patched kernel mentioned is available. @mmerkes When should the new ami release come out?
The new image is already out. Look at this comment: https://github.com/awslabs/amazon-eks-ami/issues/1071#issuecomment-1303776788
The latest AMI doesn't have this kernel 5.4.219-126.411.amzn2.
Are there any release notes for kernel 5.4.219-126.411.amzn2?
Kernel 5.4.219-126.411.amzn2 seems to be available with release v20221104.
Is it written anywhere what was the root cause and what was the fix? I'm interested.
Doing a proper post-incident review should be a standard procedure. I am negatively surprised that it is not taking place.
Sorry, I was OOTO, so I wasn't able to update here. v20221104 was released with the latest kernel, 5.4.219-126.411.amzn2, which resolves this issue.
When issues like this happen, it is our standard practice to root cause the issue, mitigate and review our processes to avoid similar issues in the future. In this case, the process is ongoing. We have developed a test that reproduces this particular issue and have added that as part of our AMI release process. As for more details on the root cause, I'll post here if AmazonLinux has released anything with more details. In a nutshell, a commit on the kernel was causing some workloads to hang, which was reverted on the latest kernel.
With a bit of guesswork, I suspect the issue referred to above was https://github.com/torvalds/linux/commit/72d9ce5b085a1e7374f4aedc0892bd3b909f25c9 from 5.4.211, which was reverted in 5.4.219?
I updated our clusters Tue night and things seemed fine, but today I'm seeing nodes flapping with NotReady state again with PLEG errors, and pods left in a Terminating state:
PLEG is not healthy: pleg was last seen active 8m16.90989123s ago; threshold is 3m0s
Kernel Version: 5.4.219-126.411.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.17
Kubelet Version: v1.22.12-eks-ba74326
Kube-Proxy Version: v1.22.12-eks-ba74326
You could diagnose the previous issue by running ps a few times and seeing a runc init process that was hung. Do you see that on your node?
Thanks for the quick reply. I didn't see those processes on my nodes. Today I installed an efs-csi driver update that was quickly superseded by a new release the same day; perhaps that was causing my issues. I spun up all new nodes in my clusters and things seem stable now.
What happened: We noticed an increased rate of PLEG errors while running our workload with v20221027 AMI release. The same issue doesn't appear with v20220926.
What you expected to happen: PLEG errors shouldn't appear with the latest AMI.
How to reproduce it (as minimally and precisely as possible): This issue can be easily reproduced with high CPU/memory workloads
Anything else we need to know?: So far it looks like an issue associated with the latest kernel version 5.4.217-126.408.amzn2.

Environment:
- aws eks describe-cluster --name <name> --query cluster.platformVersion:
- aws eks describe-cluster --name <name> --query cluster.version:
- uname -a:
- cat /etc/eks/release (on a node):