awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

bug(containerd): DeadlineExceeded sometimes happens #2069

Closed wolfdate25 closed 5 days ago

wolfdate25 commented 5 days ago

What happened: We discovered that the cronjob stops working while its pod is still in a running state. The kubelet generated the following logs:

Nov 20 06:50:36 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:50:36.124178    3828 remote_runtime.go:366] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="60d9681fc1f51cca1dd96c6694b145587c0a0ebd1faea39eed9eb209634bba9e"
Nov 20 06:50:36 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:50:36.124233    3828 kuberuntime_container.go:784] "Container termination failed with gracePeriod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="hbt/hbt-cronjob-x-28867922-nh7sp" podUID="1db2521d-ce5b-4cf9-94c8-b6cdf7450e4c" containerName="x-test" containerID="containerd://60d9681fc1f51cca1dd96c6694b145587c0a0ebd1faea39eed9eb209634bba9e" gracePeriod=30
Nov 20 06:50:36 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:50:36.124253    3828 kuberuntime_container.go:822] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="hbt/hbt-cronjob-x-28867922-nh7sp" podUID="1db2521d-ce5b-4cf9-94c8-b6cdf7450e4c" containerName="x-test" containerID={"Type":"containerd","ID":"60d9681fc1f51cca1dd96c6694b145587c0a0ebd1faea39eed9eb209634bba9e"}

Nov 20 06:51:22 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:51:22.265655    3828 remote_runtime.go:366] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="bf0bf19f3e0dde9cbb488a5a0badc5815c95cac630aa4050e664339e1e1be263"
Nov 20 06:51:22 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:51:22.265720    3828 kuberuntime_container.go:784] "Container termination failed with gracePeriod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="hbt/hbt-cronjob-x-28867982-pnbpf" podUID="78c53ed8-d7bf-4f1a-a0b6-4a22cbb0623b" containerName="x-test" containerID="containerd://bf0bf19f3e0dde9cbb488a5a0badc5815c95cac630aa4050e664339e1e1be263" gracePeriod=30
Nov 20 06:51:22 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:51:22.265745    3828 kuberuntime_container.go:822] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="hbt/hbt-cronjob-x-28867982-pnbpf" podUID="78c53ed8-d7bf-4f1a-a0b6-4a22cbb0623b" containerName="x-test" containerID={"Type":"containerd","ID":"bf0bf19f3e0dde9cbb488a5a0badc5815c95cac630aa4050e664339e1e1be263"}
Nov 20 06:51:22 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:51:22.891077    3828 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="fa8984ad470a4a35388386e60c3e4dc250761b61f25bc3a0ddced413e677f264"
Nov 20 06:51:22 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:51:22.891134    3828 kuberuntime_manager.go:1389] "Failed to stop sandbox" podSandboxID={"Type":"containerd","ID":"fa8984ad470a4a35388386e60c3e4dc250761b61f25bc3a0ddced413e677f264"}
Nov 20 06:51:22 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:51:22.891189    3828 kubelet.go:2058] [failed to "KillContainer" for "node" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "16bdee7d-159d-4344-b70c-d6cdd133520d" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
Nov 20 06:51:22 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: E1120 06:51:22.891202    3828 pod_workers.go:1298] "Error syncing pod, skipping" err="[failed to \"KillContainer\" for \"node\" with KillContainerError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\", failed to \"KillPodSandbox\" for \"16bdee7d-159d-4344-b70c-d6cdd133520d\" with KillPodSandboxError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"]" pod="jenkins/node" podUID="16bdee7d-159d-4344-b70c-d6cdd133520d"
Nov 20 06:51:23 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: I1120 06:51:23.740012    3828 kuberuntime_container.go:779] "Killing container with a grace period" pod="jenkins/node" podUID="16bdee7d-159d-4344-b70c-d6cdd133520d" containerName="node" containerID="containerd://f7067ff850c7940e3c0c963e506810966ada83c1ed4b86760b421b5136734df7" gracePeriod=30
Nov 20 06:51:23 ip-10-130-21-94.ap-northeast-2.compute.internal kubelet[3828]: I1120 06:51:23.743690    3828 status_manager.go:863] "Pod was deleted and then recreated, skipping status update" pod="jenkins/node" oldPodUID="16bdee7d-159d-4344-b70c-d6cdd133520d" podUID="32f2e071-f648-46e7-b00c-ff2b1fc9258f"

When attempting to remove the container using crictl, the following logs were generated by containerd:

Nov 21 01:53:00 ip-10-130-3-113.ap-northeast-2.compute.internal containerd[1111853]: time="2024-11-21T01:53:00.777374390Z" level=info msg="Kill container \"298d33cc227f7cfe87259d109e904d026120e4c74332ed00c80e08648cc050d3\""
Nov 21 01:53:02 ip-10-130-3-113.ap-northeast-2.compute.internal containerd[1111853]: time="2024-11-21T01:53:02.777460445Z" level=error msg="StopContainer for \"298d33cc227f7\" failed" error="rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"298d33cc227f7cfe87259d109e904d026120e4c74332ed00c80e08648cc050d3\" to be killed: wait container \"298d33cc227f7cfe87259d109e904d026120e4c74332ed00c80e08648cc050d3\": context deadline exceeded"
Nov 21 01:53:04 ip-10-130-3-113.ap-northeast-2.compute.internal containerd[1111853]: time="2024-11-21T01:53:04.284946616Z" level=error msg="StopPodSandbox for \"f793baa40a74beedd902140d007d16bab953a0c4ae1c8005f9216481f97db1df\" failed" error="rpc error: code = DeadlineExceeded desc = failed to stop container \"a30c5fecddce111da50a3cd3689d6b08f6e6d7e33f2428bebf5a64fbf3d0f22f\": an error occurs during waiting for container \"a30c5fecddce111da50a3cd3689d6b08f6e6d7e33f2428bebf5a64fbf3d0f22f\" to be killed: wait container \"a30c5fecddce111da50a3cd3689d6b08f6e6d7e33f2428bebf5a64fbf3d0f22f\": context deadline exceeded"
Nov 21 01:53:04 ip-10-130-3-113.ap-northeast-2.compute.internal containerd[1111853]: time="2024-11-21T01:53:04.311928723Z" level=info msg="StopContainer for \"a30c5fecddce111da50a3cd3689d6b08f6e6d7e33f2428bebf5a64fbf3d0f22f\" with timeout 180 (s)"
Nov 21 01:53:04 ip-10-130-3-113.ap-northeast-2.compute.internal containerd[1111853]: time="2024-11-21T01:53:04.312501683Z" level=info msg="Skipping the sending of signal terminated to container \"a30c5fecddce111da50a3cd3689d6b08f6e6d7e33f2428bebf5a64fbf3d0f22f\" because a prior stop with timeout>0 request already sent the signal"

What you expected to happen: Containers should be terminated and created normally without interrupting the cronjob's execution.

How to reproduce it (as minimally and precisely as possible):

1. Set up a Kubernetes cluster with containerd version 1.7.22 or 1.7.23.
2. Deploy a cronjob and wait a few hours.
3. Observe the container termination and creation processes (cronjob lifecycle).
4. Look for DeadlineExceeded errors in the kubelet and containerd logs.

Environment:

- AWS Region: ap-northeast-2
- Instance Type(s): m7i-flex
- Cluster Kubernetes version: 1.30
- Node Kubernetes version: v1.30.6-eks-94953ac
- AMI Version: 1.30.6-20241115
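For anyone trying to confirm the same symptom, the last reproduction step (looking for DeadlineExceeded errors) could be partly automated with a small script along these lines. This is only a rough sketch: the function name `count_deadline_exceeded` and the `<unknown pod>` label are made up for illustration, and it assumes the kubelet log lines carry a `pod="namespace/name"` field as in the logs pasted above.

```python
import re
import sys
from collections import Counter

# Matches the pod="namespace/name" field seen in the kubelet log lines above.
# It does not match podUID="..." or oldPodUID="...", which are different keys.
POD_RE = re.compile(r'pod="([^"]+)"')


def count_deadline_exceeded(lines):
    """Count log lines that mention DeadlineExceeded, keyed by pod name.

    Lines without a pod="..." field (for example the StopContainer and
    StopPodSandbox lines, which only carry IDs) are grouped under the
    placeholder key "<unknown pod>".
    """
    counts = Counter()
    for line in lines:
        if "DeadlineExceeded" not in line:
            continue
        match = POD_RE.search(line)
        counts[match.group(1) if match else "<unknown pod>"] += 1
    return counts


if __name__ == "__main__":
    # Reads log lines from stdin and prints pods, most affected first.
    for pod, n in count_deadline_exceeded(sys.stdin).most_common():
        print(f"{n:5d}  {pod}")
```

If saved as, say, `scan_deadline.py` (a hypothetical filename), it could be run on the node with `journalctl -u kubelet --no-pager | python3 scan_deadline.py`. It only counts matching lines per pod and does not say anything about why the runtime timed out.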

cartermckinnon commented 5 days ago

Container termination failed with gracePeriod

That looks like an issue with your specific pod, please open a case with AWS support 👍

wolfdate25 commented 4 days ago

@cartermckinnon https://github.com/awslabs/amazon-eks-ami/issues/2070#issue-2678495279