cri-o / cri-o

Open Container Initiative-based implementation of Kubernetes Container Runtime Interface
https://cri-o.io
Apache License 2.0

crio jams up kubelet on restart if container takes a long time to terminate #8748

Closed · olyazavr closed 6 days ago

olyazavr commented 1 week ago

What happened?

When a container takes a long time to terminate and kubelet is restarted, kubelet hangs until the container terminates, which makes the whole node unready. I tracked this down to when cadvisor is started, and then to somewhere in newCrioContainerHandler, probably where it gets info about the container.

In our setup, we have a certain workload that consists of a pod with two containers, and one of those eats sigterm and just waits for a long time to shut down. That container is the one that causes this whole mess.

Kubelet will just keep logging:

kubelet.service[2043933]: E1108 18:54:46.297244 2043933 kubelet.go:2383] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
kubelet.service[2043933]: I1108 18:54:46.320642 2043933 kuberuntime_manager.go:447] "Retrieved pods from runtime" all=true

As for logs, here's the last log that makes it in before the long wait for the container to shut down:

Nov 07 19:53:23 ip-172-16-111-198 kubelet.service[1162146]: I1107 19:53:23.354625 1162146 factory.go:275] Using factory "crio" for container "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d269700_ddd7_4c51_ba9e_40c014193013.slice/crio-0b74fdbd9af1713bb6abf1bfce22715675e13dd70b3e9c0fa2c312b63400186d"

and after the container shuts down, we see:

Nov 07 19:57:16 ip-172-16-111-198 kubelet.service[1162146]: I1107 19:57:16.559026 1162146 manager.go:981] Added container: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0d269700_ddd7_4c51_ba9e_40c014193013.slice/crio-0b74fdbd9af1713bb6abf1bfce22715675e13dd70b3e9c0fa2c312b63400186d" (aliases: [k8s_POD_session-agent-6fdcdc7968-rvp7q_browsers-lighthouse_0d269700-ddd7-4c51-ba9e-40c014193013_0 0b74fdbd9af1713bb6abf1bfce22715675e13dd70b3e9c0fa2c312b63400186d], namespace: "crio")

which supports the theory that it's cadvisor hanging on getting the container info from crio.

Interestingly, crictl inspect works just fine

I got a goroutine dump from crio while it's hanging after the kubelet restart: here

What did you expect to happen?

Crio should not hang and kubelet should be able to start up

How can we reproduce it (as minimally and precisely as possible)?

Create a pod with two containers, one of which eats sigterm:

apiVersion: v1
kind: Pod
metadata:
  annotations:
  generateName: test-pod-
spec:
  terminationGracePeriodSeconds: 86300
  containers:
  - name: fast-shutdown
    image: docker.io/library/ubuntu:22.04  # image not in the original report; any image with bash works
    command: ["bash", "-c", "sleep infinity"]
  - name: never-shutdown
    image: docker.io/library/ubuntu:22.04  # image not in the original report; any image with bash works
    command: ["bash", "-c", "trap 'sleep 100000' SIGTERM; sleep infinity"]

Delete the pod, and then restart kubelet on the node
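For clarity, the repro boils down to something like the following. This is a sketch: the pod name suffix comes from generateName, and the kubelet is assumed to be managed by systemd.

```console
$ kubectl create -f test-pod.yaml        # the spec above
$ kubectl delete pod test-pod-<suffix>   # never-shutdown traps SIGTERM and keeps sleeping
# on the node that was running the pod:
$ sudo systemctl restart kubelet         # kubelet now stays unready until the container finally exits
```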

Anything else we need to know?

No response

CRI-O and Kubernetes version

```console
$ crio --version
crio version 1.29.4
Version:         1.29.4
GitCommit:       51ea93e0b9af5ad2cfa7f8071ec48d99bf39a3ec
GitCommitDate:   2024-04-30T06:05:39Z
GitTreeState:    clean
BuildDate:       1970-01-01T00:00:00Z
GoVersion:       go1.21.7
Compiler:        gc
Platform:        linux/amd64
Linkmode:        static
BuildTags:       static netgo osusergo exclude_graphdriver_btrfs exclude_graphdriver_devicemapper seccomp apparmor selinux
LDFlags:         unknown
SeccompEnabled:  true
AppArmorEnabled: false
```

```console
$ kubectl version --output=json
{
  "clientVersion": {
    "major": "1",
    "minor": "29",
    "gitVersion": "v1.29.10",
    "gitCommit": "f0c1ea863533246b6d3fda3e6addb7c13c8a6359",
    "gitTreeState": "archive",
    "buildDate": "2024-11-04T18:09:18Z",
    "goVersion": "go1.22.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v5.0.4-0.20230601165947-6ce0bf390ce3"
}
```

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="AlmaLinux"
VERSION="9.3 (Shamrock Pampas Cat)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.3 (Shamrock Pampas Cat)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"

$ uname -a
Linux ip-172-18-59-83 6.1.109-hs83.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Sep 24 17:33:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```

Additional environment details (AWS, VirtualBox, physical, etc.)

kwilczynski commented 1 week ago

@olyazavr, what if you lower the terminationGracePeriodSeconds property to 5 seconds?

When you say "jams up" kubelet, what does that mean? You cannot start new pods while the second container is running its sleep command?

You can run CRI-O with the debug log enabled and collect some logs from around the time when the pod is to be deleted. Also set kubelet log verbosity high enough (perhaps 8?) to collect its debug logs, too.
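For reference, a rough sketch of how to collect those logs, assuming a standard systemd-based install (flag and unit names may differ on your setup):

```console
# CRI-O: raise the log level and follow the journal while deleting the pod
$ sudo crio --log-level=debug       # or set log_level = "debug" in /etc/crio/crio.conf and restart crio
$ sudo journalctl -u crio -f

# kubelet: add -v=8 to its arguments (e.g. via a systemd drop-in), restart it, then follow its logs
$ sudo journalctl -u kubelet -f
```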

olyazavr commented 1 week ago

@kwilczynski we're working on lowering terminationGracePeriodSeconds, but this workload is somewhat out of our hands.

When I say jams up, I mean that kubelet remains unready, stuck at this line, logging:

kubelet.service[2043933]: E1108 18:54:46.297244 2043933 kubelet.go:2383] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
kubelet.service[2043933]: I1108 18:54:46.320642 2043933 kuberuntime_manager.go:447] "Retrieved pods from runtime" all=true

over and over again. Pods cannot be interacted with (cannot be exec'd into, can't get logs, etc)

Yeah let me try those things. I ran crio with debug and nothing interesting came up (while stuck it logged about ListContainers/ListSandboxes requests/responses, but I can take a deeper look)

olyazavr commented 1 week ago

What's interesting is that crio 1.27.0 does not do this, but crio 1.29.4 does

haircommander commented 1 week ago

It also could be helpful to gather the goroutines (crio status goroutines) while it's happening, to see where CRI-O is stuck. I'm guessing we're using a lock somewhere we formerly weren't.
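A minimal sketch of collecting that dump, using the crio status goroutines subcommand mentioned above (run it on the affected node while the hang is in progress):

```console
# run on the affected node while kubelet is stuck
$ sudo crio status goroutines
```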

olyazavr commented 1 week ago

ok so I noticed that upon kubelet restart, when crio is about to hang:

Nov 08 21:04:13 ip-172-25-51-239 crio[3687019]: time="2024-11-08 21:04:13.413601917Z" level=debug msg="Unmounted container 157be0669fe06e94ced8d5a0f7d10885beb961ca7026b1f1bcb9a049753529d9" file="storage/runtime.go:495" id=e6e47387-ad69-45a8-be9e-a3cf2dfe7a15 name=/runtime.v1.RuntimeService/StopContainer

whereas, in 1.27 (no hanging):

Nov 08 20:58:57 ip-172-18-56-53 crio[3891941]: time="2024-11-08 20:58:57.924068275Z" level=debug msg="Request: &ContainerStatusRequest{...

Nov 08 20:58:57 ip-172-18-56-53 crio[3891941]: time="2024-11-08 20:58:57.924305233Z" level=debug msg="Request: &ContainerStatusRequest{..

Nov 08 20:58:57 ip-172-18-56-53 crio[3891941]: time="2024-11-08 20:58:57.924571462Z" level=debug msg="Response: &ContainerStatusResponse{...

Nov 08 20:58:58 ip-172-18-56-53 crio[3891941]: time="2024-11-08 20:58:58.617516447Z" level=warning msg="Stopping container 82edf0c5d6897a86c6cdfed89c29ce1b5f4ac150a43c30c831cd8966afa8aa66 with stop signal timed out: context canceled" file="oci/runtime_oci.go:913" id=8c56b682-87f6-4b1e-8479-fe5306888174 name=/runtime.v1.RuntimeService/StopContainer

olyazavr commented 1 week ago

goroutines here: https://gist.github.com/olyazavr/789c401f9bb8d1b29e11e8c72522a61f

haircommander commented 1 week ago

Are you able to try out a newer 1.29? There have been some changes to the stop code and I'm wondering if they've helped.

olyazavr commented 1 week ago

This is a goroutine dump from before the kubelet restart, while the workload that typically gets stuck is Terminating (not the same instance of the workload as the goroutine dump above, unfortunately): https://gist.github.com/olyazavr/2f682621112d61314f96c0df941a3273

olyazavr commented 1 week ago

@haircommander yes, let me try that out

olyazavr commented 1 week ago

No dice, I tried 1.29.10 and it still has the same problem. I should be able to try out 1.30/1.31 in the coming week

1.29.10 dump: https://gist.github.com/olyazavr/9919c73ff840add43159238378c78da4

it has that same line after restarting kubelet:

Nov 09 16:06:32 ip-172-18-48-187 crio[15878]: time="2024-11-09 16:06:32.192621024Z" level=debug msg="Unmounted container 58c02065df5f5fe8ccb803a1bcbafa9cc669f38b929fe6a61c46466999ba3c9f" file="storage/runtime.go:495" id=eee1246f-b7b7-485b-83d4-803d5d5c0b58 name=/runtime.v1.RuntimeService/StopContainer

kwilczynski commented 1 week ago

@olyazavr, can you try a more recent version of CRI-O?

I am unable to reproduce this issue with CRI-O built from the master branch. Releases from 1.29 to 1.31 also work fine for me: there is no "hang" of kubelet, and it simply continues to allow me to schedule new pods, etc.

Slightly different reproducer:

---
apiVersion: v1
kind: Pod
metadata:
  name: test-stop
  namespace: default
spec:
  containers:
    - name: quick-stop
      image: docker.io/library/ubuntu:22.04
      command:
        - /bin/sh
        - -c
        - |
          echo "Starting!"
          _term () {
              echo "Caught SIGTERM!"
              exit 0
          }
          trap _term TERM
          while true ; do date ; sleep 1 ; done
    - name: slow-stop
      image: docker.io/library/ubuntu:22.04
      command:
        - /bin/sh
        - -c
        - |
          echo "Starting!"
          _term () {
              echo "Caught SIGTERM!"
          }
          trap _term TERM
          while true ; do date ; sleep 1 ; done
  restartPolicy: Always
  terminationGracePeriodSeconds: 3600

I also have no issues using the template you provided as a reproducer.

The kubelet versions I tested were also from 1.29 to 1.31 from the official package repositories.

What runtime are you using? crun? runc? Also, what versions of the container runtime?
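A quick, hedged way to gather that information on the node (assuming the binaries are on the PATH; the config keys greped for are the usual crio.conf ones):

```console
$ crio config 2>/dev/null | grep -E 'default_runtime|runtime_path'   # which runtime CRI-O is configured to use
$ crun --version     # if crun is the configured runtime
$ runc --version     # if runc is
$ conmon --version
```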

Would it be possible for you to collect pprof from CRI-O and kubelet when the issue is occurring for you? Aside from collecting goroutine dumps.
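For what it's worth, a sketch of how those profiles could be collected. This assumes CRI-O was started with its pprof endpoint enabled (e.g. the --profile flag, default port 6060) and that the kubelet's debug handlers are enabled; exact flags and ports may differ per install:

```console
# CRI-O goroutine profile (full stacks)
$ curl -s http://127.0.0.1:6060/debug/pprof/goroutine?debug=2 > crio-pprof-goroutines.txt

# kubelet goroutine profile (the secure port requires credentials)
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    https://127.0.0.1:10250/debug/pprof/goroutine?debug=2 > kubelet-pprof-goroutines.txt
```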

There is something you could do, too, if you have the time: bisect the code base to see which commit is potentially causing the issues, especially since you mentioned that CRI-O 1.27 does not have this behaviour. So, a bisect from 1.27 to 1.28, perhaps? Might be quite a job, given the number of commits, etc.
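A rough outline of what that bisect could look like (the tag names and the make binaries build target are assumptions about the CRI-O repository, not verified instructions):

```console
$ git clone https://github.com/cri-o/cri-o && cd cri-o
$ git bisect start v1.28.0 v1.27.0   # bad first, then good
$ make binaries                      # build and install this revision, then re-run the reproducer
$ git bisect good                    # or `git bisect bad`, repeating until the offending commit is found
```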

Having said all that, is there anything custom or special about your CRI-O configuration or your runtime environment?

olyazavr commented 1 week ago

So when you restart kubelet, does it kill the terminating pod or does it let it finish terminating? (For me, in 1.27, it would just kill the terminating container/pod without letting it finish out its grace period, which is fine, but just verifying that I'm seeing the same behavior here)

kwilczynski commented 1 week ago

@olyazavr, no issues here. I can restart the kubelet process without any issues; nothing "hangs", and nothing kills pods or containers when this happens.

Tested with:

INFO[2024-11-13 13:21:15.598975778Z] Starting CRI-O, version: 1.27.8, git: 7597c4329bb447b23d2d150a903e6148fb99797e(clean) 

olyazavr commented 1 week ago

We have a wrapper around crun that does a few things, like killing any exec commands on container termination (for this bug: https://github.com/cri-o/cri-o/issues/6699). It doesn't seem like it does very much, but I'll dig deeper here.

kwilczynski commented 1 week ago

Try running things vanilla, without any wrappers, especially when troubleshooting issues like this one, to make sure nothing changes the behaviour of the components involved (there is conmon and crun here, too). The issue you linked to might have been resolved by updates that a more modern version of CRI-O enjoys, hopefully. :smile:

olyazavr commented 1 week ago

I just tried 1.30 and it all works, no hanging! I also tried 1.29 without any of our custom logic, but it still hung, so I'm not sure what else is in play there. Another interesting thing: in 1.30 I see the "right" behavior on kubelet restart: the terminating pod is allowed to continue terminating, whereas in 1.27 the terminating pod was killed immediately after a kubelet restart.

One thing that may be important is that we run with cadvisor (kubelet was hanging at StartCadvisor).

I'm a little nervous that there is still something wacky going on that we didn't find (either in crio or in our setup) and that this will just resurface again in a later version. I'll try more debugging on our end

kwilczynski commented 1 week ago

@olyazavr, there aren't that many commits between 1.29 and 1.30, hopefully, so feel free to bisect between the two versions.

I don't see the connection between kubelet and CRI-O (with conmon and the runtime also working behind the scenes) when it comes to a kubelet restart. The RPC is primarily one-sided, from kubelet to CRI-O, since a CRI does kubelet's bidding, so to speak. And nothing in kubelet kills containers; that's the CRI's job.

Unless I am missing something here, too?

olyazavr commented 6 days ago

Interesting development: I noticed that cadvisor changed versions between 1.27/1.29/1.30, and sure enough, patching our 1.29 kubelet to use the cadvisor version that 1.30 has made the bug disappear.

1.27 had cadvisor v0.47.2
1.29 had cadvisor v0.48.1
1.30 had cadvisor v0.49.0
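For anyone curious, this kind of patch can be done roughly as follows. This is an illustrative sketch using the Kubernetes dependency-pinning scripts; the exact steps used here were not shared:

```console
# from a checkout of the Kubernetes 1.29 release branch
$ ./hack/pin-dependency.sh github.com/google/cadvisor v0.49.0
$ ./hack/update-vendor.sh
$ make WHAT=cmd/kubelet     # rebuild kubelet with the newer cAdvisor
```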

I'll close this out, thanks for all your help!

kwilczynski commented 4 days ago

@olyazavr, so you patched the kubelet 1.27 you run to use the newer cAdvisor dependency, taking the version from Kubernetes 1.29, and that then worked, correct?

olyazavr commented 4 days ago

@kwilczynski I patched 1.29 kubelet to use the cadvisor version that 1.30 had (for me, 1.27 worked, 1.29 had the bug, and 1.30 worked again). My patched 1.29 kubelet worked without this bug

kwilczynski commented 4 days ago

@olyazavr, got it. Good to know this fixed the issues you were seeing!

The number of commits (changes) between releases v0.48.1 and v0.49.0 is not very large, per:

I wonder what the change was that fixed things for you, especially given that there aren't any significant changes of note between these two releases of cAdvisor. Very curious, indeed.

Worth noting is that cAdvisor is a rather slow-moving project nowadays, with releases few and far between.

I wish you had some time to do a bisect, but if not, then I will take the code backport as a win here, even though what actually got fixed is a bit of an X-Files-level enigma.

kwilczynski commented 3 days ago

This might be related to an older issue:

olyazavr commented 2 days ago

I found that this PR fixed my problem: https://github.com/google/cadvisor/pull/3457

With cadvisor at https://github.com/google/cadvisor/commit/42bb3d13a0cf9ab80c880a16c4ebb4f36e51b0c9 it works, and with the previous commit, https://github.com/google/cadvisor/commit/27f1e92b0bcc9d97c10fdc5e6b808e2847e907fb, it does not.

kwilczynski commented 17 hours ago

@olyazavr, thank you! That's a good signal, indeed.

We could update our dependency on cAdvisor even for older releases. I don't see a reason why not.

@haircommander, anything against doing that?