bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Scale-in activity delays when admin container is enabled #3812

Open rodrigobersa opened 6 months ago

rodrigobersa commented 6 months ago

Image I'm using: bottlerocket-aws-k8s-1.28-x86_64-v1.19.1-c325a08b

What I expected to happen: Scale-in activities should take roughly the same amount of time whether the admin container is enabled or disabled.

What actually happened: Scale-in activities take more than 5 minutes when the admin container is enabled. When it is disabled, the scale-in process takes less than 2 minutes.

Apparently, once SIGTERM hits containerd, systemd starts repeatedly trying to deactivate a mount, which seems to belong to the admin host container, without success.
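
If it helps to confirm this from the node side, these are standard commands for inspecting the lingering mount units; a sketch only, with the unit name pattern taken from the log excerpt below:

```sh
# After reaching a root shell on the node (enter-admin-container + sudo sheltie,
# as described later in the thread), check whether systemd still tracks any
# runc mount units for the k8s.io namespace:
systemctl list-units --type=mount | grep run-containerd-runc-k8s.io

# Check whether the corresponding mounts are still present in the kernel:
findmnt -rn | grep run/containerd/runc/k8s.io

# Watch the repeated deactivation messages as they happen:
journalctl -f | grep 'runc.*Deactivated successfully'
```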

How to reproduce the problem: Spin up a Managed Node Group or a Karpenter NodePool with a Bottlerocket family AMI. Enable the admin container (see the sketch below). Scale out to any number of replicas. Scale in.
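
For reference, the admin container can be enabled with the documented Bottlerocket setting; a minimal sketch:

```sh
# Enable the admin host container on a running node through the Bottlerocket API.
# The same setting can be provided at launch in user data as:
#   [settings.host-containers.admin]
#   enabled = true
apiclient set host-containers.admin.enabled=true
```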

Feb 15 10:31:03 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.cNhkWt.mount: Deactivated successfully.
Feb 15 10:31:13 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.V9Sqd1.mount: Deactivated successfully.
Feb 15 10:31:33 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.fb2IM1.mount: Deactivated successfully.
Feb 15 10:31:43 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.8yARBs.mount: Deactivated successfully.
Feb 15 10:31:53 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.nNPatX.mount: Deactivated successfully.
Feb 15 10:32:23 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.wNYNZV.mount: Deactivated successfully.
Feb 15 10:32:43 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.adSYp4.mount: Deactivated successfully.
Feb 15 10:32:53 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.KXK3eY.mount: Deactivated successfully.
Feb 15 10:33:03 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.8na3Hj.mount: Deactivated successfully.
Feb 15 10:33:03 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.Q2oofj.mount: Deactivated successfully.
Feb 15 10:33:23 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.rzEq2c.mount: Deactivated successfully.
Feb 15 10:33:33 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.hIHHGm.mount: Deactivated successfully.
Feb 15 10:33:43 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.cl4hiM.mount: Deactivated successfully.
Feb 15 10:34:03 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.V8Ow0G.mount: Deactivated successfully.
Feb 15 10:34:13 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.7Ys1Dd.mount: Deactivated successfully.
Feb 15 10:34:13 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.DEGKUp.mount: Deactivated successfully.
Feb 15 10:34:43 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.PvgYkQ.mount: Deactivated successfully.
Feb 15 10:34:43 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.DWMUA7.mount: Deactivated successfully.
Feb 15 10:34:53 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.2BYjLl.mount: Deactivated successfully.
Feb 15 10:35:43 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.tlckbN.mount: Deactivated successfully.
Feb 15 10:35:53 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.1par7Q.mount: Deactivated successfully.
Feb 15 10:35:53 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.HdZbur.mount: Deactivated successfully.
Feb 15 10:36:03 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.9QlOtj.mount: Deactivated successfully.
Feb 15 10:36:13 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.Eg7exB.mount: Deactivated successfully.
Feb 15 10:36:23 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.ralHiZ.mount: Deactivated successfully.
Feb 15 10:36:26 ip-192-168-66-3.us-west-2.compute.internal apiserver[971]: 10:36:26 [INFO] Received exec request to localhost:/exec
Feb 15 10:36:26 ip-192-168-66-3.us-west-2.compute.internal apiserver[971]: 10:36:26 [INFO] exec process returned 0
Feb 15 10:36:26 ip-192-168-66-3.us-west-2.compute.internal apiserver[971]: 10:36:26 [INFO] Closing exec connection; message: "0"
Feb 15 10:36:26 ip-192-168-66-3.us-west-2.compute.internal apiserver[971]: 10:36:26 [INFO] Received exec request to localhost:/exec
Feb 15 10:36:33 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.hwUmTa.mount: Deactivated successfully.
Feb 15 10:36:33 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.E7VdUM.mount: Deactivated successfully.
Feb 15 10:36:37 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: Configuration file /etc/systemd/system/kubelet.service.d/exec-start.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Feb 15 10:36:43 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.087Jfc.mount: Deactivated successfully.
Feb 15 10:37:13 ip-192-168-66-3.us-west-2.compute.internal systemd[1]: run-containerd-runc-k8s.io-b3cd8f645b9345f01fb9a5976473d691862beeff1a60207bc28f7d36a0d4a197-runc.zRVIty.mount: Deactivated successfully.
yeazelm commented 6 months ago

Hello @rodrigobersa, I'll do some testing myself to see if I can reproduce this issue and get back to you.

webern commented 6 months ago

One thing we noticed is that the container that seems to be problematic is in the k8s.io namespace, which means it is not the admin container. I don't think I see anything related to the admin container (though we can't rule out some interaction there).

Can you list the containers running on a host that is in this state?

Use enter-admin-container and sudo sheltie, then run ctr --namespace k8s.io images ls.
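
In full, that would look roughly like this (a sketch; the extra containers/tasks listings are optional, but they may help identify which workload the lingering runc mount belongs to):

```sh
# From the control container, enter the admin container, then drop to a
# root shell on the host:
enter-admin-container
sudo sheltie

# List images in the k8s.io containerd namespace:
ctr --namespace k8s.io images ls

# Optionally, list the containers and their running tasks as well:
ctr --namespace k8s.io containers ls
ctr --namespace k8s.io tasks ls
```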