flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

Cilium pod hangs on alpha-2969.0.0-hvm #484

Closed pschulten closed 3 years ago

pschulten commented 3 years ago

Description

Cilium fails to start and the Kubernetes node stays in status NotReady.

Impact

Node is not usable

Environment and steps to reproduce

cilium version: quay.io/cilium/cilium:v1.9.5
kubernetes version: Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

Additional information

Setup works on Flatcar-stable-2765.2.6

kubelet error message:

kubelet[13148]: E0823 11:57:14.914922   13148 pod_workers.go:190] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="kube-system/npd-v0.8.2-6zt87" podUID=44be878c-e77e-4744-873b-04c77a4e8ffb
kubelet[13148]: E0823 11:57:17.030551   13148 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = InvalidArgument desc = failed to create containerd container: create container failed validation: container.Runtime.Name must be set: invalid argument"

containerd error message:

env[5605]: time="2021-08-23T13:00:26.752080774Z" level=info msg="No cni config template is specified, wait for other system components to drop the config."
env[5605]: time="2021-08-23T13:00:29.729055720Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:cilium-wmz87,Uid:079fbd67-21f9-41d6-b72a-294d1bde76be,Namespace:kube-system,Attempt:0,}"
env[5605]: time="2021-08-23T13:00:29.739901277Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:cilium-wmz87,Uid:079fbd67-21f9-41d6-b72a-294d1

tormath1 commented 3 years ago

Hi @pschulten, thanks for testing Alpha releases of Flatcar Container Linux! :)

In 2969.0.0, we switched to cgroups v2. See the changelog: https://kinvolk.io/flatcar-container-linux/releases/#release-2969.0.0

You might be interested in reading this documentation: https://kinvolk.io/docs/flatcar-container-linux/latest/container-runtimes/switching-to-unified-cgroups/#kubernetes
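
Whether a node is actually running the unified (v2) hierarchy can be checked from the filesystem type mounted at /sys/fs/cgroup. The following is only a minimal sketch: the helper name `cgroup_version` is made up for illustration, and it assumes GNU coreutils `stat` is available on the node.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flatcar or Kubernetes): map the filesystem
# type reported for the cgroup mount point to a version label.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy or hybrid hierarchy
    *)         echo "unknown" ;;  # anything else, including a failed lookup
  esac
}

# stat -f inspects the filesystem; %T prints its type in human-readable form.
# Falls back to "none" if the path is missing or stat does not support the flags.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "none")
echo "cgroup hierarchy: $(cgroup_version "$fstype")"
```

On a node that reports v2, the kubelet and containerd should both be configured for the systemd cgroup driver, as the linked documentation describes.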

Cilium fails

Is Cilium the only failing deployment?

For your information, Cilium is tested in our CI with both the v1.21 and v1.22 versions of Kubernetes. Do you have some specific configuration in your cluster?

pschulten commented 3 years ago

Thanks. I will have a look.

Is Cilium the only failing deployment?

It's more or less required by all other deployments (it provides the network).

do you have some specific configuration in your cluster?

Yes, lots of it. Maybe you could point me to something more specific?

The Cilium setup is pretty generic: a helm template of the original chart with some changes (version/secrets).

The basic setup is: terraform -> infra; terraform ignition -> OS; kubeadm join/init with a systemd oneshot script.

jepio commented 3 years ago

I have seen this in the past when the containerd config was invalid. Are you using dockershim or containerd as the CRI? You're using the systemd cgroup driver for Kubernetes, correct?

Could you paste the output of the following commands:

systemctl status kubelet
systemctl cat containerd
systemctl status containerd
crictl info

pschulten commented 3 years ago

I'm using containerd. My tests were on a manually edited nodepool ASG of an existing (stable Flatcar) cluster. I'm going to install a completely new cluster with 2969 and keep you posted.

I'm also going to apply the recommended cgroup settings. Sorry for not reading :(

jepio commented 3 years ago

Sorry, I think I found the problem. Our containerd config.toml accidentally zeros out the 'runtime_type' field. If your deployment method supports customizing the containerd config, then you need the following entry:

version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

I'll get this fixed in the next release.
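
To see whether a given node is affected, one could check that the runc runtime section of the containerd config sets a non-empty `runtime_type`. This is only a rough sketch, not a TOML parser: the function name `has_runc_runtime_type` and the `CONTAINERD_CONFIG` variable are made up for illustration, and the check assumes the section header and the key each sit on their own line, as in the entry above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the runc runtime section of a containerd
# config file contains a non-empty runtime_type assignment.
has_runc_runtime_type() {
  awk '
    # On every table header, note whether we just entered the runc runtime table.
    /^[ \t]*\[/ { in_runc = ($0 ~ /containerd\.runtimes\.runc\]/) }
    # Inside that table, look for runtime_type = "<non-empty value>".
    in_runc && /^[ \t]*runtime_type[ \t]*=[ \t]*"[^"]+"/ { found = 1 }
    END { exit !found }
  ' "$1" 2>/dev/null
}

# CONTAINERD_CONFIG is an assumed override for illustration; the default path
# is the one shown later in this thread.
config="${CONTAINERD_CONFIG:-/etc/containerd/config.toml}"
if has_runc_runtime_type "$config"; then
  echo "runtime_type is set for runc in $config"
else
  echo "runtime_type is missing or empty for runc in $config (or the file is unreadable)"
fi
```

A missing or empty value here would be consistent with the kubelet error "container.Runtime.Name must be set" quoted in the report.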

pschulten commented 3 years ago

No problem. I replaced the whole config.toml with your entry and everything works like a charm:

# crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps | grep cilium
df712162e96fa       933a120da2c51       2 minutes ago        Running             cilium-operator           0                   eab24834e74af
eb479faff48c4       38d6adab1281e       2 minutes ago        Running             cilium-agent              0                   24947ba371cd2
# kubectl get no ip-10-x-x-x.eu-central-1.compute.internal -o wide --no-headers
ip-10-x-x-x.eu-central-1.compute.internal   Ready   <none>   6m55s   v1.21.1   10.x.x.x   <none>   Flatcar Container Linux by Kinvolk 2969.0.0 (Oklo)   5.10.59-flatcar   containerd://1.5.5
# cat /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

jepio commented 3 years ago

Lovely! We'll ship a fix in the next alpha minor (2969.1.0).

jepio commented 3 years ago

Thank you so much for reporting this!