fujitatomoya / ros_k8s

Kubernetes / ROS & ROS 2 Cluster Samples
Creative Commons Attribution 4.0 International

failed to create cluster with `[kubelet-start] WARNING: unable to start the kubelet service` #45

Closed: fujitatomoya closed this issue 4 months ago

fujitatomoya commented 4 months ago

I just happened to see this error. On either the primary or a worker node, it can block cluster creation.

root@tomoyafujita:~# kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket unix:///var/run/containerd/containerd.sock
I0509 16:20:05.929932   68214 version.go:256] remote version is much newer: v1.30.0; falling back to: stable-1.27
[init] Using Kubernetes version: v1.27.13
[preflight] Running pre-flight checks
    [WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
W0509 16:20:14.986067   68214 checks.go:835] detected that the sandbox image "registry.k8s.io/pause:3.6" of the container runtime is inconsistent with that used by kubeadm. It is recommended that using "registry.k8s.io/pause:3.9" as the CRI sandbox image.
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local tomoyafujita] and IPs [10.96.0.1 192.168.1.209]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost tomoyafujita] and IPs [192.168.1.209 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost tomoyafujita] and IPs [192.168.1.209 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
W0509 16:20:19.115946   68214 kubelet.go:43] [kubelet-start] WARNING: unable to start the kubelet service: [exit status 1]
[kubelet-start] Please ensure kubelet is reloaded and running manually.
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.

Unfortunately, an error has occurred:
    timed out waiting for the condition

This error is likely caused by:
    - The kubelet is not running
    - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
    - 'systemctl status kubelet'
    - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
    - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
    Once you have found the failing container, you can inspect its logs with:
    - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
fujitatomoya commented 4 months ago

This does not happen with the following setup (Ubuntu 20.04, k8s v1.26.7):

root@tomoyafujita:~# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7", GitCommit:"84e1fc493a47446df2e155e70fca768d2653a398", GitTreeState:"clean", BuildDate:"2023-07-19T12:22:13Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
root@tomoyafujita:~# kubectl get nodes -o wide
NAME           STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
tomoyafujita   Ready    control-plane   9m19s   v1.26.7   43.135.146.89   <none>        Ubuntu 20.04.6 LTS   5.15.0-102-generic   containerd://1.6.31
fujitatomoya commented 4 months ago

The same problem can be observed with Ubuntu 22.04 and k8s v1.26.15.

The root cause is the DNS nameserver limit: the resolver configuration handed to the kubelet contains more nameserver entries than the kubelet supports, which produces the following errors in the kubelet service log.

(My guess is that the combination of the Ubuntu 22.04 upgrade and the newer k8s version changes this behavior; I am not going to dig deeper into it.)
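
As a quick sanity check (these commands are not from the original log, and assume systemd-resolved, the default on Ubuntu 22.04), you can see how many nameservers the kubelet is given:

grep -c '^nameserver' /run/systemd/resolve/resolv.conf   # kubeadm points the kubelet here on systemd-resolved hosts
resolvectl status | grep 'DNS Servers'                   # shows where those entries come from

More than three entries triggers the "Nameserver limits exceeded" errors visible in the service log below.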

root@edgemaster:/home/edgemaster/github.com/fujitatomoya/ros_k8s# systemctl status kubelet
Warning: The unit file, source configuration file or drop-ins of kubelet.service changed on disk. Run 'systemctl daemon-reload' t>
○ kubelet.service
     Loaded: masked (Reason: Unit kubelet.service is masked.)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: inactive (dead) since Fri 2024-02-09 11:26:41 PST; 2 months 30 days ago
   Main PID: 20882 (code=exited, status=0/SUCCESS)
        CPU: 41.031s

Feb 09 11:25:02 edgemaster kubelet[20882]: E0209 11:25:02.547981   20882 dns.go:156] "Nameserver limits exceeded" err="Nameserver>
Feb 09 11:25:35 edgemaster kubelet[20882]: E0209 11:25:35.548358   20882 dns.go:156] "Nameserver limits exceeded" err="Nameserver>
Feb 09 11:25:54 edgemaster kubelet[20882]: E0209 11:25:54.549236   20882 dns.go:156] "Nameserver limits exceeded" err="Nameserver>
Feb 09 11:25:58 edgemaster kubelet[20882]: E0209 11:25:58.547797   20882 dns.go:156] "Nameserver limits exceeded" err="Nameserver>
Feb 09 11:26:04 edgemaster kubelet[20882]: E0209 11:26:04.547791   20882 dns.go:156] "Nameserver limits exceeded" err="Nameserver>
Feb 09 11:26:23 edgemaster kubelet[20882]: E0209 11:26:23.548518   20882 dns.go:156] "Nameserver limits exceeded" err="Nameserver>
Feb 09 11:26:41 edgemaster systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Feb 09 11:26:41 edgemaster systemd[1]: kubelet.service: Deactivated successfully.
Feb 09 11:26:41 edgemaster systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Feb 09 11:26:41 edgemaster systemd[1]: kubelet.service: Consumed 41.031s CPU time.

The solution is as follows:

### unmask the service if masked
root@edgemaster:~# systemctl unmask kubelet
Removed /etc/systemd/system/kubelet.service.
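
To confirm the unit is no longer masked (a quick check, not part of the original transcript):

systemctl is-enabled kubelet   # any answer other than "masked" means the unmask took effect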

### configure kubelet to use static resolver file
root@edgemaster:~# cat /etc/default/kubelet 
KUBELET_EXTRA_ARGS=--resolv-conf=/etc/resolv-static.conf
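
This takes effect because the packaged systemd drop-in (the 10-kubeadm.conf visible in the status output above) sources /etc/default/kubelet and appends $KUBELET_EXTRA_ARGS to the kubelet command line. For reference, an excerpt of the standard drop-in (exact contents can vary by package version):

[Service]
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS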

### create static resolver configuration for stability
root@edgemaster:~# cat /etc/resolv-static.conf
xxxx
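
The actual contents are redacted above; purely for illustration (not the author's actual file), a minimal static resolver file just pins a small set of upstream servers appropriate for your network:

# /etc/resolv-static.conf (illustrative example)
nameserver 8.8.8.8
nameserver 1.1.1.1

Keeping this list at three entries or fewer is the point of the fix, since that stays under the kubelet's nameserver limit.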

### re-run kubeadm init to bring up the control plane
root@edgemaster:~# kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket unix:///var/run/containerd/containerd.sock
I0510 10:43:15.825665    9150 version.go:256] remote version is much newer: v1.30.0; falling back to: stable-1.26
[init] Using Kubernetes version: v1.26.15
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
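
One step the transcript skips: if the earlier failed attempt left state in /etc/kubernetes, kubeadm's preflight checks will refuse a second init until the node is reset (an added note, not from the original log):

kubeadm reset -f   # clears the state left by the previous failed init before retrying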

root@edgemaster:~# kubectl get nodes -o wide
NAME         STATUS   ROLES           AGE   VERSION    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
edgemaster   Ready    control-plane   22s   v1.26.15   43.135.146.155   <none>        Ubuntu 22.04.4 LTS   6.5.0-28-generic   containerd://1.6.31

See also https://github.com/canonical/microk8s/issues/3786 and https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues.

fujitatomoya commented 4 months ago

Closing in favor of https://github.com/fujitatomoya/ros_k8s/pull/48.