canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

When Worker or Master Nodes get shut down they get stuck at NotReady #4579

Open Aaron-Ritter opened 1 month ago

Aaron-Ritter commented 1 month ago

Summary

When stopping (restarting) a node on 1.30, we regularly run into the issue that it does not become Ready again in the cluster.

microk8s inspect shows FAIL: Service snap.microk8s.daemon-kubelite is not running
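
For reference, the state of the failing service can also be checked directly with standard systemd tooling (generic commands, nothing MicroK8s-specific beyond the unit name):

sudo systemctl status snap.microk8s.daemon-kubelite                # current state and last exit code
sudo journalctl -u snap.microk8s.daemon-kubelite -b --no-pager     # service logs since boot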

What Should Happen Instead?

The node should come online without issues.

Reproduction Steps

The reproduction is not consistent, so it appears to be related to the startup of the node or the shutdown before it.

  1. Set up 8 Debian 12 cloud image VMs.
  2. Create an HA cluster with 4 master and 4 worker nodes.
  3. Shut down (restart) one of the nodes and check whether it becomes Ready again.
kubectl get nodes
NAME          STATUS     ROLES    AGE   VERSION
k8s-test-m1   Ready      <none>   79d   v1.30.1
k8s-test-m2   Ready      <none>   79d   v1.30.1
k8s-test-m3   Ready      <none>   79d   v1.30.1
k8s-test-m4   Ready      <none>   79d   v1.30.1
k8s-test-n1   NotReady   <none>   39h   v1.30.1
k8s-test-n2   NotReady   <none>   39h   v1.30.1
k8s-test-n3   Ready      <none>   69d   v1.30.1
k8s-test-n4   Ready      <none>   39h   v1.30.1

On both nodes, Kubernetes-related pods just stay in Running status, while all application pods are stuck Terminating.

ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-npz5d                 3/3     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-wvjdn                    3/3     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
kube-system                   calico-node-7qbzl                                1/1     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
metallb-system                speaker-82qn4                                    1/1     Running             0              39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-hjlwq                 3/3     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-zbnmp                    3/3     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
kube-system                   calico-node-qrdt2                                1/1     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
metallb-system                speaker-p74qt                                    1/1     Running             0              39h     10.14.214.42   k8s-test-n2   <none>           <none>
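
For reference, the per-node pod listing above can be reproduced with a field selector (generic kubectl usage; the exact command used to produce the listing is an assumption):

kubectl get pods -A -o wide --field-selector spec.nodeName=k8s-test-n1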
× snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite
     Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-kubelite.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d
             └─delegate.conf
     Active: failed (Result: exit-code) since Fri 2024-07-19 13:10:13 UTC; 13min ago
   Duration: 426ms
    Process: 1860 ExecStart=/usr/bin/snap run microk8s.daemon-kubelite (code=exited, status=255/EXCEPTION)
   Main PID: 1860 (code=exited, status=255/EXCEPTION)
        CPU: 467ms

Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 5.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: Stopped snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Start request repeated too quickly.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Jul 19 13:10:13 k8s-test-n1 systemd[1]: Failed to start snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.

snap.microk8s.daemon-kubelite.service.txt

After restarting MicroK8s on the worker node manually with sudo snap stop microk8s and sudo snap start microk8s, it recovered and reconnected:

ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-npz5d                 3/3     Running   3 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-wvjdn                    3/3     Running   3 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
kube-system                   calico-node-7qbzl                                1/1     Running   1 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
metallb-system                speaker-82qn4                                    1/1     Running   1 (14m ago)    39h     10.14.214.41   k8s-test-n1   <none>           <none>
ceph-csi-cephfs               ceph-csi-cephfs-nodeplugin-hjlwq                 3/3     Running   3 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>
ceph-csi-rbd                  ceph-csi-rbd-nodeplugin-zbnmp                    3/3     Running   3 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>
kube-system                   calico-node-qrdt2                                1/1     Running   1 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>
metallb-system                speaker-p74qt                                    1/1     Running   1 (14m ago)    39h     10.14.214.42   k8s-test-n2   <none>           <none>

If restarting the affected node does not work, removing the node and adding it again is the only thing that helps.
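
For completeness, the remove/re-add workaround would look roughly like this (standard MicroK8s clustering commands; node name, IP and token are placeholders):

# on the affected node (if it still responds)
microk8s leave

# on a healthy control-plane node
microk8s remove-node k8s-test-n1
microk8s add-node                     # prints a join command with a fresh token

# back on the affected node
microk8s join <control-plane-ip>:25000/<token> --worker    # drop --worker for a control-plane node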

Introspection Report

todo

Can you suggest a fix?

not at this moment

Are you interested in contributing with a fix?

Yes, very happy to test and collaborate further on finding the problem.

Aaron-Ritter commented 1 month ago

While trying to extract the inspect information I discovered the following: after shutting down one of my master nodes and the node being stuck at NotReady, as soon as I ran microk8s inspect it became Ready.

When I looked at the snap.microk8s.daemon-kubelite.service logs, I discovered that it was in an endless restart loop and somehow the inspect got it out of it.

my.log:Jul 20 17:29:36 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 1.
my.log:Jul 20 17:29:39 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 2.
my.log:Jul 20 17:29:42 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 3.
my.log:Jul 20 17:29:44 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 4.
my.log:Jul 20 17:29:47 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 5.
my.log:Jul 20 17:29:50 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 6.

....

my.log:Jul 20 17:36:11 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 132.
my.log:Jul 20 17:36:18 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 133.
my.log:Jul 20 17:36:25 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 134.
my.log:Jul 20 17:36:32 k8s-test-m3 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 135.
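
My assumption (not verified against what inspect actually does) is that restarting the service also clears the systemd start-rate limit behind the "Start request repeated too quickly" message above; a manual equivalent would be:

sudo systemctl reset-failed snap.microk8s.daemon-kubelite    # clear failed state and start-limit counter
sudo systemctl restart snap.microk8s.daemon-kubelite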

During the whole time it complained about netfilter: Error: open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory

Is there anything that inspect would influence with regard to that?

Jul 20 17:29:25 k8s-test-m3 microk8s.daemon-kubelite[1148]: + /sbin/modprobe br_netfilter
Jul 20 17:29:25 k8s-test-m3 microk8s.daemon-kubelite[1148]: + echo 'Successfully loaded br_netfilter module.'
Jul 20 17:29:25 k8s-test-m3 microk8s.daemon-kubelite[1148]: Successfully loaded br_netfilter module.
Jul 20 17:29:35 k8s-test-m3 microk8s.daemon-kubelite[1148]: I0720 17:29:35.919600    1148 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_max" value=524288
Jul 20 17:29:35 k8s-test-m3 microk8s.daemon-kubelite[1148]: E0720 17:29:35.919632    1148 server.go:558] "Error running ProxyServer" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Jul 20 17:29:35 k8s-test-m3 microk8s.daemon-kubelite[1148]: Error: open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Jul 20 17:29:35 k8s-test-m3 microk8s.daemon-kubelite[1148]: F0720 17:29:35.921810    1148 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Jul 20 17:29:38 k8s-test-m3 microk8s.daemon-kubelite[1754]: I0720 17:29:38.863352    1754 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_max" value=524288
Jul 20 17:29:38 k8s-test-m3 microk8s.daemon-kubelite[1754]: E0720 17:29:38.863386    1754 server.go:558] "Error running ProxyServer" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Jul 20 17:29:38 k8s-test-m3 microk8s.daemon-kubelite[1754]: Error: open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Jul 20 17:29:38 k8s-test-m3 microk8s.daemon-kubelite[1754]: F0720 17:29:38.863906    1754 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory

....

Jul 20 17:36:25 k8s-test-m3 microk8s.daemon-kubelite[33529]: I0720 17:36:25.108520   33529 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_max" value=524288
Jul 20 17:36:25 k8s-test-m3 microk8s.daemon-kubelite[33529]: E0720 17:36:25.108547   33529 server.go:558] "Error running ProxyServer" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Jul 20 17:36:25 k8s-test-m3 microk8s.daemon-kubelite[33529]: Error: open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Jul 20 17:36:25 k8s-test-m3 microk8s.daemon-kubelite[33529]: F0720 17:36:25.109051   33529 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Jul 20 17:36:31 k8s-test-m3 microk8s.daemon-kubelite[33869]: I0720 17:36:31.877017   33869 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_max" value=524288
Jul 20 17:36:31 k8s-test-m3 microk8s.daemon-kubelite[33869]: E0720 17:36:31.877041   33869 server.go:558] "Error running ProxyServer" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Jul 20 17:36:31 k8s-test-m3 microk8s.daemon-kubelite[33869]: Error: open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Jul 20 17:36:31 k8s-test-m3 microk8s.daemon-kubelite[33869]: F0720 17:36:31.877603   33869 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
Jul 20 17:36:38 k8s-test-m3 microk8s.daemon-kubelite[34141]: I0720 17:36:38.599899   34141 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_max" value=524288
Jul 20 17:36:38 k8s-test-m3 microk8s.daemon-kubelite[34141]: I0720 17:36:38.599994   34141 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_tcp_timeout_established" value=86400
Jul 20 17:36:38 k8s-test-m3 microk8s.daemon-kubelite[34141]: I0720 17:36:38.600029   34141 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_tcp_timeout_close_wait" value=3600
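
If the nf_conntrack_max error means the nf_conntrack module is not loaded yet when kube-proxy starts (assumption on my side), a quick manual check on the affected node would be:

lsmod | grep nf_conntrack                          # is the module loaded?
sudo modprobe nf_conntrack                         # load it manually
cat /proc/sys/net/netfilter/nf_conntrack_max       # the file kube-proxy failed to open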
Aaron-Ritter commented 1 month ago

Possibly related to:

https://github.com/canonical/microk8s/issues/4342
https://github.com/canonical/microk8s/issues/4449