djjudas21 opened this issue 1 year ago
This morning I'm also seeing this:
[jonathan@poseidon-gazeley-uk monitoring]$ kubectl get no
E0207 09:21:13.890135 1404195 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0207 09:21:13.896982 1404195 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0207 09:21:13.898445 1404195 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0207 09:21:13.899407 1404195 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME STATUS ROLES AGE VERSION
kube08 Ready <none> 14d v1.26.0
kube06 Ready <none> 14d v1.26.0
kube07 Ready <none> 15h v1.26.0
kube05 Ready <none> 14d v1.26.0
The replication controller is broken too: you can't scale things down (here I was trying to scale down ownCloud while my cluster is inconsistent).
[jonathan@poseidon-gazeley-uk ~]$ kubens owncloud
Context "microk8s" modified.
Active namespace is "owncloud".
[jonathan@poseidon-gazeley-uk ~]$ kubectl get po
NAME READY STATUS RESTARTS AGE
owncloud-redis-master-0 1/1 Running 1 (3d14h ago) 14d
owncloud-7fdd84f69b-7zcr2 0/1 Error 524 13d
[jonathan@poseidon-gazeley-uk ~]$ kubectl get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
owncloud 1/1 1 1 14d
# This took a long time
[jonathan@poseidon-gazeley-uk ~]$ kubectl scale deploy owncloud --replicas=0
deployment.apps/owncloud scaled
# What is going on here?!
[jonathan@poseidon-gazeley-uk ~]$ kubectl get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
owncloud 1/0 1 1 14d
[jonathan@poseidon-gazeley-uk ~]$ kubectl get po
NAME READY STATUS RESTARTS AGE
owncloud-redis-master-0 1/1 Running 1 (3d14h ago) 14d
owncloud-7fdd84f69b-7zcr2 0/1 Error 524 13d
All nodes are printing a lot of errors in their inspection-report/snap.microk8s.daemon-k8s-dqlite/journal.log, like:
Feb 06 00:21:37 kube08 microk8s.daemon-k8s-dqlite[11548]: time="2023-02-06T00:21:37Z" level=error msg="error in txn: query (try: 0): context canceled"
Does this point to a loss of dqlite quorum?
Hi @djjudas21, could you remove the offending node (kube07) with microk8s leave, and then from a node in the cluster do a microk8s remove-node kube07, so that the cluster knows that node is expected to be removed. After this clean removal you should be able to re-join the node.
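As a sketch, that clean removal and re-join flow looks like the following (kube07 as in this thread; the add-node/join step is the standard MicroK8s flow):

```shell
# On the departing node (kube07): leave the cluster
microk8s leave

# On any remaining cluster node: tell the cluster the node is gone for good
microk8s remove-node kube07

# Later, to re-join: generate a join token on an existing node...
microk8s add-node
# ...then run the printed `microk8s join <ip>:25000/<token>` command on kube07
```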
Thanks @ktsakalozos. I already tried this, but I will try again if you think it will help
Well, I removed kube07, and I also decided to remove kube08, because that way HA mode will be disabled and hopefully the quorum issue will be resolved. kube07 was removed easily because no pods ever got scheduled on it last time I re-added it. kube08 was removed by doing microk8s leave and then running microk8s remove-node kube08 from kube05. It appeared to leave cleanly, but all its pods are still reported as healthy!
[jonathan@poseidon-gazeley-uk photoprism]$ kubectl get po -A -o wide | grep kube08
openebs openebs-ndm-node-exporter-rq62w 1/1 Running 0 14d 10.1.13.197 kube08 <none> <none>
openebs openebs-ndm-zd8lm 1/1 Running 0 14d 192.168.0.56 kube08 <none> <none>
openebs openebs-cstor-csi-node-zq8fm 2/2 Running 0 14d 192.168.0.56 kube08 <none> <none>
openebs cstor-disk-pool-kgb4-6ff874bd7b-c5sxb 3/3 Running 0 14d 10.1.13.199 kube08 <none> <none>
democratic-csi truenas-democratic-csi-node-rkmq8 4/4 Running 0 14d 192.168.0.56 kube08 <none> <none>
reactive-resume reactive-resume-postgresql-0 1/1 Running 0 3d12h 10.1.13.200 kube08 <none> <none>
kube-system node-problem-detector-spljp 1/1 Running 0 39h 10.1.13.210 kube08 <none> <none>
monitoring prometheus-smartctl-exporter-0-nwppn 1/1 Running 0 39h 192.168.0.56 kube08 <none> <none>
kube-system calico-node-679fj 1/1 Running 0 14d 192.168.0.56 kube08 <none> <none>
owncloud navidrome-jon-cf5fd75fd-v4xsm 1/1 Running 0 7d20h 10.1.13.209 kube08 <none> <none>
metallb-system speaker-vsgtn 1/1 Running 0 3d19h 192.168.0.56 kube08 <none> <none>
monitoring prometheus-stack-prometheus-node-exporter-g5ftd 2/2 Running 0 39h 192.168.0.56 kube08 <none> <none>
camerahub-prod camerahub-static-7976477cbc-nxvv4 1/1 Running 0 13d 10.1.13.202 kube08 <none> <none>
camerahub-prod camerahub-app-6fcd5cb58-95czm 1/1 Running 0 13d 10.1.13.203 kube08 <none> <none>
cert-manager cert-manager-6c6bdff4d7-55zsh 1/1 Running 394 (14m ago) 3d19h 10.1.13.215 kube08 <none> <none>
openebs openebs-cstor-csi-controller-0 5/6 Error 901 14d 10.1.13.198 kube08 <none> <none>
kube-system metrics-server-6f754f88d-ncvd4 0/1 Terminating 0 3d15h <none> kube08 <none> <none>
ingress nginx-ingress-microk8s-controller-8d5bf 1/1 Running 0 14d 10.1.13.193 kube08 <none> <none>
I confirmed that kube05 is the only remaining datastore, and I am running these kubectl commands against kube05:
jonathan@kube05:~$ microk8s status
microk8s is running
high-availability: no
datastore master nodes: 192.168.0.57:19001
datastore standby nodes: none
I am concerned about doing anything more destructive at the moment because I am using OpenEBS cStor volumes, which place 3 volume replicas across my 4 nodes. Therefore it is safe to destroy 2 nodes but not 3 (it can recover from 1 replica). I also can't accurately verify or migrate my cStor replicas because of the broken api-server.
Now that kube05 is the only master datastore, what about stopping and starting MicroK8s to see if it returns to a normal state?
I've left it a few hours and the cluster hasn't settled down: the pods on the now-absent kube08 are still reported as Running 1/1 and not marked as missing or unhealthy. k8s-dqlite is running at 100% CPU on kube05 (the master). Not sure if I'm hitting #3227.
I still can't figure out what's going on with this. I don't have control of my cluster, the dqlite master is running at 100% CPU. I am reluctant to do anything really invasive because I have OpenEBS/CStor volumes on my nodes which obviously depend on the api-server for their own quorum.
My gut instinct is to destroy the cluster and recreate from scratch to restore service, but I had to do this 3 weeks ago too (https://github.com/canonical/microk8s/issues/3204#issuecomment-1398873620), and it just broke again on its own. I don't have confidence in recent releases of MicroK8s to be stable and durable, especially with hyperconverged storage.
At the moment I'm not touching anything but I'm thinking of improving my off-cluster storage first, and rebuilding without using OpenEBS/CStor, for the security of my data.
I appreciate that MicroK8s is free and that people put a lot of hard work into it, but I feel that there are some serious problems that aren't getting enough attention. There are several open issues about reliability problems potentially related to dqlite:
@djjudas21 a MicroK8s cluster starts with a single node, to which you join other nodes to form a multi-node cluster. As soon as the cluster has three control plane nodes (nodes joined without the --worker flag) it becomes HA. HA means that the datastore is replicated across three nodes and will tolerate one node failure without disrupting its operation. In an HA cluster you have to make sure you have at least three control plane nodes running at all times. If you drop below 2 nodes, the cluster will freeze because there is no quorum to replicate the datastore. In this case (an HA cluster left with a single node) you can recover by following the instructions in the "Recover from lost quorum" page. This should explain why, when the cluster was left with one node (kube05), you were able to read its state but not change it. That is, you could see the kubectl output, but you could not do anything on the cluster. In such a frozen cluster the datastore logs (journalctl -fu snap.microk8s.daemon-k8s-dqlite) report errors like "context deadline exceeded" and "database locked".
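To check whether a node is showing these symptoms, the datastore journal can be grepped directly (a sketch; service name as used elsewhere in this thread):

```shell
# Look for the tell-tale quorum-loss errors in the dqlite journal
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "1 hour ago" \
  | grep -E "context deadline exceeded|database.*locked"
```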
As we said, the first 3 nodes in an HA cluster replicate the datastore. Any write to the datastore needs to be acknowledged by a majority of the nodes, the quorum. As we scale the control plane beyond 3 nodes, the next two nodes maintain a replica of the datastore and are on standby in case a node from the first three departs. When a node misses some heartbeats it gets replaced, and the rest of the nodes agree on the role each node plays. These roles are stored in /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml. In the case of this cluster we have:
kube05:
- Address: 192.168.0.57:19001
  ID: 3297041220608546238
  Role: 0
kube06:
- Address: 192.168.0.58:19001
  ID: 796923914728165793
  Role: 0
kube08:
- Address: 192.168.0.56:19001
  ID: 3971249779275177663
  Role: 0
kube07:
- Address: 192.168.0.59:19001
  ID: 16754272739278091721
  Role: 1
Node kube07 has a role of 1, meaning it is a standby (replicating the datastore but not participating in the quorum). The rest of the nodes have role 0, meaning they are voters/part of the quorum.
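A quick way to list each node's address and role from that file (a sketch using standard tools; path as given above):

```shell
# Print the Address/Role pairs from the dqlite cluster config
sudo grep -E "Address|Role" \
  /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
```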
The nodes cannot change IP. Even if they have crashed, left the cluster, or are misbehaving, they are still considered part of the cluster, because they may be rebooting, having network connectivity problems, undergoing maintenance, etc. If we know a node will not be coming back in its previous state, we must call microk8s remove-node to let the cluster know that specific node has departed permanently. It is important to point out that if we, for example, format and reinstall a node and reuse the hostname and IP, this should be treated as a departed node, so it has to be removed with microk8s remove-node before we join it again.
Of course, the above does not explain why the cluster initially started to freeze, but hopefully it gives you some insight into what is happening under the hood. I am not sure what happened around "Feb 06 00:43:43". I will keep digging in the logs. A way to reproduce the issue would be ideal.
One question I have: why do I see errors like this in the dmesg logs on a couple of nodes?
[1134653.101082] EXT4-fs error (device sdb): __ext4_find_entry:1658: inode #2: comm navidrome: reading directory lblock 0
Thanks @ktsakalozos for the detailed explanation, I appreciate it.
I had incorrectly assumed that if you have an HA cluster and you remove nodes to return it to 1 remaining control plane node, it would resume quorum automatically because it's the only node. This explains why removing nodes made my problem worse. I understand now about having to do microk8s leave and microk8s remove-node, and that reinstalling makes it a "new" node.
I didn't know about the api-server becoming silently read-only in this situation. That certainly explains some of my symptoms. Is it possible to make this failure mode more prominent? For example, every kubectl command could print a warning, and maybe there could be a Prometheus rule to catch this. The user needs to be aware of this failure situation so they don't do what I did and make it worse :facepalm:
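In the meantime, one low-tech way to surface the read-only failure mode is a periodic write probe against the api-server; a sketch (the configmap name quorum-probe is illustrative, not anything MicroK8s provides):

```shell
# Attempt a real write with a short timeout; a quorum-less, read-only
# datastore will typically hang or error here while reads still succeed.
if kubectl --request-timeout=10s -n default create configmap quorum-probe \
     --from-literal=ts="$(date +%s)" >/dev/null 2>&1; then
  kubectl --request-timeout=10s -n default delete configmap quorum-probe >/dev/null 2>&1
  echo "api-server writable"
else
  echo "api-server NOT writable (possible quorum loss)" >&2
fi
```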
In the error snippet about /dev/sdb, this device is an OpenEBS/cStor volume mounted over iSCSI, not a physical device in the node. I suspect it is reporting errors because OpenEBS relies on the cluster to manage itself, and the broken quorum damaged some of the volumes and took them offline. I think this is a symptom, not a cause.
So the underlying problem is the loss of quorum. At Feb 06 00:43:43 I was asleep and not working on the cluster. The uptime of the nodes shows there was no power outage etc. I have two theories:
1. I track the 1.26/stable channel, which had an update around this time, and maybe >=2 nodes updated at the same time:
   1.26/stable: v1.26.1 2023-02-07 (4595) 176MB classic
I will look at snap set system refresh.timer to see if I can get my nodes to update on different days to avoid this.
All 4 of my kube nodes have the same default snap refresh timer (00:00~24:00/4, i.e. up to 4 refreshes at any time of day):
jonathan@kube05:~$ snap refresh --time
timer: 00:00~24:00/4
last: today at 06:19 UTC
next: today at 14:49 UTC
Snap only keeps a log of successful refresh events for 24 hours so I can't tell if my nodes updated simultaneously, but it seems likely given that they all have the same window.
This is my workaround:
jonathan@kube05:~$ sudo snap set system refresh.timer=mon
jonathan@kube06:~$ sudo snap set system refresh.timer=tue
jonathan@kube07:~$ sudo snap set system refresh.timer=wed
jonathan@kube08:~$ sudo snap set system refresh.timer=thu
I think it is worth documenting this somewhere, because it is a simple workaround for a serious problem. I don't know much about how Snap works but it looks like there are several options you can set which could have prevented my failure. Are you able to set any of these options on the MicroK8s package as a Snap package maintainer?
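For reference, snapd also supports holding refreshes outright, which may be an alternative to staggering the timers (a sketch; the --hold flag requires a reasonably recent snapd, around 2.58+):

```shell
# Stagger per-node (as above):
sudo snap set system refresh.timer=mon
# Or hold all refreshes for e.g. 30 days:
sudo snap refresh --hold=720h
# Inspect the active schedule:
snap refresh --time
```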
I do not see a snap refresh around the time of the incident. I do not even see a service restart, let alone a refresh. Did you see anything that leads you to believe it is a snap-refresh-related issue?
Is it possible that at some point any of the nodes ran out of disk space? I see in the inspection scripts that the nodes report that they do not have disk pressure, but kube07 has a "lastTransitionTime" to the NoDiskPressure state at "2023-02-06T17:36:22Z", and for kube06 the "lastTransitionTime" to NoDiskPressure is "2023-02-06T04:34:55Z". Is there a way for us to know what the disks' state was at that point? Is there a workload that could consume too much disk space?
It was just a suspicion, based on that snap auto-refreshes overnight.
It seems unlikely that they ran out of disk. They all have plenty of free space and I don't think any workloads use ephemeral storage in any real way. I do have monitoring for disk space and that wasn't firing an alert, but I have since lost my Prometheus volume so I can't check historical data.
jonathan@kube05:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 98G 31G 63G 33% /
jonathan@kube06:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 107G 35G 68G 34% /
jonathan@kube07:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 226G 16G 199G 8% /
jonathan@kube08:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 116G 23G 88G 21% /
So I ran through the guide to recover from lost quorum, with 2 nodes (kube05 and kube06) and the following cluster.yaml:

- Address: 192.168.0.57:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.0.58:19001
  ID: 796923914728165793
  Role: 2
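For anyone following along, the lost-quorum recovery procedure boils down to roughly the following (a simplified sketch; verify the exact steps against the official "Recover from lost quorum" page before running anything):

```shell
# 1. Stop MicroK8s on all remaining nodes
sudo microk8s stop

# 2. On the node you want to keep as the voter, edit cluster.yaml so it is
#    the only entry with Role: 0 (others become 1 = standby or 2 = spare)
sudo nano /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# 3. Start the surviving node first, then the others
sudo microk8s start
```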
I started up both nodes but there is no change in behaviour. kube05 (the master) has dqlite consuming 100% CPU again, calls to the api-server with kubectl sometimes time out, and the dqlite log continues to show:
Feb 09 13:38:46 kube05 microk8s.daemon-k8s-dqlite[3409946]: time="2023-02-09T13:38:46Z" level=error msg="error in txn: query (try: 0): context canceled"
Feb 09 13:38:49 kube05 microk8s.daemon-k8s-dqlite[3409946]: time="2023-02-09T13:38:49Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
So I don't know what I did wrong, but it looks like dqlite is still read-only. Maybe worth killing kube06 and just starting up with kube05 as a single-node cluster?
I tried again to restore kube05 as a single node on its own; dqlite is still read-only. Any ideas how to restore quorum? I don't really want to start from scratch because I will lose the contents of my cStor volumes.
@djjudas21 could you share a new inspection tarball? I wonder what logs are saying.
Do you think it would be possible to somehow share the contents of the /var/snap/microk8s/current/var/kubernetes/backend/ directory? Note this is the data from the datastore. I am wondering if the 100% CPU is because of a data corruption we are not seeing.
Corruption seems plausible. It took 3m 45s to create this inspection report, and it seemed to be stuck on the dqlite step.
inspection-report-20230209_152432.tar.gz
I could share the /var/snap/microk8s/current/var/kubernetes/backend/ directory somewhere privately, but it is almost 1GB and contains secrets. Let me try and find somewhere.
@djjudas21 would it be possible to append the --debug flag in /var/snap/microk8s/current/args/k8s-dqlite, restart the dqlite service with sudo systemctl restart snap.microk8s.daemon-k8s-dqlite, and, after we see the 100% CPU utilization, produce a new inspection report?
@ktsakalozos we've actually had a couple of power outages in the area tonight so I would've lost quorum anyway! :sweat_smile:
The nodes are all powered back on now, and I've added the --debug flag to dqlite and restarted it. High CPU usage started immediately, so I let it run for a few minutes before producing the new inspection report. Thanks
@djjudas21 could you run a quick test? While dqlite is at 100% CPU utilization, do a sudo systemctl stop snap.microk8s.daemon-kubelite and wait a few moments to see if the dqlite CPU utilization drops. I would like to check whether the CPU usage is due to the load k8s is producing, or if dqlite is consuming CPU cycles in some internal loop.
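That test could be scripted roughly like this (a sketch; pidof assumes the process is named k8s-dqlite, as in the top output later in this thread):

```shell
# Sample k8s-dqlite CPU before and after stopping kubelite
pid=$(pidof k8s-dqlite)
top -b -n 1 -p "$pid" | tail -1                     # CPU with kubelite running
sudo systemctl stop snap.microk8s.daemon-kubelite
sleep 30
top -b -n 1 -p "$pid" | tail -1                     # CPU with kubelite stopped
sudo systemctl start snap.microk8s.daemon-kubelite  # restore the service
```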
@djjudas21
Can you perform
echo '
LIBRAFT_TRACE=1
LIBDQLITE_TRACE=1
UV_THREADPOOL_SIZE=64
' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env
sudo snap restart microk8s.daemon-k8s-dqlite
like described here https://github.com/canonical/microk8s/issues/3419#issuecomment-1397445789 please?
It's the same command as in the link, but with the extra UV_THREADPOOL_SIZE=64, which is a long shot, but might help. Can you then please provide the dqlite logs of a period with high CPU usage (a couple of minutes is fine)? Don't forget to remove the env variables from the file again afterwards, as the *_TRACE variables will spam the logs.
edit: it would be interesting to have those logs with kubelite stopped, as @ktsakalozos suggested, as they would then contain more relevant information.
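A sketch of capturing that window and then undoing the tracing (file path and variable names as in the snippet above):

```shell
# Save a few minutes of trace output while CPU usage is high
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "-5 min" > dqlite-trace.log

# Remove the trace env vars again (they spam the logs) and restart the service
sudo sed -i '/LIBRAFT_TRACE\|LIBDQLITE_TRACE/d' \
  /var/snap/microk8s/current/args/k8s-dqlite-env
sudo snap restart microk8s.daemon-k8s-dqlite
```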
@ktsakalozos
Output of top before stopping kubelite:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1258 root 20 0 5559608 3.2g 19612 S 102.0 20.9 790:24.38 k8s-dqlite
777662 root 20 0 1348340 606044 94600 S 12.2 3.7 0:12.88 kubelite
And after stopping kubelite:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1258 root 20 0 5559612 3.2g 19612 S 101.3 20.9 793:23.84 k8s-dqlite
So, no change in dqlite CPU usage.
@MathieuBordere
Thanks. I added the env vars and restarted dqlite at 10:07. Then I let it run for a minute or two and ran microk8s inspect
@djjudas21 Can you do the same with kubelite turned off please? Would be interesting to see the difference.
Here you go @MathieuBordere inspection-report-20230210_102606.tar.gz
@ktsakalozos Do you know who/what causes
Feb 10 10:25:18 kube05 microk8s.daemon-k8s-dqlite[798763]: time="2023-02-10T10:25:18Z" level=debug msg="TRIGGERED /registry/crd.projectcalico.org/ipamhandles/k8s-pod-network.92488c470d1d0d13de2610a5f999cca354baa6c1400005f698706a0175f3e34d, revision=8675633, delete=false"
Logs look to be spammed with that.
The "TRIGGERED" message is caused by the watchers monitoring the respective keys. It comes from https://github.com/canonical/kine/blob/v0.4.1-k8s-dqlite.3/pkg/logstructured/sqllog/sql.go#L438
For those watching/maintaining, trying to install deployKF can sometimes trigger this behavior, probably because deployKF makes a LOT of Kubernetes API calls during the first install.
Read more in my write-up here: https://github.com/deployKF/deployKF/issues/39#issuecomment-1741646935
Personally, I think microk8s is currently not safe to use because it has the possibility of getting into this state, so I recommend k3s instead.
Even on a single-node cluster it crashes after reboot. I'm moving to k3s.
Summary
Something has gone wrong with (I think) the api-server or dqlite backend. Cluster state is inconsistent and replication controllers are not working.
I have 4 nodes (kube05-kube08) and earlier today I noticed that some of my workloads stopped. I saw that kube07 had some stuck processes and 100% CPU usage, so I attempted to drain it and reboot it. It never drained properly; all the pods that were on kube07 went to Unknown state and never got rescheduled.

I did microk8s leave on kube07, but all the kube07 resources remained on the cluster. I force-deleted all of those pods but they never got rescheduled. Now I have re-added kube07 but nothing is being scheduled on it, even where there are daemonsets etc.

For example, in the monitoring namespace there are 2 daemonsets and both are broken in different ways: there are only 3 prometheus-smartctl-exporter pods and only 2 prometheus-stack-prometheus-node-exporter pods, and the Desired/Current/Ready status is wrong. Something is obviously seriously wrong with the Kubernetes control plane, but I can't figure out what.

What Should Happen Instead?
Reproduction Steps
I can't consistently reproduce.
Introspection Report
microk8s inspect took a long time to run on all nodes, but reported no errors. It was kube07 that was removed and re-added to the cluster.
that was removed and re-added to the cluster.kube05-inspection-report-20230206_185244.tar.gz
kube06-inspection-report-20230206_185337.tar.gz
kube07-inspection-report-20230206_185248.tar.gz
kube08-inspection-report-20230206_185353.tar.gz
Can you suggest a fix?
It may be related to #2724, which I reported over a year ago and was never resolved.
Are you interested in contributing with a fix?
Yes, if I can. I feel like this is a serious and ongoing problem with MicroK8s