djjudas21 opened this issue 1 year ago
This morning I'm also seeing this:
[jonathan@poseidon-gazeley-uk monitoring]$ kubectl get no
E0207 09:21:13.890135 1404195 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0207 09:21:13.896982 1404195 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0207 09:21:13.898445 1404195 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0207 09:21:13.899407 1404195 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME STATUS ROLES AGE VERSION
kube08 Ready <none> 14d v1.26.0
kube06 Ready <none> 14d v1.26.0
kube07 Ready <none> 15h v1.26.0
kube05 Ready <none> 14d v1.26.0
The replication controller is broken too: you can't scale things down (here I was trying to scale down ownCloud while my cluster is inconsistent).
[jonathan@poseidon-gazeley-uk ~]$ kubens owncloud
Context "microk8s" modified.
Active namespace is "owncloud".
[jonathan@poseidon-gazeley-uk ~]$ kubectl get po
NAME READY STATUS RESTARTS AGE
owncloud-redis-master-0 1/1 Running 1 (3d14h ago) 14d
owncloud-7fdd84f69b-7zcr2 0/1 Error 524 13d
[jonathan@poseidon-gazeley-uk ~]$ kubectl get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
owncloud 1/1 1 1 14d
# This took a long time
[jonathan@poseidon-gazeley-uk ~]$ kubectl scale deploy owncloud --replicas=0
deployment.apps/owncloud scaled
# What is going on here?!
[jonathan@poseidon-gazeley-uk ~]$ kubectl get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
owncloud 1/0 1 1 14d
[jonathan@poseidon-gazeley-uk ~]$ kubectl get po
NAME READY STATUS RESTARTS AGE
owncloud-redis-master-0 1/1 Running 1 (3d14h ago) 14d
owncloud-7fdd84f69b-7zcr2 0/1 Error 524 13d
All nodes are printing a lot of errors in their inspection-report/snap.microk8s.daemon-k8s-dqlite/journal.log, like:
Feb 06 00:21:37 kube08 microk8s.daemon-k8s-dqlite[11548]: time="2023-02-06T00:21:37Z" level=error msg="error in txn: query (try: 0): context canceled"
Does this point to a loss of dqlite quorum?
Hi @djjudas21, could you remove the offending node (kube07) with microk8s leave, and then from a node in the cluster do a microk8s remove-node kube07, so that the cluster knows that node is expected to be removed. After this clean removal you should be able to re-join the node.
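As a sketch, that clean removal and re-join flow looks like the following (kube07 as in this thread; the add-node/join step is the standard MicroK8s flow):

```shell
# On the departing node (kube07): leave the cluster
microk8s leave

# On any remaining cluster node: tell the cluster the node is gone for good
microk8s remove-node kube07

# Later, to re-join: generate a join token on an existing node...
microk8s add-node
# ...then run the printed `microk8s join <ip>:25000/<token>` command on kube07
```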
Thanks @ktsakalozos. I already tried this, but I will try again if you think it will help
Well, I removed kube07, and I also decided to remove kube08, because that way HA mode will be disabled and hopefully the quorum issue will be resolved. kube07 was removed easily because no pods ever got scheduled on it last time I re-added it. kube08 was removed by doing microk8s leave and then running microk8s remove-node kube08 from kube05. It appeared to leave cleanly, but all its pods are still reported as healthy!
[jonathan@poseidon-gazeley-uk photoprism]$ kubectl get po -A -o wide | grep kube08
openebs openebs-ndm-node-exporter-rq62w 1/1 Running 0 14d 10.1.13.197 kube08 <none> <none>
openebs openebs-ndm-zd8lm 1/1 Running 0 14d 192.168.0.56 kube08 <none> <none>
openebs openebs-cstor-csi-node-zq8fm 2/2 Running 0 14d 192.168.0.56 kube08 <none> <none>
openebs cstor-disk-pool-kgb4-6ff874bd7b-c5sxb 3/3 Running 0 14d 10.1.13.199 kube08 <none> <none>
democratic-csi truenas-democratic-csi-node-rkmq8 4/4 Running 0 14d 192.168.0.56 kube08 <none> <none>
reactive-resume reactive-resume-postgresql-0 1/1 Running 0 3d12h 10.1.13.200 kube08 <none> <none>
kube-system node-problem-detector-spljp 1/1 Running 0 39h 10.1.13.210 kube08 <none> <none>
monitoring prometheus-smartctl-exporter-0-nwppn 1/1 Running 0 39h 192.168.0.56 kube08 <none> <none>
kube-system calico-node-679fj 1/1 Running 0 14d 192.168.0.56 kube08 <none> <none>
owncloud navidrome-jon-cf5fd75fd-v4xsm 1/1 Running 0 7d20h 10.1.13.209 kube08 <none> <none>
metallb-system speaker-vsgtn 1/1 Running 0 3d19h 192.168.0.56 kube08 <none> <none>
monitoring prometheus-stack-prometheus-node-exporter-g5ftd 2/2 Running 0 39h 192.168.0.56 kube08 <none> <none>
camerahub-prod camerahub-static-7976477cbc-nxvv4 1/1 Running 0 13d 10.1.13.202 kube08 <none> <none>
camerahub-prod camerahub-app-6fcd5cb58-95czm 1/1 Running 0 13d 10.1.13.203 kube08 <none> <none>
cert-manager cert-manager-6c6bdff4d7-55zsh 1/1 Running 394 (14m ago) 3d19h 10.1.13.215 kube08 <none> <none>
openebs openebs-cstor-csi-controller-0 5/6 Error 901 14d 10.1.13.198 kube08 <none> <none>
kube-system metrics-server-6f754f88d-ncvd4 0/1 Terminating 0 3d15h <none> kube08 <none> <none>
ingress nginx-ingress-microk8s-controller-8d5bf 1/1 Running 0 14d 10.1.13.193 kube08 <none> <none>
I confirmed that kube05 is the only remaining datastore, and I am running these kubectl commands against kube05:
jonathan@kube05:~$ microk8s status
microk8s is running
high-availability: no
datastore master nodes: 192.168.0.57:19001
datastore standby nodes: none
I am concerned about doing anything more destructive at the moment because I am using OpenEBS cStor volumes, which place 3 volume replicas across my 4 nodes. Therefore it is safe to destroy 2 nodes but not 3 (it can recover from 1 replica). I also can't accurately verify or migrate my cStor replicas because of the broken api-server.
Now that kube05 is the only master datastore, what about stopping and starting MicroK8s to see if it returns to a normal state?
I've left it a few hours and the cluster hasn't settled down: the pods on the now-absent kube08 are still reported as Running 1/1 and not marked as missing or unhealthy. k8s-dqlite is running at 100% CPU on kube05 (the master). Not sure if I'm hitting #3227.
I still can't figure out what's going on with this. I don't have control of my cluster, the dqlite master is running at 100% CPU. I am reluctant to do anything really invasive because I have OpenEBS/CStor volumes on my nodes which obviously depend on the api-server for their own quorum.
My gut instinct is to destroy the cluster and recreate from scratch to restore service, but I had to do this 3 weeks ago too (https://github.com/canonical/microk8s/issues/3204#issuecomment-1398873620), and it just broke again on its own. I don't have confidence in recent releases of MicroK8s to be stable and durable, especially with hyperconverged storage.
At the moment I'm not touching anything but I'm thinking of improving my off-cluster storage first, and rebuilding without using OpenEBS/CStor, for the security of my data.
I appreciate that MicroK8s is free and that people put a lot of hard work into it, but I feel that there are some serious problems that aren't getting enough attention. There are several open issues about reliability problems potentially related to dqlite:
@djjudas21 a MicroK8s cluster starts with a single node, to which you join other nodes to form a multi-node cluster. As soon as the cluster has three control plane nodes (nodes joined without the --worker flag) it becomes HA. HA means that the datastore is replicated across three nodes and will tolerate one node failure without disrupting its operation. In an HA cluster you have to make sure you have at least three control plane nodes running at all times. If you drop below 2 nodes, the cluster will freeze because there is no quorum to replicate the datastore. In this case (an HA cluster left with a single node) you can recover by following the instructions in the "Recover from lost quorum" page. This should explain why, when the cluster was left with one node (kube05), you were able to read its state but not change it. That is, you could see the kubectl output, but you could not do anything on the cluster. In such a frozen cluster the datastore logs (journalctl -fu snap.microk8s.daemon-k8s-dqlite) report errors like "context deadline exceeded" and "database locked".
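To check whether a node is showing these symptoms, the datastore journal can be grepped directly (a sketch; service name as used elsewhere in this thread):

```shell
# Look for the tell-tale quorum-loss errors in the dqlite journal
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "1 hour ago" \
  | grep -E "context deadline exceeded|database.*locked"
```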
As we said, the first 3 nodes in an HA cluster replicate the datastore. Any write to the datastore needs to be acknowledged by a majority of the nodes, the quorum. As we scale the control plane beyond 3 nodes, the next two nodes maintain a replica of the datastore and are on standby in case a node from the first three departs. When a node misses some heartbeats it gets replaced, and the rest of the nodes agree on the role each node plays. These roles are stored in /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml. In the case of this cluster we have:
kube05:
- Address: 192.168.0.57:19001
  ID: 3297041220608546238
  Role: 0
kube06:
- Address: 192.168.0.58:19001
  ID: 796923914728165793
  Role: 0
kube08:
- Address: 192.168.0.56:19001
  ID: 3971249779275177663
  Role: 0
kube07:
- Address: 192.168.0.59:19001
  ID: 16754272739278091721
  Role: 1
Node kube07 has a role of 1, meaning it is a standby (replicating the datastore but not participating in the quorum). The rest of the nodes have role 0, meaning they are voters/part of the quorum.
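A quick way to list each node's address and role from that file (a sketch using standard tools; path as given above):

```shell
# Print the Address/Role pairs from the dqlite cluster config
sudo grep -E "Address|Role" \
  /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
```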
The nodes cannot change IP. Even if they have crashed, left the cluster, or are misbehaving, they are still considered part of the cluster, because they may be rebooting, having network connectivity problems, undergoing maintenance, etc. If we know a node will not be coming back in its previous state, we must call microk8s remove-node to let the cluster know that specific node has departed permanently. It is important to point out that if we, for example, format and reinstall a node and reuse the hostname and IP, this should be treated as a departed node, so it has to be removed with microk8s remove-node before we join it again.
Of course, the above does not explain why the cluster initially started to freeze, but hopefully it gives you some insight into what is happening under the hood. I am not sure what happened around "Feb 06 00:43:43". I will keep digging in the logs. A way to reproduce the issue would be ideal.
One question I have: why do I see errors like this in the dmesg logs on a couple of nodes?
[1134653.101082] EXT4-fs error (device sdb): __ext4_find_entry:1658: inode #2: comm navidrome: reading directory lblock 0
Thanks @ktsakalozos for the detailed explanation, I appreciate it.
I had incorrectly assumed that if you have an HA cluster and you remove nodes to return it to 1 remaining control plane node, it would resume quorum automatically because it's the only node. This explains why removing nodes made my problem worse. I understand now about having to do microk8s leave and microk8s remove-node, and that reinstalling makes it a "new" node.
I didn't know about the api-server becoming silently read-only in this situation. That certainly explains some of my symptoms. Is it possible to make this failure mode more prominent? For example, every kubectl command could print a warning, and maybe there could be a Prometheus rule to catch this. The user needs to be aware of this failure situation so they don't do what I did and make it worse :facepalm:
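In the meantime, one low-tech way to surface the read-only failure mode is a periodic write probe against the api-server; a sketch (the configmap name quorum-probe is illustrative, not anything MicroK8s provides):

```shell
# Attempt a real write with a short timeout; a quorum-less, read-only
# datastore will typically hang or error here while reads still succeed.
if kubectl --request-timeout=10s -n default create configmap quorum-probe \
     --from-literal=ts="$(date +%s)" >/dev/null 2>&1; then
  kubectl --request-timeout=10s -n default delete configmap quorum-probe >/dev/null 2>&1
  echo "api-server writable"
else
  echo "api-server NOT writable (possible quorum loss)" >&2
fi
```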
In the error snippet about /dev/sdb, this device is an OpenEBS/cStor volume mounted over iSCSI, not a physical device in the node. I suspect it is reporting errors because OpenEBS relies on the cluster to manage itself, and the broken quorum damaged some of the volumes and took them offline. I think this is a symptom, not a cause.
So the underlying problem is the loss of quorum. At Feb 06 00:43:43 I was asleep and not working on the cluster. The uptime of the nodes shows there was no power outage etc. I have two theories:
1. I track the 1.26/stable channel, which had an update around this time, and maybe >=2 nodes updated at the same time:
   1.26/stable: v1.26.1 2023-02-07 (4595) 176MB classic
I will look at snap set system refresh.timer to see if I can get my nodes to update on different days to avoid this.
All 4 of my kube nodes have the same default snap refresh timer (00:00~24:00/4, i.e. up to 4 refreshes at any time of day):
jonathan@kube05:~$ snap refresh --time
timer: 00:00~24:00/4
last: today at 06:19 UTC
next: today at 14:49 UTC
Snap only keeps a log of successful refresh events for 24 hours so I can't tell if my nodes updated simultaneously, but it seems likely given that they all have the same window.
This is my workaround:
jonathan@kube05:~$ sudo snap set system refresh.timer=mon
jonathan@kube06:~$ sudo snap set system refresh.timer=tue
jonathan@kube07:~$ sudo snap set system refresh.timer=wed
jonathan@kube08:~$ sudo snap set system refresh.timer=thu
I think it is worth documenting this somewhere, because it is a simple workaround for a serious problem. I don't know much about how Snap works but it looks like there are several options you can set which could have prevented my failure. Are you able to set any of these options on the MicroK8s package as a Snap package maintainer?
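For reference, snapd also supports holding refreshes outright, which may be an alternative to staggering the timers (a sketch; the --hold flag requires a reasonably recent snapd, around 2.58+):

```shell
# Stagger per-node (as above):
sudo snap set system refresh.timer=mon
# Or hold all refreshes for e.g. 30 days:
sudo snap refresh --hold=720h
# Inspect the active schedule:
snap refresh --time
```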
I do not see a snap refresh around the time of the incident. I do not even see a service restart, let alone a refresh. Did you see anything that leads you to believe it is a snap-refresh-related issue?
Is it possible that at some point any of the nodes ran out of disk space? I see in the inspection scripts that the nodes report that they do not have disk pressure, but kube07 has a "lastTransitionTime" to the NoDiskPressure state at "2023-02-06T17:36:22Z", and for kube06 the "lastTransitionTime" to NoDiskPressure is "2023-02-06T04:34:55Z". Is there a way for us to know what the disks' state was at that point? Is there a workload that could consume too much disk space?
It was just a suspicion, based on that snap auto-refreshes overnight.
It seems unlikely that they ran out of disk. They all have plenty of free space and I don't think any workloads use ephemeral storage in any real way. I do have monitoring for disk space and that wasn't firing an alert, but I have since lost my Prometheus volume so I can't check historical data.
jonathan@kube05:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 98G 31G 63G 33% /
jonathan@kube06:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 107G 35G 68G 34% /
jonathan@kube07:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 226G 16G 199G 8% /
jonathan@kube08:~$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 116G 23G 88G 21% /
So I ran through the guide to recover from lost quorum, with 2 nodes (kube05 and kube06) and the following cluster.yaml:

- Address: 192.168.0.57:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.0.58:19001
  ID: 796923914728165793
  Role: 2
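For anyone following along, the lost-quorum recovery procedure boils down to roughly the following (a simplified sketch; verify the exact steps against the official "Recover from lost quorum" page before running anything):

```shell
# 1. Stop MicroK8s on all remaining nodes
sudo microk8s stop

# 2. On the node you want to keep as the voter, edit cluster.yaml so it is
#    the only entry with Role: 0 (others become 1 = standby or 2 = spare)
sudo nano /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# 3. Start the surviving node first, then the others
sudo microk8s start
```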
I started up both nodes but there is no change in behaviour. kube05 (the master) has dqlite consuming 100% CPU again, calls to the api-server with kubectl sometimes time out, and the dqlite log continues to show:
Feb 09 13:38:46 kube05 microk8s.daemon-k8s-dqlite[3409946]: time="2023-02-09T13:38:46Z" level=error msg="error in txn: query (try: 0): context canceled"
Feb 09 13:38:49 kube05 microk8s.daemon-k8s-dqlite[3409946]: time="2023-02-09T13:38:49Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"
So I don't know what I did wrong, but it looks like dqlite is still read-only. Maybe worth killing kube06 and just starting up with kube05 as a single-node cluster?
I tried again to restore kube05 as a single node on its own; dqlite is still read-only. Any ideas how to restore quorum? I don't really want to start from scratch because I will lose the contents of my cStor volumes.
@djjudas21 could you share a new inspection tarball? I wonder what logs are saying.
Do you think it would be possible to somehow share the contents of the /var/snap/microk8s/current/var/kubernetes/backend/ directory? Note this is the data from the datastore. I am wondering if the 100% CPU is because of a data corruption we are not seeing.
Corruption seems plausible. It took 3m 45s to create this inspection report, and it seemed to be stuck on the dqlite step.
inspection-report-20230209_152432.tar.gz
I could share the /var/snap/microk8s/current/var/kubernetes/backend/ directory somewhere privately, but it is almost 1GB and contains secrets. Let me try and find somewhere.
@djjudas21 would it be possible to append the --debug flag in /var/snap/microk8s/current/args/k8s-dqlite, restart the dqlite service with sudo systemctl restart snap.microk8s.daemon-k8s-dqlite, and, after we see the 100% CPU utilization, produce a new inspection report?
@ktsakalozos we've actually had a couple of power outages in the area tonight so I would've lost quorum anyway! :sweat_smile:
The nodes are all powered back on now, and I've added the --debug flag to dqlite and restarted it. High CPU usage started immediately, so I let it run for a few minutes before producing the new inspection report. Thanks
@djjudas21 could you run a quick test? While dqlite is at 100% CPU utilization, do a sudo systemctl stop snap.microk8s.daemon-kubelite and wait a few moments to see if the dqlite CPU utilization drops. I would like to check whether the CPU usage is due to the load k8s is producing, or if dqlite is consuming CPU cycles in some internal loop.
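That test could be scripted roughly like this (a sketch; pidof assumes the process is named k8s-dqlite, as in the top output later in this thread):

```shell
# Sample k8s-dqlite CPU before and after stopping kubelite
pid=$(pidof k8s-dqlite)
top -b -n 1 -p "$pid" | tail -1                     # CPU with kubelite running
sudo systemctl stop snap.microk8s.daemon-kubelite
sleep 30
top -b -n 1 -p "$pid" | tail -1                     # CPU with kubelite stopped
sudo systemctl start snap.microk8s.daemon-kubelite  # restore the service
```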
@djjudas21
Can you perform
echo '
LIBRAFT_TRACE=1
LIBDQLITE_TRACE=1
UV_THREADPOOL_SIZE=64
' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env
sudo snap restart microk8s.daemon-k8s-dqlite
like described here https://github.com/canonical/microk8s/issues/3419#issuecomment-1397445789 please?
It's the same command as in the link, but with the extra UV_THREADPOOL_SIZE=64, which is a long shot, but might help. Can you then please provide the dqlite logs of a period with high CPU usage (a couple of minutes is fine)? Don't forget to remove the env variables from the file again afterwards, as the *_TRACE variables will spam the logs.
edit: it would be interesting to have those logs with kubelite stopped, as @ktsakalozos suggested, as they would then contain more relevant information.
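A sketch of capturing that window and then undoing the tracing (file path and variable names as in the snippet above):

```shell
# Save a few minutes of trace output while CPU usage is high
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "-5 min" > dqlite-trace.log

# Remove the trace env vars again (they spam the logs) and restart the service
sudo sed -i '/LIBRAFT_TRACE\|LIBDQLITE_TRACE/d' \
  /var/snap/microk8s/current/args/k8s-dqlite-env
sudo snap restart microk8s.daemon-k8s-dqlite
```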
@ktsakalozos
Output of top before stopping kubelite:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1258 root 20 0 5559608 3.2g 19612 S 102.0 20.9 790:24.38 k8s-dqlite
777662 root 20 0 1348340 606044 94600 S 12.2 3.7 0:12.88 kubelite
And after stopping kubelite:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1258 root 20 0 5559612 3.2g 19612 S 101.3 20.9 793:23.84 k8s-dqlite
So, no change in dqlite CPU usage.
@MathieuBordere
Thanks. I added the env vars and restarted dqlite at 10:07. Then I let it run for a minute or two and ran microk8s inspect
@djjudas21 Can you do the same with kubelite turned off please? Would be interesting to see the difference.
Here you go @MathieuBordere inspection-report-20230210_102606.tar.gz
@ktsakalozos Do you know who/what causes
Feb 10 10:25:18 kube05 microk8s.daemon-k8s-dqlite[798763]: time="2023-02-10T10:25:18Z" level=debug msg="TRIGGERED /registry/crd.projectcalico.org/ipamhandles/k8s-pod-network.92488c470d1d0d13de2610a5f999cca354baa6c1400005f698706a0175f3e34d, revision=8675633, delete=false"
Logs look to be spammed with that.
The "TRIGGERED" message is caused by the watchers monitoring the respective keys. It comes from https://github.com/canonical/kine/blob/v0.4.1-k8s-dqlite.3/pkg/logstructured/sqllog/sql.go#L438
For those watching/maintaining, trying to install deployKF can sometimes trigger this behavior, probably because deployKF makes a LOT of Kubernetes API calls during the first install.
Read more in my write-up here: https://github.com/deployKF/deployKF/issues/39#issuecomment-1741646935
Personally, I think microk8s is currently not safe to use because it has the possibility of getting into this state, so I recommend k3s instead.
Even on a single-node cluster it crashes after reboot. I'm moving to k3s.
Summary
Something has gone wrong with (I think) the api-server or dqlite backend. Cluster state is inconsistent and replication controllers are not working.
I have 4 nodes (kube05-kube08) and earlier today I noticed that some of my workloads stopped. I saw that kube07 had some stuck processes and 100% CPU usage, so I attempted to drain it and reboot it. It never drained properly; all the pods that were on kube07 went to Unknown state and never got rescheduled.

I did microk8s leave on kube07, but all the kube07 resources remained on the cluster. I force-deleted all of those pods but they never got rescheduled. Now I have re-added kube07 but nothing is being scheduled on it, even where there are daemonsets etc.

For example, in the monitoring namespace there are 2 daemonsets and both are broken in different ways: there are only 3 prometheus-smartctl-exporter pods and only 2 prometheus-stack-prometheus-node-exporter pods, and the Desired/Current/Ready status is wrong. Something is obviously seriously wrong with the Kubernetes control plane, but I can't figure out what.

What Should Happen Instead?
Reproduction Steps
I can't consistently reproduce.
Introspection Report
microk8s inspect took a long time to run on all nodes, but reported no errors. It was kube07 that was removed and re-added to the cluster.
that was removed and re-added to the cluster.kube05-inspection-report-20230206_185244.tar.gz
kube06-inspection-report-20230206_185337.tar.gz
kube07-inspection-report-20230206_185248.tar.gz
kube08-inspection-report-20230206_185353.tar.gz
Can you suggest a fix?
It may be related to #2724, which I reported over a year ago and was never resolved.
Are you interested in contributing with a fix?
Yes, if I can. I feel like this is a serious and ongoing problem with MicroK8s