Closed: djjudas21 closed this issue 1 year ago.
Upgraded to v1.22.3, but no change
jonathan@kube03:~$ microk8s disable dns
Addon dns is already disabled.
jonathan@kube03:~$ microk8s enable dns
Traceback (most recent call last):
  File "/snap/microk8s/2645/scripts/wrappers/enable.py", line 43, in <module>
    enable(prog_name="microk8s enable")
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/2645/scripts/wrappers/enable.py", line 36, in enable
    enabled_addons, _ = get_status(get_available_addons(get_current_arch()), True)
  File "/snap/microk8s/2645/scripts/wrappers/status.py", line 157, in get_status
    kube_output = kubectl_get("all")
  File "/snap/microk8s/2645/scripts/wrappers/common/utils.py", line 169, in kubectl_get
    return run("kubectl", kubeconfig, "get", cmd, "--all-namespaces", die=False)
  File "/snap/microk8s/2645/scripts/wrappers/common/utils.py", line 39, in run
    result.check_returncode()
  File "/snap/microk8s/2645/usr/lib/python3.6/subprocess.py", line 389, in check_returncode
    self.stderr)
subprocess.CalledProcessError: Command '('kubectl', '--kubeconfig=/var/snap/microk8s/2645/credentials/client.config', 'get', 'all', '--all-namespaces')' returned non-zero exit status 1.
I'm having the exact same problem. I have a 5-node cluster with the ha, ingress, and dns addons enabled. Snap is configured on my nodes to refresh only on the last Friday of the month. I was on v1.21.5 (Rev 2546); two days ago snap updated to v1.21.7 (Rev 2694) and broke everything. I'm seeing the same issue with the coredns deployment as described by @djjudas21. I tried a snap revert back to v1.21.5 (Rev 2546), but I'm still having cluster-wide issues even after reverting to what was a working revision.
Hi @teamosceola you may want to look at this reply https://github.com/ubuntu/microk8s/issues/2723#issuecomment-968611780
Could you please attach the inspection tarball of your nodes so we can get a hint of how exactly the cluster is failing?
inspection-report-20211129_055843.tar.gz
It's probably not memory usage, each node has 12 GB and average usage was ~35% prior to the snap updating.
The nodes are VMWare vSphere virtual machines running on an all flash vSAN array, so it could be I/O, but probably not.
I also ran through this Recovery of HA Microk8s procedure twice without any improvement.
I'm also frequently/intermittently getting this error message when running microk8s kubectl from the nodes:
The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?
Here is an example output of the strange incoherent state the cluster is in:
user@node-01:~$ kubectl get all
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/kubernetes             ClusterIP   10.152.183.1     <none>        443/TCP             178d
service/awx-operator-metrics   ClusterIP   10.152.183.161   <none>        8383/TCP,8686/TCP   6d12h
service/awx-postgres           ClusterIP   None             <none>        5432/TCP            6d12h
service/awx-service            ClusterIP   10.152.183.31    <none>        80/TCP              6d12h

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator   1/0     1            1           6d12h
deployment.apps/awx            1/0     1            1           6d12h

NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-75c79f5489   1         1         1       6d12h
replicaset.apps/awx-54bfbb7cd5            1         1         1       6d12h

NAME                           READY   AGE
statefulset.apps/awx-postgres   1/0    6d12h
Thank you for the quick response @teamosceola. In the attached logs I see long delays in writing data to disk. See for example:
Nov 29 05:57:18 lab-mgmt-k8s-01 microk8s.daemon-kubelite[7882]: Trace[1914169171]: ---"About to write a response" 18549ms (05:57:00.369)
You can check these logs yourself with journalctl -fu snap.microk8s.daemon-kubelite.
Could you please share the contents of /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml so we can see all the members of the datastore (dqlite) cluster? We would also like to know which node is acting as the leader and is therefore responsible for persisting the data entries. Could you share the output of:
sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
On the node acting as the leader we may want to run a few benchmarks on the storage layer to see how stressed it is. We can try some generic tests:
dd if=/dev/zero of=/var/snap/microk8s/current/var/kubernetes/backend/test1.img bs=1G count=1 oflag=dsync
hdparm -Tt /dev/mapper/ubuntu--vg-ubuntu--lv
We can also measure block I/O latency by running bcc.biolatency while dd if=/dev/zero of=test.bin bs=4M count=200 conv=fdatasync runs in the background. See https://github.com/ubuntu/microk8s/issues/2285#issuecomment-847688045.
Just to feed back on my experience with this problem. Unfortunately for me it ended in data loss, because the TrueNAS CSI provisioner and/or kube-api ended up in a state where it was not aware of any PersistentVolumeClaims or PersistentVolumes, so it deleted all my volumes on my TrueNAS appliance, including their linked backup snapshots.
With all my data gone (except for a couple of off-cluster backups) I had no incentive to try and troubleshoot or repair the broken cluster, so I wiped and reinstalled the OS on my nodes, reprovisioned my cluster from scratch and redeployed all my workloads from their charts/manifests.
Sorry I couldn't be of more help diagnosing the fault, but in common with @teamosceola my nodes were not stressed for memory pressure or I/O so I'm not sure what event might have caused dqlite to go rogue.
Contents of /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml:
- Address: 10.60.0.115:19001
ID: 3297041220608546238
Role: 0
- Address: 10.60.0.116:19001
ID: 12546450102583263094
Role: 0
- Address: 10.60.0.117:19001
ID: 7890313187901146961
Role: 0
- Address: 10.60.0.118:19001
ID: 2281304059293145976
Role: 1
- Address: 10.60.0.119:19001
ID: 16971314842328136315
Role: 1
Output of sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader":
10.60.0.116:19001
NOTE: this output was captured 10 hours after the inspection report was generated.
So the block I/O tests did point out some vSAN config issues that we have someone working on fixing. Assuming the I/O issues get fixed, @ktsakalozos, will the cluster recover? Is it recoverable?
@ktsakalozos so we got our I/O issues fixed, but the cluster is still messed up, any thoughts? Any more info we could provide?
Here are the latest I/O test results:
user@10.60.0.116:~$ dd if=/dev/zero of=/var/snap/microk8s/current/var/kubernetes/backend/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.95135 s, 272 MB/s
user@10.60.0.116:~$ sudo bcc.biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 3 | |
64 -> 127 : 17 |** |
128 -> 255 : 9 |* |
256 -> 511 : 13 |** |
512 -> 1023 : 108 |****************** |
1024 -> 2047 : 73 |************ |
2048 -> 4095 : 152 |************************** |
4096 -> 8191 : 113 |******************* |
8192 -> 16383 : 111 |******************* |
16384 -> 32767 : 228 |****************************************|
32768 -> 65535 : 196 |********************************** |
65536 -> 131071 : 150 |************************** |
131072 -> 262143 : 99 |***************** |
262144 -> 524287 : 49 |******** |
524288 -> 1048575 : 13 |** |
1048576 -> 2097151 : 3 | |
Hi @djjudas21 could you share a new inspection report? I would like to see if the problem persists.
Sure @ktsakalozos. This is an inspection report from my rebuilt cluster. In the end I kept the OS intact on my nodes, ran snap remove microk8s and purged everything, then ran snap install microk8s and jumped through the hoops to rejoin all the nodes to the cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Please run microk8s inspect and attach the generated tarball to this issue.
inspection-report-20211114_192139.tar.gz
I reported #2723 earlier about cluster problems since upgrading to v1.21.6, and reverting. I have experienced weird problems that seem like the cluster might have gone split-brain at some point?? Weird inconsistencies with pods etc. No pods could be scheduled or even added to the list, and whichever node I pointed my kubectl at would give intermittent errors: The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?
Nodes were never marked as NotReady even when I powered them off. I was unable to repair the cluster (originally 5 nodes), so I gradually removed each node from the cluster to leave just 1 node to work on and keep the cluster in a consistent state. In turn I ran kubectl drain ... and microk8s leave on each node, but this never had any effect on actually draining any pods. In each case I also had to run microk8s remove-node and then manually clean up all the orphaned pods with kubectl delete pod --force.
I've now ended up with a single node, and my plan is to restore stability, rebuild all the other nodes, and then add them back so I can scale up again. However, even on a single node I can't schedule any new workloads and I'm still getting weird problems, like the DNS deployment which says it has 2 pods.
So far I have resisted trashing the whole cluster. I do have Helm charts to put all my workloads back, but I have many PVs configured on an external storage appliance and I don't know how to reconnect to my volumes on a fresh cluster.
So I have no idea what's going on with this but I need to make it stable so I can restore service and scale up again.