canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Cluster out of control #2724

Closed djjudas21 closed 1 year ago

djjudas21 commented 2 years ago

Please run microk8s inspect and attach the generated tarball to this issue.

inspection-report-20211114_192139.tar.gz

I reported #2723 earlier about cluster problems after upgrading to v1.21.6 and then reverting. I have experienced weird problems that seem like the cluster might have gone split-brain at some point: weird inconsistencies with pods, no pods could be scheduled or even added to the list, and whichever node I pointed my kubectl at would give intermittent errors of The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port? Nodes were never marked NotReady, even when I powered them off.

I was unable to repair the cluster (originally 5 nodes), so I gradually removed each node to leave just 1 node to work on and leave the cluster in a consistent state. For each node in turn I ran kubectl drain ... and microk8s leave, but this never had any effect on actually draining any pods. In each case I also had to run microk8s remove-node and then manually clean up all the orphaned pods with kubectl delete pod --force.
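(A sketch of that per-node sequence; kube05 and the pod name are placeholders, and the drain flags shown are typical choices rather than the exact ones used:)

# Cordon and (attempt to) drain the departing node
microk8s kubectl drain kube05 --ignore-daemonsets --delete-emptydir-data
# On the departing node itself, leave the cluster
microk8s leave
# On a surviving node, remove the departed member
microk8s remove-node kube05
# Force-delete any pods stuck in Terminating/Unknown
microk8s kubectl delete pod <pod-name> --force --grace-period=0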

I've now ended up with a single node, and my plan is to restore stability, rebuild all the other nodes, and then add them back so I can scale up again. However, even on a single node I can't schedule any new workloads and I'm still getting weird problems, like the DNS deployment, which claims it has 2 pods:

# DNS claims it has 2 pods
jonathan@kube03:~$ microk8s kubectl -n kube-system get deploy
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server            1/1     1            1           186d
hostpath-provisioner      1/1     1            1           277d
calico-kube-controllers   0/1     1            0           285d
coredns                   2/2     2            2           285d

# DNS has no pods really
jonathan@kube03:~$ microk8s kubectl -n kube-system get po
NAME                          READY   STATUS    RESTARTS   AGE
node-problem-detector-b2ftf   1/1     Running   3          27d
calico-node-rpc5d             1/1     Running   4          18h

# Let's scale it down
jonathan@kube03:~$ microk8s kubectl -n kube-system scale deploy coredns --replicas=0
deployment.apps/coredns scaled

# WTF
jonathan@kube03:~$ microk8s kubectl -n kube-system get deploy
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server            1/1     1            1           186d
hostpath-provisioner      1/1     1            1           277d
calico-kube-controllers   0/1     1            0           285d
coredns                   2/0     2            2           285d

jonathan@kube03:~$ microk8s kubectl -n kube-system get po
NAME                          READY   STATUS    RESTARTS   AGE
node-problem-detector-b2ftf   1/1     Running   3          27d
calico-node-rpc5d             1/1     Running   4          18h

So far I have resisted trashing the whole cluster. I do have Helm charts to put all my workloads back, but I have many PVs configured on an external storage appliance and I don't know how to reconnect to my volumes on a fresh cluster.
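(Not something that was tried here, but one possible approach, sketched with <pv-name> as a placeholder: set the reclaim policy to Retain so the appliance-side volumes survive the teardown, then on the rebuilt cluster re-create PV objects pointing at the same backend volumes and clear any stale claimRef so fresh PVCs can bind:)

# Before teardown: stop Kubernetes from deleting the backing volumes along with their PVCs
microk8s kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
# On the rebuilt cluster, after re-creating the PV: drop the stale claim so a new PVC can bind
microk8s kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'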

So I have no idea what's going on with this but I need to make it stable so I can restore service and scale up again.

djjudas21 commented 2 years ago

Upgraded to v1.22.3, but no change
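(For context, the upgrade was presumably done with something along these lines; the exact channel is an assumption:)

sudo snap refresh microk8s --channel=1.22/stable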

djjudas21 commented 2 years ago
jonathan@kube03:~$ microk8s disable dns
Addon dns is already disabled.

jonathan@kube03:~$ microk8s enable dns
Traceback (most recent call last):
  File "/snap/microk8s/2645/scripts/wrappers/enable.py", line 43, in <module>
    enable(prog_name="microk8s enable")
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/2645/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/2645/scripts/wrappers/enable.py", line 36, in enable
    enabled_addons, _ = get_status(get_available_addons(get_current_arch()), True)
  File "/snap/microk8s/2645/scripts/wrappers/status.py", line 157, in get_status
    kube_output = kubectl_get("all")
  File "/snap/microk8s/2645/scripts/wrappers/common/utils.py", line 169, in kubectl_get
    return run("kubectl", kubeconfig, "get", cmd, "--all-namespaces", die=False)
  File "/snap/microk8s/2645/scripts/wrappers/common/utils.py", line 39, in run
    result.check_returncode()
  File "/snap/microk8s/2645/usr/lib/python3.6/subprocess.py", line 389, in check_returncode
    self.stderr)
subprocess.CalledProcessError: Command '('kubectl', '--kubeconfig=/var/snap/microk8s/2645/credentials/client.config', 'get', 'all', '--all-namespaces')' returned non-zero exit status 1.
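The enable wrapper is dying on an underlying kubectl query; re-running that query directly (a sketch, not from the original report) usually surfaces the real API server error:

# Re-run the call that the enable script makes, to see the underlying error
microk8s kubectl get all --all-namespaces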
teamosceola commented 2 years ago

I'm having the exact same problem. I have a 5-node cluster with the ha, ingress, and dns addons enabled. I have snap configured on my nodes to only refresh on the last Friday of the month. I was on v1.21.5 (Rev 2546), then snap updated to v1.21.7 (Rev 2694) 2 days ago and broke everything. I'm having the same issue with the coredns deployment as described by @djjudas21. I've tried a snap revert back to v1.21.5 (Rev 2546), but I'm still having cluster-wide issues even after reverting back to what was a working revision.
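(For reference, that refresh schedule is the sort of thing set via snapd's system refresh.timer option; the timer string below is illustrative, with fri5 denoting the last Friday of the month if I recall the snapd timer syntax correctly:)

# Limit automatic snap refreshes to the last Friday of the month, 23:00-01:00
sudo snap set system refresh.timer=fri5,23:00-01:00
# Check the resulting schedule and the next planned refresh
snap refresh --time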

ktsakalozos commented 2 years ago

Hi @teamosceola you may want to look at this reply https://github.com/ubuntu/microk8s/issues/2723#issuecomment-968611780

Could you please attach the inspection tarball of your nodes so we can get a hint of how exactly the cluster is failing?

teamosceola commented 2 years ago

Hi @teamosceola you may want to look at this reply #2723 (comment)

Could you please attach the inspection tarball of your nodes so we can get a hint of how exactly the cluster is failing?

inspection-report-20211129_055843.tar.gz

It's probably not memory usage, each node has 12 GB and average usage was ~35% prior to the snap updating.

The nodes are VMware vSphere virtual machines running on an all-flash vSAN array, so it could be I/O, but probably not.

I also ran through the Recovery of HA MicroK8s procedure twice without any improvement.
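(Roughly, that recovery procedure amounts to stopping MicroK8s everywhere, trimming the dqlite membership in cluster.yaml down to the surviving healthy voter(s), and restarting. This is a paraphrase from memory, so follow the published document rather than this sketch:)

# Stop MicroK8s on every node
microk8s stop
# On the node being kept, back up and then edit the dqlite membership list
sudo cp /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml ~/cluster.yaml.bak
sudo vi /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml   # keep only the healthy voter entries
# Bring it back up and wait for the API server
microk8s start
microk8s status --wait-ready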

I'm also frequently/intermittently getting this error message when running microk8s kubectl from the nodes:

The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?

Here is an example output of the strange incoherent state the cluster is in:

user@node-01:~$ kubectl get all
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/kubernetes             ClusterIP   10.152.183.1     <none>        443/TCP             178d
service/awx-operator-metrics   ClusterIP   10.152.183.161   <none>        8383/TCP,8686/TCP   6d12h
service/awx-postgres           ClusterIP   None             <none>        5432/TCP            6d12h
service/awx-service            ClusterIP   10.152.183.31    <none>        80/TCP              6d12h

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/awx-operator   1/0     1            1           6d12h
deployment.apps/awx            1/0     1            1           6d12h

NAME                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/awx-operator-75c79f5489   1         1         1       6d12h
replicaset.apps/awx-54bfbb7cd5            1         1         1       6d12h

NAME                            READY   AGE
statefulset.apps/awx-postgres   1/0     6d12h
ktsakalozos commented 2 years ago

Thank you for the quick response @teamosceola. In the attached logs I see long delays in writing data to disk. See for example:

Nov 29 05:57:18 lab-mgmt-k8s-01 microk8s.daemon-kubelite[7882]: Trace[1914169171]: ---"About to write a response" 18549ms (05:57:00.369)

You can check these logs yourself with journalctl -fu snap.microk8s.daemon-kubelite.
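(To pick out just the slow-request traces from those logs, a filter along these lines works:)

# Show recent apiserver trace lines, which flag slow writes to the datastore
sudo journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep "Trace\["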

Could you please share the contents of /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml so we can see all the members of the datastore (dqlite) cluster. Also, we would like to know which node is acting as the leader and is thus responsible for persisting the data entries. Could you share the output of:

sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
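(The same dqlite shell can also list every member and its role, which complements .leader; a sketch:)

sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".cluster"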

On the node acting as the leader we may want to run a few benchmarks on the storage layer to see how stressed it is. We can try with some generic tests:

dd if=/dev/zero of=/var/snap/microk8s/current/var/kubernetes/backend/test1.img bs=1G count=1 oflag=dsync
hdparm -Tt /dev/mapper/ubuntu--vg-ubuntu--lv

Also measure the block I/O latency by running bcc.biolatency while a dd if=/dev/zero of=test.bin bs=4M count=200 conv=fdatasync is running in the background. See https://github.com/ubuntu/microk8s/issues/2285#issuecomment-847688045.
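Concretely, the latency measurement can be run in two terminals on the leader (a sketch; test.bin is just a scratch file):

# Terminal 1: start the block I/O latency histogram (Ctrl-C to stop and print it)
sudo bcc.biolatency
# Terminal 2: generate synchronous write load while the histogram is collecting
dd if=/dev/zero of=test.bin bs=4M count=200 conv=fdatasync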

djjudas21 commented 2 years ago

Just to feed back on my experience with this problem. Unfortunately for me it ended in data loss because the TrueNAS CSI provisioner and/or kube-api ended up in a state where it was not aware of any PersistentVolumeClaims or PersistentVolumes, so it deleted all my volumes on my TrueNAS appliance, including their linked backup snapshots.

With all my data gone (except for a couple of off-cluster backups) I had no incentive to try and troubleshoot or repair the broken cluster, so I wiped and reinstalled the OS on my nodes, reprovisioned my cluster from scratch and redeployed all my workloads from their charts/manifests.

Sorry I couldn't be of more help diagnosing the fault, but in common with @teamosceola my nodes were not stressed for memory pressure or I/O so I'm not sure what event might have caused dqlite to go rogue.

teamosceola commented 2 years ago

Contents of /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml:

- Address: 10.60.0.115:19001
  ID: 3297041220608546238
  Role: 0
- Address: 10.60.0.116:19001
  ID: 12546450102583263094
  Role: 0
- Address: 10.60.0.117:19001
  ID: 7890313187901146961
  Role: 0
- Address: 10.60.0.118:19001
  ID: 2281304059293145976
  Role: 1
- Address: 10.60.0.119:19001
  ID: 16971314842328136315
  Role: 1

Output of sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader":

10.60.0.116:19001

NOTE: This output was 10 hours after the inspection-report was generated

So the block I/O tests did point out some vSAN config issues that we have someone working on fixing. Assuming the I/O issues get fixed, @ktsakalozos, will the cluster recover? Is it recoverable?

teamosceola commented 2 years ago

@ktsakalozos, we got our I/O issues fixed, but the cluster is still messed up. Any thoughts? Any more info we could provide?

Here are the latest I/O test results:

user@10.60.0.116:~$ dd if=/dev/zero of=/var/snap/microk8s/current/var/kubernetes/backend/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.95135 s, 272 MB/s
user@10.60.0.116:~$ sudo bcc.biolatency 
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 3        |                                        |
        64 -> 127        : 17       |**                                      |
       128 -> 255        : 9        |*                                       |
       256 -> 511        : 13       |**                                      |
       512 -> 1023       : 108      |******************                      |
      1024 -> 2047       : 73       |************                            |
      2048 -> 4095       : 152      |**************************              |
      4096 -> 8191       : 113      |*******************                     |
      8192 -> 16383      : 111      |*******************                     |
     16384 -> 32767      : 228      |****************************************|
     32768 -> 65535      : 196      |**********************************      |
     65536 -> 131071     : 150      |**************************              |
    131072 -> 262143     : 99       |*****************                       |
    262144 -> 524287     : 49       |********                                |
    524288 -> 1048575    : 13       |**                                      |
   1048576 -> 2097151    : 3        |                                        |
ktsakalozos commented 2 years ago

Hi @djjudas21 could you share a new inspection report? I would like to see if the problem persists.

djjudas21 commented 2 years ago

Sure @ktsakalozos. This is an inspection report from my rebuilt cluster. In the end I kept the OS intact on my nodes, ran snap remove microk8s and purged everything. Then I ran snap install microk8s and jumped through the hoops to rejoin all the nodes to the cluster.
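(A sketch of those rebuild steps; the channel, node IP, and token are placeholders:)

# On every node: remove MicroK8s together with all of its data
sudo snap remove microk8s --purge
# Reinstall
sudo snap install microk8s --classic --channel=1.22/stable
# On the first node: print a join command with a fresh token
microk8s add-node
# On each additional node: run the join command that add-node printed, e.g.
microk8s join <first-node-ip>:25000/<token>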

inspection-report-20211205_205529.tar.gz

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.