HA not working, Nodes shows "NotReady" Status when datastore node is powered off abruptly

DileepAP commented 9 months ago

The Kubernetes versions I used are 1.28.3 and 1.29.0. I faced the same issue in both versions.

I have a six-node HA cluster.

microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 10.40.101.83:19001 10.40.101.185:19001 10.40.101.186:19001
  datastore standby nodes: 10.40.101.85:19001 10.40.101.128:19001 10.40.101.129:19001

When any of the datastore nodes is shut down, other nodes move to NotReady status. This happens occasionally. The shutdown is a "hard shutdown" from the hypervisor console.

It takes around 20 minutes for the nodes (except the one that was powered off) to recover automatically, and the applications are not accessible during this window.
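To put precise timestamps on this window, the node status can be polled from a surviving node. The following is a minimal sketch, not part of the original report: the function names `not_ready_nodes` and `watch_not_ready` are hypothetical, and it assumes `kubectl` is on the PATH (on MicroK8s that may mean an alias for `microk8s kubectl`).

```shell
#!/bin/sh
# Print the names of nodes whose STATUS is not "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A status such as "Ready,SchedulingDisabled" still counts as Ready.
not_ready_nodes() {
    awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Hypothetical helper: print one timestamped line per poll.
# Nothing runs until you call it, e.g. `watch_not_ready 5`.
watch_not_ready() {
    interval="${1:-5}"
    while true; do
        if out=$(kubectl get nodes --no-headers 2>/dev/null); then
            down=$(printf '%s\n' "$out" | not_ready_nodes | tr '\n' ' ')
        else
            # The API server itself did not answer.
            down="api-unreachable"
        fi
        printf '%s not-ready: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${down:-none}"
        sleep "$interval"
    done
}
```

Redirecting the output of `watch_not_ready` to a file before powering off a datastore node gives a log from which the start and end of the outage can be read directly.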

I expect all the nodes to be in "Ready" state, other than the one which is powered off
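The expectation above follows from majority-quorum arithmetic: the MicroK8s datastore (dqlite) is Raft-based, so with three voting ("datastore master") nodes a majority should survive the loss of one. A small sketch of that arithmetic; the function names are hypothetical and nothing here queries the cluster.

```shell
#!/bin/sh
# Majority needed among N voting nodes: floor(N/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of voters that can fail while a majority remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
```

For the three datastore master nodes listed above, `quorum_size 3` gives 2 and `tolerated_failures 3` gives 1, so a single powered-off voter should not by itself cost the datastore its quorum.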

Is there a fix for this issue?

xaa@ha-02:~$ date
Fri 2 Feb 12:28:23 UTC 2024
xaa@ha-02:~$ kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
ha-05   Ready    <none>   66m   v1.29.0   10.40.101.128   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-01   Ready    <none>   83m   v1.29.0   10.40.101.83    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-06   Ready    <none>   61m   v1.29.0   10.40.101.129   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-04   Ready    <none>   71m   v1.29.0   10.40.101.186   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-03   Ready    <none>   75m   v1.29.0   10.40.101.185   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
ha-02   Ready    <none>   78m   v1.29.0   10.40.101.85    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
xaa@ha-02:~$

Every 2.0s: kubectl get nodes ha-03: Fri Feb 2 12:30:04 2024

NAME    STATUS     ROLES    AGE   VERSION
ha-03   Ready      <none>   76m   v1.29.0
ha-02   Ready      <none>   79m   v1.29.0
ha-06   NotReady   <none>   62m   v1.29.0
ha-01   NotReady   <none>   85m   v1.29.0
ha-04   NotReady   <none>   73m   v1.29.0
ha-05   NotReady   <none>   68m   v1.29.0

DileepAP commented 9 months ago

baa@blr-brd-ha-02:~$ kubectl get nodes -o wide
NAME            STATUS   ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
blr-brd-ha-04   Ready    <none>   3h28m   v1.28.3   10.40.101.186   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-03   Ready    <none>   3h30m   v1.28.3   10.40.101.185   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-06   Ready    <none>   3h24m   v1.28.3   10.40.101.129   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-05   Ready    <none>   3h26m   v1.28.3   10.40.101.128   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-02   Ready    <none>   3h32m   v1.28.3   10.40.101.85    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-01   Ready    <none>   3h35m   v1.28.3   10.40.101.83    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
baa@blr-brd-ha-02:~$ date
Mon 5 Feb 15:27:54 UTC 2024
baa@blr-brd-ha-02:~$ microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 10.40.101.85:19001 10.40.101.186:19001 10.40.101.128:19001
  datastore standby nodes: 10.40.101.83:19001 10.40.101.185:19001 10.40.101.129:19001
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    ingress              # (core) Ingress controller for external access
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    minio                # (core) MinIO object storage
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    rbac                 # (core) Role-Based Access Control for authorisation
    registry             # (core) Private image registry exposed on localhost:32000
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated
baa@blr-brd-ha-02:~$
###############################################################################################

Node "blr-brd-ha-04" was shut down, but node "blr-brd-ha-03" also went to "NotReady" status. At times, this takes down as many as 4 other nodes as well. The node was brought down at around 15:27.

###############################################################################################
baa@blr-brd-ha-02:~$ kubectl get nodes -o wide
NAME            STATUS     ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
blr-brd-ha-01   Ready      <none>   3h38m   v1.28.3   10.40.101.83    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-06   Ready      <none>   3h27m   v1.28.3   10.40.101.129   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-05   Ready      <none>   3h29m   v1.28.3   10.40.101.128   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-03   NotReady   <none>   3h33m   v1.28.3   10.40.101.185   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-02   Ready      <none>   3h35m   v1.28.3   10.40.101.85    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-04   NotReady   <none>   3h31m   v1.28.3   10.40.101.186   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
baa@blr-brd-ha-02:~$ date
Mon 5 Feb 15:30:41 UTC 2024
baa@blr-brd-ha-02:~$

baa@blr-brd-ha-02:~$ date
Mon 5 Feb 15:45:33 UTC 2024
baa@blr-brd-ha-02:~$ kubectl get nodes -o wide
NAME            STATUS     ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
blr-brd-ha-04   NotReady   <none>   3h46m   v1.28.3   10.40.101.186   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-01   Ready      <none>   3h53m   v1.28.3   10.40.101.83    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-03   NotReady   <none>   3h48m   v1.28.3   10.40.101.185   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-06   Ready      <none>   3h42m   v1.28.3   10.40.101.129   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-05   Ready      <none>   3h44m   v1.28.3   10.40.101.128   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-02   Ready      <none>   3h50m   v1.28.3   10.40.101.85    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
baa@blr-brd-ha-02:~$ date
Mon 5 Feb 15:45:37 UTC 2024
baa@blr-brd-ha-02:~$

baa@blr-brd-ha-02:~$ date
Mon 5 Feb 15:48:16 UTC 2024
baa@blr-brd-ha-02:~$ kubectl get nodes -o wide
NAME            STATUS     ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
blr-brd-ha-04   NotReady   <none>   3h49m   v1.28.3   10.40.101.186   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-06   Ready      <none>   3h45m   v1.28.3   10.40.101.129   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-05   Ready      <none>   3h46m   v1.28.3   10.40.101.128   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-01   NotReady   <none>   3h56m   v1.28.3   10.40.101.83    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-02   NotReady   <none>   3h52m   v1.28.3   10.40.101.85    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-03   Ready      <none>   3h50m   v1.28.3   10.40.101.185   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
baa@blr-brd-ha-02:~$ date
Mon 5 Feb 15:48:21 UTC 2024
baa@blr-brd-ha-02:~$

baa@blr-brd-ha-03:~$ kubectl describe node blr-brd-ha-03
Lease:
  HolderIdentity:  blr-brd-ha-03
  AcquireTime:     <unset>
  RenewTime:       Mon, 05 Feb 2024 15:39:53 +0000
Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  NetworkUnavailable   False     Mon, 05 Feb 2024 14:37:50 +0000   Mon, 05 Feb 2024 14:37:50 +0000   CalicoIsUp          Calico is running on this node
  MemoryPressure       Unknown   Mon, 05 Feb 2024 15:39:24 +0000   Mon, 05 Feb 2024 15:29:53 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure         Unknown   Mon, 05 Feb 2024 15:39:24 +0000   Mon, 05 Feb 2024 15:29:53 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure          Unknown   Mon, 05 Feb 2024 15:39:24 +0000   Mon, 05 Feb 2024 15:29:53 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready                Unknown   Mon, 05 Feb 2024 15:39:24 +0000   Mon, 05 Feb 2024 15:29:53 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:  10.40.101.185
  Hostname:    blr-brd-ha-03

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  cpu                1150m (14%)   1500m (18%)
  memory             290Mi (0%)    1114Mi (3%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:              <none>
baa@blr-brd-ha-03:~$

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Nodes are back after 20-plus minutes

Mon 5 Feb 15:54:06 UTC 2024
baa@blr-brd-ha-02:~$ kubectl get nodes -o wide
NAME            STATUS     ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
blr-brd-ha-04   NotReady   <none>   3h55m   v1.28.3   10.40.101.186   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-02   Ready      <none>   3h58m   v1.28.3   10.40.101.85    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-06   Ready      <none>   3h50m   v1.28.3   10.40.101.129   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-05   Ready      <none>   3h52m   v1.28.3   10.40.101.128   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-03   Ready      <none>   3h56m   v1.28.3   10.40.101.185   <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
blr-brd-ha-01   Ready      <none>   4h1m    v1.28.3   10.40.101.83    <none>        Red Hat Enterprise Linux 8.9 (Ootpa)   4.18.0-513.11.1.el8_9.x86_64   containerd://1.6.15
baa@blr-brd-ha-02:~$ date
Mon 5 Feb 15:54:16 UTC 2024
baa@blr-brd-ha-02:~$
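The length of the outage can be worked out from the `date` stamps in these captures. A rough sketch, assuming both timestamps fall on the same day; the helper names `hms_to_seconds` and `elapsed_minutes` are hypothetical, and 15:27:54 is only the last `date` output before the shutdown, not the exact power-off time.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
# awk is used so that fields like "08" are not parsed as octal.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Whole minutes elapsed between two same-day HH:MM:SS timestamps.
elapsed_minutes() {
    start=$(hms_to_seconds "$1")
    end=$(hms_to_seconds "$2")
    echo $(( (end - start) / 60 ))
}
```

With the values from this run, `elapsed_minutes 15:27:54 15:54:06` gives 26, and measuring from the `LastTransitionTime` of 15:29:53 reported for blr-brd-ha-03 gives 24.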

DileepAP commented 9 months ago

inspection-report-20240205_153156.tar.gz Inspection Report from node 3

nik0811 commented 8 months ago

I am facing the same issue. I have a cluster with 3 master nodes and 3 worker nodes; the worker nodes remain in Ready state, but the master nodes went into NotReady state. I have assigned static IPs to the nodes, so networking is not an issue. I am now planning to switch my cluster to k3s. MicroK8s is destructive in case of abrupt power failure.

geocomm-cwillard commented 1 month ago

I am experiencing the same issue. I have a 6-node cluster, with 3 nodes as the master and 3 as workers. I am using Ubuntu 22.04 and microk8s 1.29.4. When I bring down the master node that is the leader for the dqlite cluster, I notice that some of my other nodes show as not ready in the cluster status. This status persists for about 16 to 17 minutes, after which the cluster reports only one node as offline.

canonical / microk8s

HA not working, Nodes shows "NotReady" Status when datastore node is powered off abruptly #4394