Hello @mathnitin,
Thank you for reporting your issue.
From the inspection report, we can see that both your first node and your second node rebooted (at Aug 01 12:53:40 and Aug 01 13:04:06, respectively), which caused the microk8s snap to restart. When a node goes down, a re-election of the database leader node occurs based on the principles of the Raft algorithm. This re-election process happens over the network.
Could you please describe the network glitch that led to this issue?
@louiseschmidtgen For node3, we disconnected the network adapter of the VM. We did not perform any operations on the first or second node.
Thank you for the additional information @mathnitin.
Would you be willing to reproduce this issue with additional flags enabled?
Please uncomment the flags `LIBDQLITE_TRACE=1` and `LIBRAFT_TRACE=1` in `k8s-dqlite-env`, which is under `/var/snap/microk8s/current/args/`.
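For reference, a minimal sketch of enabling them (assuming the default snap paths and that the flags are present but commented out in that file):

```bash
# Uncomment the dqlite/raft tracing flags in the k8s-dqlite environment file
sudo sed -i 's/^#\s*LIBDQLITE_TRACE=1/LIBDQLITE_TRACE=1/; s/^#\s*LIBRAFT_TRACE=1/LIBRAFT_TRACE=1/' \
  /var/snap/microk8s/current/args/k8s-dqlite-env
# Restart the datastore service so the new environment takes effect
sudo snap restart microk8s.daemon-k8s-dqlite
```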
Your help in resolving this issue is much appreciated!
@louiseschmidtgen We tried a few options. We reduced the load on our systems; we are running just microk8s and 2 test ubuntu pods (reference: https://gist.github.com/lazypower/356747365cb80876b0b336e2b61b9c26). We are able to reproduce this on both 1.28.7 and 1.28.12.
For collecting the dqlite logs, we used 1.28.7. Attached are the logs for the same.
Node 1 Inspect report:
node-1-inspection-report-20240802_114324.tar.gz
Node 2 Inspect report: node-2-inspection-report-20240802_114324.tar.gz
Node 3 Inspect report: node-3-inspection-report-20240802_114340.tar.gz
For this run, we disconnected the node3 network and all 3 nodes went into the NotReady state after a few minutes. Recovery time was about 15 minutes, as before.
core@glop-nm-120-mem1:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
glop-nm-120-mem1.glcpdev.cloud.hpe.com Ready <none> 18h v1.28.7
glop-nm-120-mem2.glcpdev.cloud.hpe.com Ready <none> 17h v1.28.7
glop-nm-120-mem3.glcpdev.cloud.hpe.com Ready <none> 17h v1.28.7
core@glop-nm-120-mem1:~$ watch 'kubectl get nodes'
core@glop-nm-120-mem1:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
glop-nm-120-mem1.glcpdev.cloud.hpe.com NotReady <none> 18h v1.28.7
glop-nm-120-mem2.glcpdev.cloud.hpe.com NotReady <none> 17h v1.28.7
glop-nm-120-mem3.glcpdev.cloud.hpe.com NotReady <none> 17h v1.28.7
core@glop-nm-120-mem1:~$ << approx after 15 minutes >>
core@glop-nm-120-mem1:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
glop-nm-120-mem1.glcpdev.cloud.hpe.com Ready <none> 18h v1.28.7
glop-nm-120-mem2.glcpdev.cloud.hpe.com Ready <none> 18h v1.28.7
glop-nm-120-mem3.glcpdev.cloud.hpe.com NotReady <none> 18h v1.28.7
Pod snapshot on the cluster
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default ubuntu1 1/1 Running 2 (17m ago) 17h
default ubuntu2 1/1 Running 2 (17m ago) 17h
kube-system calico-kube-controllers-77bd7c5b-mp5zw 1/1 Running 8 (17m ago) 18h
kube-system calico-node-52jpp 1/1 Running 8 (17m ago) 18h
kube-system calico-node-cxtl4 1/1 Running 8 (17m ago) 17h
kube-system calico-node-tjjqw 1/1 Running 3 (17m ago) 18h
kube-system coredns-7998696dbd-2svgv 1/1 Running 2 (17m ago) 17h
kube-system coredns-7998696dbd-5p899 1/1 Running 3 (17m ago) 17h
kube-system coredns-7998696dbd-7xxpt 1/1 Running 3 (17m ago) 17h
kube-system metrics-server-848968bdcd-jkx6l 1/1 Running 8 (17m ago) 18h
Also for this run, I described the node and collected the output:
$ kubectl describe nodes glop-nm-120-mem1.glcpdev.cloud.hpe.com
Name: glop-nm-120-mem1.glcpdev.cloud.hpe.com
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=glop-nm-120-mem1.glcpdev.cloud.hpe.com
kubernetes.io/os=linux
microk8s.io/cluster=true
node.kubernetes.io/microk8s-controlplane=microk8s-controlplane
Annotations: node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.245.244.122/24
projectcalico.org/IPv4VXLANTunnelAddr: 172.23.107.128
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 01 Aug 2024 17:12:54 -0700
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: glop-nm-120-mem1.glcpdev.cloud.hpe.com
AcquireTime: <unset>
RenewTime: Fri, 02 Aug 2024 11:25:37 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 02 Aug 2024 11:10:00 -0700 Fri, 02 Aug 2024 11:10:00 -0700 CalicoIsUp Calico is running on this node
MemoryPressure Unknown Fri, 02 Aug 2024 11:25:25 -0700 Fri, 02 Aug 2024 11:24:05 -0700 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Fri, 02 Aug 2024 11:25:25 -0700 Fri, 02 Aug 2024 11:24:05 -0700 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Fri, 02 Aug 2024 11:25:25 -0700 Fri, 02 Aug 2024 11:24:05 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Fri, 02 Aug 2024 11:25:25 -0700 Fri, 02 Aug 2024 11:24:05 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.245.244.122
Hostname: glop-nm-120-mem1.glcpdev.cloud.hpe.com
Capacity:
cpu: 64
ephemeral-storage: 551044160Ki
hugepages-1Gi: 0
hugepages-2Mi: 4Gi
memory: 264105564Ki
pods: 555
Allocatable:
cpu: 64
ephemeral-storage: 549995584Ki
hugepages-1Gi: 0
hugepages-2Mi: 4Gi
memory: 259808860Ki
pods: 555
System Info:
Machine ID: 76cff06500b64c5e9b9ff6d48dfb5413
System UUID: 4216f49d-c05e-d63f-0763-b001fa41d910
Boot ID: 88483455-62d0-42f7-a00d-e6acded32ec9
Kernel Version: 5.15.0-111-fips
OS Image: Ubuntu 22.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.15
Kubelet Version: v1.28.7
Kube-Proxy Version: v1.28.7
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-kube-controllers-77bd7c5b-mp5zw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
kube-system calico-node-tjjqw 250m (0%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system coredns-7998696dbd-7xxpt 100m (0%) 100m (0%) 128Mi (0%) 128Mi (0%) 17h
kube-system metrics-server-848968bdcd-jkx6l 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 18h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 450m (0%) 100m (0%)
memory 328Mi (0%) 128Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal RegisteredNode 2m14s node-controller Node glop-nm-120-mem1.glcpdev.cloud.hpe.com event: Registered Node glop-nm-120-mem1.glcpdev.cloud.hpe.com in Controller
Normal NodeNotReady 94s node-controller Node glop-nm-120-mem1.glcpdev.cloud.hpe.com status is now: NodeNotReady
Timeline for the attached inspect reports (approximate times, PST):
- Aug 2 11:12+: node3 network was disconnected (manually triggered).
- Aug 2 11:20: all 3 nodes went into the NotReady state.
- Aug 2 11:38: node1 and node2 recovered.
- Aug 2 11:40: node3 network was re-established.
- Aug 2 11:41+: all nodes are in a healthy state.
This run is microk8s v1.28.12, with the Node 1 network disconnected:
```
core@glop-nm-115-mem2:~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
glop-nm-115-mem1.glcpdev.cloud.hpe.com NotReady <none> 70m v1.28.12 10.245.244.117 <none> Ubuntu 22.04.4 LTS 5.15.0-111-fips containerd://1.6.28
glop-nm-115-mem2.glcpdev.cloud.hpe.com NotReady <none> 59m v1.28.12 10.245.244.118 <none> Ubuntu 22.04.4 LTS 5.15.0-111-fips containerd://1.6.28
glop-nm-115-mem3.glcpdev.cloud.hpe.com NotReady <none> 48m v1.28.12 10.245.244.119 <none> Ubuntu 22.04.4 LTS 5.15.0-111-fips containerd://1.6.28
```
Attaching logs for node2 and node3 while both are reporting the NotReady state:
inspection-report-20240802_114743_node3_NotReady.tar.gz inspection-report-20240802_114654_node2_NotReady.tar.gz
These are the logs after node2 and node3 recovered: inspection-report-20240802_115338_node2_Ready.tar.gz inspection-report-20240802_115356_node3_Ready.tar.gz
These are the logs for node1: inspection-report-20240802_120725_node1.tar.gz
Hi @mathnitin and @veenadong, thanks for helping us get to the bottom of this.
@mathnitin, based on the `journal.log` logs in your most recent comment, it seems like node3 was the dqlite cluster leader before being taken offline, and after that node2 won the election to become the next leader. The node1 logs indicate that by 11:23 the new leader is successfully replicating at least some transactions. Unfortunately, the node2 logs, which are the most important for determining why the cluster is NotReady after node3 goes down, are cut off before 11:40, at which point the NotReady period is already over. Perhaps the size or age limits for your journald are keeping those older logs from being retained? If you could collect logs that show the whole period of time between taking node3 offline and recovery on all three nodes, it'd be invaluable!
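If journald retention does turn out to be the limiting factor, a quick way to check (and, as an assumption on my part, raise) the limits might look like:

```bash
# Current journal disk usage and the configured caps
journalctl --disk-usage
grep -E '^#?\s*(SystemMaxUse|MaxRetentionSec)' /etc/systemd/journald.conf
# Example only: allow more history so the whole incident window is retained
sudo sed -i 's/^#\?SystemMaxUse=.*/SystemMaxUse=2G/' /etc/systemd/journald.conf
sudo systemctl restart systemd-journald
```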
EDIT: It looks like the cutoff in the journalctl logs is due to the limit set in microk8s' inspect script here. If these machines are still available, you could gather more complete logs by running `journalctl -u snap.microk8s.daemon-k8s-dqlite -S '2024-08-02 11:11:00' -U '2024-08-02 11:43:00'` on each affected node.
@veenadong, if you can trigger the issue repeatably, could you follow @louiseschmidtgen's instructions here to turn on dqlite tracing and run journalctl manually (`journalctl -u snap.microk8s.daemon-k8s-dqlite -S $start_of_incident -U $end_of_incident`) to get complete logs? Thanks in advance!
@cole-miller Just recreated the issue on the setup. This time, I set the log level of dqlite to 2. We are reverting the machines to different states, so we can't execute the journalctl command.
Timeline for the attached inspect reports (approximate times, PST):
- Aug 5 15:20: node2 network was disconnected (manually triggered).
- Aug 5 15:21: all 3 nodes went into the NotReady state.
- Aug 5 15:36: node1 and node3 recovered.
- Aug 5 15:40: node2 network was re-established; all nodes went into the Ready state.
Inspect reports of node1 and node3 when all 3 nodes are in the NotReady state:
Node1: all-3-nodes-down-glop-nm-120-mem1.tar.gz
Node3: all-3-nodes-down-glop-nm-120-mem3.tar.gz
Inspect reports of node1 and node3 after node1 and node3 recovered:
Node1: 1-node-showing-down-glop-nm-120-mem1.tar.gz
Node3: 1-node-showing-down-glop-nm-120-mem3.tar.gz
Inspect reports when all 3 nodes are in the Ready state:
Node1: all-3-nodes-up-glop-nm-120-mem1.tar.gz
Node2: all-3-nodes-up-glop-nm-120-mem2.tar.gz
Node3: all-3-nodes-up-glop-nm-120-mem3.tar.gz
Please let us know if you need any other info.
Hi @mathnitin,
Thank you for providing further inspection reports. We have been able to reproduce the issue on our end and are in the process of narrowing down the cause of the issue.
We appreciate all your help!
@louiseschmidtgen any update or recommendations for us to try?
Not yet @mathnitin, we are still working on it.
Hi @mathnitin,
We’ve identified the issue and would appreciate it if you could try installing MicroK8s from the temporary channel `1.28/edge/fix-ready`. Please let us know if this resolves the problem on your end.
Your assistance in helping us address this issue is greatly appreciated!
@louiseschmidtgen Thanks for providing the patch. Yes, we will install and test it. We should have an update for you by tomorrow.
@louiseschmidtgen We did our testing with the dev version; below are our observations.
With this fix, we are seeing a delay in the detection of nodes for the "NotReady" state. The detection can take from 1 minute to 5 minutes. We are running watch on `kubectl get nodes`.
One question: is there a command we can use to find the dqlite leader at a given point in time?
@mathnitin, thank you for your feedback on our dev version! We appreciate it and will take it into consideration for improving our solution.
To find out who the dqlite leader is, you can run the following command:
sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
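If it helps, the same dqlite shell should also be able to list all members and their roles via the `.cluster` command (a sketch; `LD_LIBRARY_PATH` may be needed so the binary finds its bundled libraries):

```bash
sudo -E LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib \
  /snap/microk8s/current/bin/dqlite \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  k8s ".cluster"
```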
@louiseschmidtgen We saw a new issue on the dev version. For one of our runs, when we disconnected the network, microk8s lost HA. Below is the microk8s status of all 3 nodes after we connected the network back. I don't know if we will be able to give you steps to recreate it; if we do, we will let you know.
Logs for Node 1.
root@glop-nm-110-mem1:/var/snap/microk8s/common# cat .microk8s.yaml <-- Launch config to bringup microk8s
version: 0.2.0
persistentClusterToken: d3b0c44298fc1c149afbf4c8996fb925
addons:
  - name: rbac
  - name: metrics-server
  - name: dns
extraCNIEnv:
  IPv4_CLUSTER_CIDR: 172.23.0.0/16
  IPv4_SERVICE_CIDR: 172.29.0.0/23
extraKubeAPIServerArgs:
  --service-node-port-range: 80-32767
extraKubeletArgs:
  --max-pods: 555
  --cluster-domain: cluster.local
  --cluster-dns: 172.29.0.10
extraContainerdArgs:
  --root: /data/var/lib/containerd/
  --state: /data/run/containerd/
extraSANs:
  - 172.29.0.1
  - 172.23.0.1
root@glop-nm-110-mem1:/var/snap/microk8s/common# microk8s status
microk8s is running
high-availability: no
datastore master nodes: glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
datastore standby nodes: none
addons:
enabled:
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm # (core) Helm - the package manager for Kubernetes
helm3 # (core) Helm 3 - the package manager for Kubernetes
metallb # (core) Loadbalancer for your Kubernetes cluster
metrics-server # (core) K8s Metrics Server for API access to service metrics
rbac # (core) Role-Based Access Control for authorisation
disabled:
cert-manager # (core) Cloud native certificate management
cis-hardening # (core) Apply CIS K8s hardening
community # (core) The community addons repository
dashboard # (core) The Kubernetes dashboard
gpu # (core) Automatic enablement of Nvidia CUDA
host-access # (core) Allow Pods connecting to Host services smoothly
hostpath-storage # (core) Storage class; allocates storage from host directory
ingress # (core) Ingress controller for external access
kube-ovn # (core) An advanced network fabric for Kubernetes
mayastor # (core) OpenEBS MayaStor
minio # (core) MinIO object storage
observability # (core) A lightweight observability stack for logs, traces and metrics
prometheus # (core) Prometheus operator for monitoring and logging
registry # (core) Private image registry exposed on localhost:32000
rook-ceph # (core) Distributed Ceph storage using Rook
storage # (core) Alias to hostpath-storage add-on, deprecated
root@glop-nm-110-mem1:/var/snap/microk8s/common#
root@glop-nm-110-mem1:/var/snap/microk8s/common# sudo -E LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
Logs for Node 2.
root@glop-nm-110-mem2:/var/snap/microk8s/common# cat .microk8s.yaml
version: 0.2.0
join:
  url: glop-nm-110-mem1.glcpdev.cloud.hpe.com:25000/d3b0c44298fc1c149afbf4c8996fb925
extraCNIEnv:
  IPv4_CLUSTER_CIDR: 172.23.0.0/16
  IPv4_SERVICE_CIDR: 172.29.0.0/23
extraKubeAPIServerArgs:
  --service-node-port-range: 80-32767
extraKubeletArgs:
  --max-pods: 555
  --cluster-domain: cluster.local
  --cluster-dns: 172.29.0.10
extraContainerdArgs:
  --root: /data/var/lib/containerd/
  --state: /data/run/containerd/
extraSANs:
  - 172.29.0.1
  - 172.23.0.1
root@glop-nm-110-mem2:/var/snap/microk8s/common# microk8s status
microk8s is running
high-availability: no
datastore master nodes: glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
datastore standby nodes: none
addons:
enabled:
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm # (core) Helm - the package manager for Kubernetes
helm3 # (core) Helm 3 - the package manager for Kubernetes
metallb # (core) Loadbalancer for your Kubernetes cluster
metrics-server # (core) K8s Metrics Server for API access to service metrics
rbac # (core) Role-Based Access Control for authorisation
disabled:
cert-manager # (core) Cloud native certificate management
cis-hardening # (core) Apply CIS K8s hardening
community # (core) The community addons repository
dashboard # (core) The Kubernetes dashboard
gpu # (core) Automatic enablement of Nvidia CUDA
host-access # (core) Allow Pods connecting to Host services smoothly
hostpath-storage # (core) Storage class; allocates storage from host directory
ingress # (core) Ingress controller for external access
kube-ovn # (core) An advanced network fabric for Kubernetes
mayastor # (core) OpenEBS MayaStor
minio # (core) MinIO object storage
observability # (core) A lightweight observability stack for logs, traces and metrics
prometheus # (core) Prometheus operator for monitoring and logging
registry # (core) Private image registry exposed on localhost:32000
rook-ceph # (core) Distributed Ceph storage using Rook
storage # (core) Alias to hostpath-storage add-on, deprecated
root@glop-nm-110-mem2:/var/snap/microk8s/common# sudo -E LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
Logs for Node 3.
root@glop-nm-110-mem3:/var/snap/microk8s/common# cat .microk8s.yaml
version: 0.2.0
join:
  url: glop-nm-110-mem1.glcpdev.cloud.hpe.com:25000/d3b0c44298fc1c149afbf4c8996fb925
extraCNIEnv:
  IPv4_CLUSTER_CIDR: 172.23.0.0/16
  IPv4_SERVICE_CIDR: 172.29.0.0/23
extraKubeAPIServerArgs:
  --service-node-port-range: 80-32767
extraKubeletArgs:
  --max-pods: 555
  --cluster-domain: cluster.local
  --cluster-dns: 172.29.0.10
extraContainerdArgs:
  --root: /data/var/lib/containerd/
  --state: /data/run/containerd/
extraSANs:
  - 172.29.0.1
  - 172.23.0.1
root@glop-nm-110-mem3:/home/core# microk8s status
microk8s is running
high-availability: no
datastore master nodes: none
datastore standby nodes: none
addons:
enabled:
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm # (core) Helm - the package manager for Kubernetes
helm3 # (core) Helm 3 - the package manager for Kubernetes
metallb # (core) Loadbalancer for your Kubernetes cluster
metrics-server # (core) K8s Metrics Server for API access to service metrics
rbac # (core) Role-Based Access Control for authorisation
disabled:
cert-manager # (core) Cloud native certificate management
cis-hardening # (core) Apply CIS K8s hardening
community # (core) The community addons repository
dashboard # (core) The Kubernetes dashboard
gpu # (core) Automatic enablement of Nvidia CUDA
host-access # (core) Allow Pods connecting to Host services smoothly
hostpath-storage # (core) Storage class; allocates storage from host directory
ingress # (core) Ingress controller for external access
kube-ovn # (core) An advanced network fabric for Kubernetes
mayastor # (core) OpenEBS MayaStor
minio # (core) MinIO object storage
observability # (core) A lightweight observability stack for logs, traces and metrics
prometheus # (core) Prometheus operator for monitoring and logging
registry # (core) Private image registry exposed on localhost:32000
rook-ceph # (core) Distributed Ceph storage using Rook
storage # (core) Alias to hostpath-storage add-on, deprecated
root@glop-nm-110-mem3:/home/core#
root@glop-nm-110-mem3:/home/core#
root@glop-nm-110-mem3:/home/core# sudo -E LD_LIBRARY_PATH=/snap/microk8s/current/usr/lib /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"
glop-nm-110-mem1.glcpdev.cloud.hpe.com:19001
root@glop-nm-110-mem3:/home/core# microk8s kubectl get nodes
NAME STATUS ROLES AGE VERSION
glop-nm-110-mem1.glcpdev.cloud.hpe.com Ready
Node Inspect reports Node1: node-1-inspection-report-20240814_110705.tar.gz
Hi @mathnitin,
thank you for reporting the new issue with the dev fix, including the inspection reports. We are taking a careful look at your logs and trying to create a reproducer ourselves.
Thank you for your patience and your help in improving our solution!
Hi @mathnitin, could you please let us know which node you disconnected from the network?
For the inspect report, we disconnected node 1. Since it was the dqlite leader, the cluster became unhealthy. Our observation is that the cluster should be running in HA mode, but somehow this cluster lost HA and only node 1 is recognized as a datastore master. kubectl get nodes does show that all 3 nodes are part of the k8s cluster.
Hello @mathnitin,
Thank you for providing the additional details. Unfortunately, we weren't able to reproduce the issue on the dev version. Could you please share the exact steps we need to follow to reproduce it? On our side, removing and re-joining the node from the cluster doesn’t seem to trigger the failure.
I will be unavailable next week, but @berkayoz will be taking over and can help you in my absence.
@louiseschmidtgen @berkayoz We tried to reproduce the HA disconnect on our side and are also not able to reproduce it. We think our VMware snapshot was somehow corrupted, as we see this issue every time we revert to that snapshot. Were you able to find anything from the inspect reports?
Also, do we have any insight into why the data plane is lost for approximately 30 seconds? When we take the dqlite leader out, this spans over 1 min 40 sec for us.
For data plane testing, we started 3 nginx pods with pod anti-affinity and a NodePort service. We ran curl with a timeout of 1 second. For this test, we see connection timeouts of 10 to 30+ seconds.
@mathnitin
From the inspection reports, we see that the `cluster.yaml` that contains the dqlite members is missing node3; this is consistent across all 3 nodes. This could be related to a disturbance/issue happening while the 3rd node was joining. There is a small window between a joining node being accepted and the `cluster.yaml` files being updated on the cluster members. Since we can observe the 3rd node in the Kubernetes cluster, the join operation was successful, but `cluster.yaml` possibly could not get updated in time.
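To double-check this on your side, the membership each node has recorded can be read directly from that file (a sketch using the default MicroK8s path):

```bash
# Run on each node: the dqlite members this node knows about
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
```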
How fast was this snapshot created? Could it have been right after the node3 join operation?
Could you provide more information (the deployment manifest, etc.) and possible reproduction steps for the data plane connection/timeout issues you've mentioned?
Thank you.
@berkayoz Please see the comments inline.
> How fast was this snapshot created? Could it have been right after the node3 join operation?

The snapshot was created after making sure the cluster was in a healthy state. However, we are not able to recreate this issue.

> Could you provide more information (the deployment manifest, etc.) and possible reproduction steps for the data plane connection/timeout issues you've mentioned?

Below are the nginx YAML files we are deploying. We have exposed the same nginx deployment with MetalLB as well.
$ cat nginx-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-nginx-service
spec:
  type: NodePort
  selector:
    app: my-nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
$ cat nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-nginx
  template:
    metadata:
      labels:
        app: my-nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - my-nginx
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
Below are the sample scripts we are using to check whether the data plane is operational. First, the MetalLB script:
#!/bin/bash
# URL to check
url="http://<VIP>"
# Counter for non-200 responses
non_200_count=0
while true
do
  # Perform the request
  status_code=$(curl --connect-timeout 1 -m 1 -s -o /dev/null -w "%{http_code}" "$url")
  echo $status_code
  echo $url
  date
  # Check if the status code is not 200
  if [ "$status_code" != "200" ]; then
    echo "Response code: $status_code"
    echo $url
    date
    non_200_count=$((non_200_count + 1))
  fi
  sleep 1
  if [ "$non_200_count" == "1000" ]; then
    # Print the count of non-200 responses
    echo "Count of non-200 responses: $non_200_count"
    break
  fi
done
# Print the count of non-200 responses
echo "Count of non-200 responses: $non_200_count"
Below is the NodePort script. We make sure the node IP is not that of the node we have brought down.
#!/bin/bash
# URL to check
url="http://<NODE_IP>:<NODE_PORT>"
# Counter for non-200 responses
non_200_count=0
while true
do
  # Perform the request
  status_code=$(curl --connect-timeout 1 -m 1 -s -o /dev/null -w "%{http_code}" "$url")
  echo $url
  echo $status_code
  date
  # Check if the status code is not 200
  if [ "$status_code" != "200" ]; then
    echo "Response code: $status_code"
    echo $url
    date
    non_200_count=$((non_200_count + 1))
  fi
  sleep 1
  if [ "$non_200_count" == "1000" ]; then
    # Print the count of non-200 responses
    echo "Count of non-200 responses: $non_200_count"
    break
  fi
done
# Print the count of non-200 responses
echo "Count of non-200 responses: $non_200_count"
Hey @mathnitin,
We are working toward a final fix and are currently looking into the go-dqlite side of things with the team.
I'll provide some comments related to the feedback you have provided on the dev version/possible fix.

> With this fix, we are seeing a delay in the detection of nodes for "NotReady" state. The detection can take from 1 minute to 5 minutes. We are running watch on kubectl get nodes.

I've run some tests regarding this; my findings are as follows:

- The detection of the NotReady or Ready state for a node that is not the dqlite leader takes ~40s, which is aligned with the Kubernetes default.
- The detection for the node that is the dqlite leader takes ~80s-120s. This might be related to the datastore not being available while leader re-election is happening, which might lead to missing/dropping detection cycles. I am looking more into this situation.

> For data plane testing, we started 3 nginx pods with nodeantiaffinity and nodeport. We ran curl with timeout of 1 sec. For this test, we see connection timeouts of 10 sec to 30+ sec.

I've tried to reproduce this with the NodePort approach. My observations are as follows:

- Requests fail until the affected node is marked NotReady and the pod is removed from the service selector.
- This is bounded by the ~40s default detection time.
- There is the extra delay before the node is marked NotReady when it is the dqlite leader, which is explained above.

> We saw a new issue on the dev version. For one of our runs when we disconnected the network microk8s looses HA.

We could not reproduce this, and we believe the issue is not related to the patch in the dev version.
I'll keep updating here with new progress; let me know if you have any other questions or observations.
Hey @mathnitin
I've looked more into your feedback and I have some extra comments.
> With this fix, we are seeing a delay in the detection of nodes for "NotReady" state. The detection can take from 1 minute to 5 minutes. We are running watch on kubectl get nodes.

I've stated previously that there was an extra delay for a node that is also the dqlite leader. On testing, the first created node is usually the dqlite leader. Additionally, this node will also be the leader for the kube-controller-manager and kube-scheduler components. Taking down this node leads to multiple leader elections:

- kube-controller-manager will perform a leader election, and will have to wait for the datastore to settle first since leases are used.
- kube-scheduler will perform a leader election, and will have to wait for the datastore to settle first since leases are used.

The dqlite leader election happens pretty quickly. For kube-controller-manager and kube-scheduler, MicroK8s adjusts the leader election configuration in these components to lower resource consumption. These adjustments are:

--leader-elect-lease-duration=60s
--leader-elect-renew-deadline=30s

You can override these, e.g. --leader-elect-lease-duration=15s and --leader-elect-renew-deadline=10s to match the Kubernetes defaults, in the following files:

/var/snap/microk8s/current/args/kube-scheduler
/var/snap/microk8s/current/args/kube-controller-manager
This will result in a quicker node fail-over and status detection.
These changes should also reduce the period of failing requests in the nginx data plane testing.
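A minimal sketch of applying those overrides (this assumes the two flags already appear in both args files with the 60s/30s values; adjust if your files differ):

```bash
# Lower the leader-election timings for kube-controller-manager and kube-scheduler
for f in kube-controller-manager kube-scheduler; do
  sudo sed -i 's/^--leader-elect-lease-duration=.*/--leader-elect-lease-duration=15s/' /var/snap/microk8s/current/args/$f
  sudo sed -i 's/^--leader-elect-renew-deadline=.*/--leader-elect-renew-deadline=10s/' /var/snap/microk8s/current/args/$f
done
# Restart MicroK8s on the node so both components pick up the new flags
sudo microk8s stop && sudo microk8s start
```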
@berkayoz Thanks for the recommendation. We tried the configuration changes. For the network-disconnect use case, we are noticing that the control plane detection of the NotReady state is faster. The data plane loss numbers remain the same.
You are correct that the data plane loss is not a complete loss; these are intermittent failures. We would have assumed the failures would be in round-robin fashion, however these failures are consistent for a few seconds in batches. Is there a way we can improve this?
Hey @mathnitin,
Kube-proxy in iptables mode selects the endpoint randomly; kube-proxy in ipvs mode has more options for load-balancing and uses round robin by default. This might match the round-robin expectation and might improve the failures. We are working on testing this change and on providing how-to steps and more information related to it. We will update here with a follow-up comment.
It could also be possible to declare a node NotReady faster by changing the --node-monitor-grace-period kube-controller-manager flag. This is 40s by default, in alignment with the upstream value. Lowering this value could reduce the request failure period but could result in undesired side effects if lowered too much.
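For illustration, a hedged sketch of lowering that grace period (the 20s value is only an example, and the flag is assumed not to be present in the args file yet):

```bash
# Tell kube-controller-manager to mark unreachable nodes NotReady sooner
echo '--node-monitor-grace-period=20s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
# Restart MicroK8s on the node so kube-controller-manager picks up the flag
sudo microk8s stop && sudo microk8s start
```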
Also seeing what appears to be the same issue here on v1.29, which we have been trying to bottom out. Easily reproducible, sometimes with all 3 nodes going NotReady for a similar time as above.
Happy to provide further logs or also test fixes if appropriate.
Also seeing this on 1.29.4 deployments with 3 nodes. As above, can provide config, logs or test potential fixes.
Hello @cs-dsmyth and @kcarson77, we are back-porting the fix into all supported microk8s versions (1.28-1.31).
Hello @mathnitin,
the fix is now in the MicroK8s `1.28/stable` channel.
@cs-dsmyth, @kcarson77 For MicroK8s channels `1.28-strict` to `latest`, the fix will make its way from beta into the stable channel by the beginning of next week.
Thank you for raising the issue and for providing data and helping with testing to reach the solution.
Hello @mathnitin,
I would like to point you to the ipvs kube-proxy mode to address the intermittent failures you are seeing when a node is removed from your cluster. I have tested ipvs mode with your nginx scripts on a dev snap and can confirm that the failures are in round-robin fashion. Unfortunately, ipvs mode does not currently work on MicroK8s 1.28 due to a Calico issue with ipset, which is addressed in a newer Calico version that will land with MicroK8s 1.32.
We will publish documentation on how to run kube-proxy in ipvs mode in MicroK8s 1.32.
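Until that documentation lands, a rough sketch of what the switch would involve on a release where it is supported (not 1.28, as noted above; it assumes the ipvs kernel modules are available on the host):

```bash
# Load the ipvs kernel modules (persist them via /etc/modules-load.d/ if desired)
sudo modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh
# Ask kube-proxy to use ipvs instead of iptables
echo '--proxy-mode=ipvs' | sudo tee -a /var/snap/microk8s/current/args/kube-proxy
# Restart MicroK8s so kube-proxy picks up the flag
sudo microk8s stop && sudo microk8s start
```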
@louiseschmidtgen Can you please provide the PR you merged in the dqlite repo and the microk8s 1.28 branch? We are following the https://discuss.kubernetes.io/t/howto-enable-fips-mode-operation/25067 steps to build the private snap package and realized the changes are not merged into the fips branch.
Hello @mathnitin,
This is the patch PR for 1.28 (classic) MicroK8s: https://github.com/canonical/microk8s/pull/4651. This is the new k8s-dqlite tag, v1.1.11, with the patch: https://github.com/canonical/k8s-dqlite/pull/161. The MicroK8s fips branch points to k8s-dqlite master (which has the fix): https://github.com/canonical/microk8s/blob/fips/build-scripts/components/k8s-dqlite/version.sh.
If you encounter any issues building the fips snap please open another issue and we will be happy to help you resolve them.
Hi @mathnitin,
if you are building the fips snap, I would recommend pointing k8s-dqlite to the latest tag, v1.2.0, instead of master, as master is under development.
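A hypothetical sketch of pinning that tag before building (it assumes version.sh follows the usual microk8s convention of echoing the git ref to build; please check the file in your checkout first):

```bash
# In a checkout of the microk8s fips branch
git clone -b fips https://github.com/canonical/microk8s.git && cd microk8s
# Pin the k8s-dqlite component to the v1.2.0 tag instead of master
cat > build-scripts/components/k8s-dqlite/version.sh <<'EOF'
#!/bin/bash
echo "v1.2.0"
EOF
# Then build the snap following the FIPS how-to linked earlier in the thread
```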
I hope your project goes well. Thank you again for contributing to the fix; I will be closing this issue.
Summary
We have a 3-node MicroK8s HA-enabled cluster running microk8s version 1.28.7. If one of the 3 nodes (say node3) experiences a power outage or network glitch and is not recoverable, another node (say node1) goes into the NotReady state. Node1 stays NotReady for about 15+ minutes; sometimes this can take up to 30 minutes.
What Should Happen Instead?
Only 1 node should be in the NotReady state. The other 2 nodes should be healthy and working.
Reproduction Steps
Set up a 3-node HA-enabled MicroK8s cluster on 1.28.7 and disconnect the network (or power) of one node.
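A rough sketch of the reproduction on three VMs (node1-node3), assuming a plain 1.28 install with HA enabled is sufficient:

```bash
# On every node
sudo snap install microk8s --classic --channel=1.28/stable
# On node1: print a join command, then run it on node2 and node3
sudo microk8s add-node
# Wait until microk8s status reports high-availability: yes and all nodes are Ready,
# then disconnect the network adapter of one node (e.g. node3) at the hypervisor and watch:
watch kubectl get nodes
```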
Introspection Report
Node 1 Inspect report inspection-report-20240801_130021.tar.gz
Node 2 Inspect report inspection-report-20240801_131139.tar.gz
Node 3 Inspect report inspection-report-20240801_130117.tar.gz
Additional information
Timeline for the attached inspect reports (approximate times, PST):
- Aug 1 12:40: node3 network was disconnected (manually triggered).
- Aug 1 12:41: node1 went into the NotReady state.
- Aug 1 12:56: node1 recovered.
- Aug 1 12:59: node3 network was re-established.
- Aug 1 13:01: all nodes are in a healthy state.