canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

k8s-dqlite spiking cpu core to 100% #3227

Open · WilliamG-LORA opened this issue 2 years ago

WilliamG-LORA commented 2 years ago

Summary

I've set up a 4-node MicroK8s cluster on bare-metal machines. Every now and then the /snap/microk8s/3204/bin/k8s-dqlite process will spike one of the cores on one of my nodes to 100% usage, sending my fans into overdrive.

I can see that RAM usage is low and all the other cores are running at <6% usage (as seen in htop).

The specs of the machines are as follows:

The cluster has the metallb, dns, rbac, and storage addons enabled. I've also deployed Rook-Ceph on the cluster.
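
For context, the addons were enabled with commands of this form (the MetalLB address range below is a placeholder, not the one actually used):

microk8s enable dns
microk8s enable rbac
microk8s enable storage
microk8s enable metallb:10.0.0.100-10.0.0.120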

What Should Happen Instead?

It shouldn't be using over 100% of a core.

Reproduction Steps

  1. Create a microk8s cluster
  2. Deploy Rook-Ceph
  3. Wait a bit.
  4. I'm not sure how to properly reproduce this issue...

Introspection Report

inspection-report-20220608_143601.tar.gz

texaclou commented 1 year ago

For comparison: the exact same cluster, functional, after a complete MicroK8s reinstall.

inspection-report-20230504_084810.tar.gz

perf graph (attachment)

jgc234 commented 1 year ago

I seem to be bumping into a similar problem. After running fine for a few weeks, kubelite has been crashing on all nodes every 2 minutes for the past few days. It appears to be from delayed responses and eventual timeouts to dqlite, although there are so many things failing that I find it difficult to tell.

k8s-dqlite is pegged at 100% of a core - it looks transient from a single node, but when you look across the whole cluster you can see it rotating around the dqlite voter nodes. So there's always one node with k8s-dqlite consuming 100%, and it changes every few seconds.

Restarting the whole cluster will get at least one node's kubelite going long enough to run a few kubectl commands, then the timeouts eventually stop those too. The node(s) that work temporarily seem to vary.

If I disable all microk8s services and only start k8s-dqlite (i.e. nothing talking to it), the k8s-dqlite process still pegs at 100%. As noted above, it's only a single host at any one time. It rotates around the voter nodes for a while, then the leader node pegs at 100% for a long while (I've watched it stuck at 100% for an hour with no other microk8s services running).

If I start the rest of the microk8s services, dqlite goes back to rotating at 100% on each voter node for about 5-20 seconds per node (it seems to get progressively worse). The CPU usage of k8s-dqlite drops after a while, but Kubernetes is completely dead by that point. I tried to look at the perf data for the process, but I'm not sure how to resolve the symbols from the libraries, which is where it's all happening - not much popping back out as system calls.
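
(A capture along the following lines produces that kind of perf data; the options shown are illustrative rather than necessarily the exact ones used, and frames from the snap's stripped libraries may still show up as raw addresses.)

PID=$(pidof k8s-dqlite)                      # the busy k8s-dqlite process on this node
sudo perf record -g -p "$PID" -- sleep 120   # sample call stacks for ~2 minutes
sudo perf report                             # interactive report; library frames need debug symbols to resolve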

I tried a command-line dqlite query (a simple SELECT) and it returned instantaneously, but I have no idea what the schema looks like underneath to do a proper test. Also, this was over TCP, not over the UNIX socket that I think kubelite is communicating through.
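
(For anyone wanting to repeat that test: a query of this sort can be run against the embedded database roughly as below. The paths are the MicroK8s defaults; the flags and the kine table name are assumptions based on the dqlite CLI and the kine layer k8s-dqlite uses, so adjust as needed.)

sudo /snap/microk8s/current/bin/dqlite \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  k8s "SELECT COUNT(*) FROM kine"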

A rough overview: MicroK8s v1.27.1 (revision 5250, latest/stable), 4 nodes, all running Ubuntu 23.04 as KVM guests (48GB RAM, 6 CPUs each), all hosted on the same hypervisor (Ubuntu 22.04.2 LTS).

This is as good as it gets, but it only lasts a few minutes.

root@node00:~# microk8s kubectl get nodes -o wide
NAME     STATUS     ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
node00   NotReady   <none>   22d   v1.27.1   10.232.8.70   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
node01   Ready      <none>   21d   v1.27.1   10.232.8.71   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
node03   Ready      <none>   21d   v1.27.1   10.232.8.73   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
node02   Ready      <none>   21d   v1.27.1   10.232.8.72   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
root@node00:~# microk8s kubectl get nodes -o wide
NAME     STATUS     ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
node02   NotReady   <none>   22d   v1.27.1   10.232.8.72   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
node03   NotReady   <none>   22d   v1.27.1   10.232.8.73   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
node01   NotReady   <none>   22d   v1.27.1   10.232.8.71   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
node00   Ready      <none>   22d   v1.27.1   10.232.8.70   <none>        Ubuntu 23.04   6.2.0-20-generic   containerd://1.6.15
root@node01:~# microk8s kubectl get nodes -o wide
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)

Lots of these - every 3 seconds, on all nodes. Before the cluster broke, these messages didn't exist, but a locking error popped up occasionally.

May 26 19:10:13 node01 microk8s.daemon-k8s-dqlite[1451]: time="2023-05-26T19:10:13+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:10:16 node01 microk8s.daemon-k8s-dqlite[1451]: time="2023-05-26T19:10:16+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:10:19 node01 microk8s.daemon-k8s-dqlite[1451]: time="2023-05-26T19:10:19+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:10:22 node01 microk8s.daemon-k8s-dqlite[1451]: time="2023-05-26T19:10:22+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:10:25 node01 microk8s.daemon-k8s-dqlite[1451]: time="2023-05-26T19:10:25+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:10:28 node01 microk8s.daemon-k8s-dqlite[1451]: time="2023-05-26T19:10:28+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:10:31 node01 microk8s.daemon-k8s-dqlite[1451]: time="2023-05-26T19:10:31+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"

A sample of logs.

May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/storageclasses/ /registry/storageclasses/: query (try: 0): context deadline exceeded"
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/leases/kube-system/ /registry/leases/kube-system/: query (try: 0): context canceled"
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/leases/kube-system/ /registry/leases/kube-system/: query (try: 0): context canceled"
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/ingressclasses/ /registry/ingressclasses/: query (try: 0): context canceled"
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.487350   12996 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Role: failed to list *v1.Role: the server was unable to return a response in the time allotte>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.488369   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.488669   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.488708   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.488874   12996 reflector.go:533] pkg/controlplane/controller/apiserverleasegc/gc_controller.go:81: failed to list *v1.Lease: the server was unable to return a response in the time allotted, but ma>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.488885   12996 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.StorageClass: the server was unable to return a response in the time allotted, but may still b>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.488922   12996 trace.go:219] Trace[2118166691]: "Reflector ListAndWatch" name:pkg/controlplane/controller/apiserverleasegc/gc_controller.go:81 (26-May-2023 18:59:44.454) (total time: 60034ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[2118166691]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io) 60034ms (19:00:44.488)
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[2118166691]: [1m0.034461205s] [1m0.034461205s] END
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/validatingwebhookconfigurations/ /registry/validatingwebhookconfigurations/: query (try: 0): context canceled"
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/resourcequotas/ /registry/resourcequotas/: query (try: 0): context canceled"
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.488938   12996 reflector.go:148] pkg/controlplane/controller/apiserverleasegc/gc_controller.go:81: Failed to watch *v1.Lease: failed to list *v1.Lease: the server was unable to return a response i>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.488939   12996 trace.go:219] Trace[1138694341]: "Reflector ListAndWatch" name:vendor/k8s.io/client-go/informers/factory.go:150 (26-May-2023 18:59:44.454) (total time: 60034ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1138694341]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get storageclasses.storage.k8s.io) 60034ms (19:00:44.488)
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1138694341]: [1m0.034533433s] [1m0.034533433s] END
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.488956   12996 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: the server was unable to return a response in >
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.488889   12996 reflector.go:533] pkg/controlplane/controller/apiserverleasegc/gc_controller.go:81: failed to list *v1.Lease: the server was unable to return a response in the time allotted, but ma>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.488991   12996 trace.go:219] Trace[177358991]: "Reflector ListAndWatch" name:pkg/controlplane/controller/apiserverleasegc/gc_controller.go:81 (26-May-2023 18:59:44.455) (total time: 60033ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[177358991]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io) 60033ms (19:00:44.488)
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[177358991]: [1m0.033638016s] [1m0.033638016s] END
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.489000   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.489003   12996 reflector.go:148] pkg/controlplane/controller/apiserverleasegc/gc_controller.go:81: Failed to watch *v1.Lease: failed to list *v1.Lease: the server was unable to return a response i>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.489251   12996 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.IngressClass: the server was unable to return a response in the time allotted, but may still b>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.489300   12996 trace.go:219] Trace[1312317624]: "Reflector ListAndWatch" name:vendor/k8s.io/client-go/informers/factory.go:150 (26-May-2023 18:59:44.455) (total time: 60033ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1312317624]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get ingressclasses.networking.k8s.io) 60033ms (19:00:44.489)
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1312317624]: [1m0.033886267s] [1m0.033886267s] END
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.489315   12996 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.IngressClass: failed to list *v1.IngressClass: the server was unable to return a response in >
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.489372   12996 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.489441   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.489778   12996 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.ValidatingWebhookConfiguration: the server was unable to return a response in the time allotte>
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/configmaps/kube-system/ /registry/configmaps/kube-system/: query (try: 0): context canceled"
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/pods/ /registry/pods/: query (try: 0): context deadline exceeded"
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/clusterroles/ /registry/clusterroles/: query (try: 0): context deadline exceeded"
May 26 19:00:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:00:44+10:00" level=error msg="error while range on /registry/runtimeclasses/ /registry/runtimeclasses/: query (try: 0): context deadline exceeded"
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.489852   12996 trace.go:219] Trace[1649269503]: "Reflector ListAndWatch" name:vendor/k8s.io/client-go/informers/factory.go:150 (26-May-2023 18:59:44.456) (total time: 60033ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1649269503]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get validatingwebhookconfigurations.admissionregistration.k8>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1649269503]: [1m0.033711873s] [1m0.033711873s] END
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.489870   12996 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.ValidatingWebhookConfiguration: failed to list *v1.ValidatingWebhookConfiguration: the server>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.489852   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.490140   12996 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.ResourceQuota: the server was unable to return a response in the time allotted, but may still >
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.490188   12996 trace.go:219] Trace[608453332]: "Reflector ListAndWatch" name:vendor/k8s.io/client-go/informers/factory.go:150 (26-May-2023 18:59:44.456) (total time: 60033ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[608453332]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get resourcequotas) 60033ms (19:00:44.490)
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[608453332]: [1m0.033384183s] [1m0.033384183s] END
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.490203   12996 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.ResourceQuota: failed to list *v1.ResourceQuota: the server was unable to return a response i>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.491188   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.491449   12996 reflector.go:533] pkg/controlplane/controller/clusterauthenticationtrust/cluster_authentication_trust_controller.go:444: failed to list *v1.ConfigMap: the server was unable to retur>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.491495   12996 trace.go:219] Trace[1510503521]: "Reflector ListAndWatch" name:pkg/controlplane/controller/clusterauthenticationtrust/cluster_authentication_trust_controller.go:444 (26-May-2023 18:>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1510503521]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps) 60033ms (19:00:44.491)
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1510503521]: [1m0.033242845s] [1m0.033242845s] END
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.491509   12996 reflector.go:148] pkg/controlplane/controller/clusterauthenticationtrust/cluster_authentication_trust_controller.go:444: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap:>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.492475   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.492619   12996 status.go:71] apiserver received an error that is not an metav1.Status: &status.Error{s:(*status.Status)(0xc0007b1a78)}: rpc error: code = Unknown desc = query (try: 0): context dea>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.492645   12996 status.go:71] apiserver received an error that is not an metav1.Status: &status.Error{s:(*status.Status)(0xc0075544c0)}: rpc error: code = Unknown desc = query (try: 0): context dea>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.492840   12996 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.RuntimeClass: the server was unable to return a response in the time allotted, but may still b>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.492890   12996 trace.go:219] Trace[1495591756]: "Reflector ListAndWatch" name:vendor/k8s.io/client-go/informers/factory.go:150 (26-May-2023 18:59:44.459) (total time: 60033ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1495591756]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get runtimeclasses.node.k8s.io) 60033ms (19:00:44.492)
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[1495591756]: [1m0.033539879s] [1m0.033539879s] END
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:44.492906   12996 reflector.go:148] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.RuntimeClass: failed to list *v1.RuntimeClass: the server was unable to return a response in >
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.492956   12996 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be process>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: W0526 19:00:44.492965   12996 reflector.go:533] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.ClusterRole: the server was unable to return a response in the time allotted, but may still be>
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:44.492994   12996 trace.go:219] Trace[873362192]: "Reflector ListAndWatch" name:vendor/k8s.io/client-go/informers/factory.go:150 (26-May-2023 18:59:44.459) (total time: 60033ms):
May 26 19:00:44 node02 microk8s.daemon-kubelite[12996]: Trace[873362192]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get pods) 60033ms (19:00:44.492)

...

May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.720813   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.720960   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721049   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721125   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721197   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721201   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721302   12996 timeout.go:142] post-timeout activity - time-elapsed: 2.217553ms, GET "/apis/crd.projectcalico.org/v1/clusterinformations/default" result: <nil>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721333   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721342   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721464   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721670   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.720789   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721725   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.721852   12996 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, the server was unable to return a response in the time allotted, but may still be processing t>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.723889   12996 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.725050   12996 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.730052   12996 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: I0526 19:00:58.731449   12996 trace.go:219] Trace[1848156781]: "Get" accept:application/vnd.kubernetes.protobuf, */*,audit-id:dd544ef6-fbbd-44ed-8e05-217a8dd509b2,client:::1,protocol:HTTP/2.0,resource:serviceacc>
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: Trace[1848156781]: [1m0.011370865s] [1m0.011370865s] END
May 26 19:00:58 node02 microk8s.daemon-kubelite[12996]: E0526 19:00:58.731764   12996 timeout.go:142] post-timeout activity - time-elapsed: 11.430535ms, GET "/api/v1/namespaces/kube-system/serviceaccounts/calico-node" result: <nil>
May 26 19:01:01 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:01+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:01:04 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:04+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:07 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:07+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:01:10 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:10+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:01:13 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:13+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:01:16 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:16+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:01:19 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:19+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:22 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:22+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:25 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:25+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:26.008270   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:26.008434   12996 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
May 26 19:01:26 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:26+10:00" level=error msg="error while range on /registry/minions/ /registry/minions/: query (try: 0): context canceled"
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:26.009529   12996 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:26.010685   12996 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: I0526 19:01:26.012456   12996 trace.go:219] Trace[1083259462]: "List" accept:application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json,audit-id:84fde1f9-93ac>
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: Trace[1083259462]: ["List(recursive=true) etcd3" audit-id:84fde1f9-93ac-4b83-a518-68da7a8d8393,key:/minions,resourceVersion:,resourceVersionMatch:,limit:500,continue: 60004ms (19:00:26.008)]
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: Trace[1083259462]: [1m0.004292191s] [1m0.004292191s] END
May 26 19:01:26 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:26.012582   12996 timeout.go:142] post-timeout activity - time-elapsed: 4.303829ms, GET "/api/v1/nodes" result: <nil>
May 26 19:01:28 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:28+10:00" level=error msg="error while range on /registry/health : query (try: 0): context deadline exceeded"
May 26 19:01:31 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:31+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:34 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:34+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:37 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:37+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:40 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:40+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:43 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:43+10:00" level=error msg="error while range on /registry/health : query (try: 0): context canceled"
May 26 19:01:44 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:44+10:00" level=error msg="error while range on /registry/leases/kube-system/apiserver-kjyqtto24pdthbxs3qj5czwgvi : query (try: 0): context deadline exceeded"
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:44.682739   12996 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:44.682871   12996 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:44.683008   12996 controller.go:146] "Failed to ensure lease exists, will retry" err="the server was unable to return a response in the time allotted, but may still be processing the request (get lea>
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:44.684129   12996 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:44.685239   12996 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: I0526 19:01:44.686353   12996 trace.go:219] Trace[1455650653]: "Get" accept:application/vnd.kubernetes.protobuf, */*,audit-id:f19ceda9-fab9-4828-b4c4-d7dcb0b85185,client:::1,protocol:HTTP/2.0,resource:leases,sco>
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: Trace[1455650653]: [1m0.003918036s] [1m0.003918036s] END
May 26 19:01:44 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:44.686419   12996 timeout.go:142] post-timeout activity - time-elapsed: 3.687069ms, GET "/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/apiserver-kjyqtto24pdthbxs3qj5czwgvi" result: <nil>
May 26 19:01:45 node02 microk8s.daemon-k8s-dqlite[1365]: time="2023-05-26T19:01:45+10:00" level=error msg="error while range on /registry/apiextensions.k8s.io/customresourcedefinitions/ /registry/apiextensions.k8s.io/customresourcedefinitions/: query (try: 0): contex>
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:45.334543   12996 status.go:71] apiserver received an error that is not an metav1.Status: &status.Error{s:(*status.Status)(0xc007554048)}: rpc error: code = Unknown desc = query (try: 0): context dea>
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:45.334778   12996 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: W0526 19:01:45.334930   12996 reflector.go:533] pkg/client/informers/externalversions/factory.go:132: failed to list *v1.CustomResourceDefinition: the server was unable to return a response in the time allotted,>
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: I0526 19:01:45.334986   12996 trace.go:219] Trace[1698568897]: "Reflector ListAndWatch" name:pkg/client/informers/externalversions/factory.go:132 (26-May-2023 19:00:45.333) (total time: 60001ms):
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: Trace[1698568897]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get customresourcedefinitions.apiextensions.k8s.io) 60001ms >
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: Trace[1698568897]: [1m0.001510594s] [1m0.001510594s] END
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:45.335007   12996 reflector.go:148] pkg/client/informers/externalversions/factory.go:132: Failed to watch *v1.CustomResourceDefinition: failed to list *v1.CustomResourceDefinition: the server was una>
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:45.335942   12996 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: E0526 19:01:45.337079   12996 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: I0526 19:01:45.338214   12996 trace.go:219] Trace[1775507505]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:59822732-5526-4fce-998f-0481a5458598,client:::1,protocol:HTTP/2.0,resource:customres>
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: Trace[1775507505]: ["List(recursive=true) etcd3" audit-id:59822732-5526-4fce-998f-0481a5458598,key:/apiextensions.k8s.io/customresourcedefinitions,resourceVersion:0,resourceVersionMatch:,limit:500,continue: 6000>
May 26 19:01:45 node02 microk8s.daemon-kubelite[12996]: Trace[1775507505]: [1m0.004087505s] [1m0.004087505s] 

I originally focused on the authentication errors, but I can do a curl with certs and tokens and it seems OK, and kubectl works for a while.

I'm having trouble getting a full inspect output. If I get one, should I dump it here, or does this scenario look different from the problem stated in this issue? Any pointers would be much appreciated.

jgc234 commented 1 year ago

I've managed to recover my cluster. This is great news, since every HA-enabled MicroK8s cluster I've built in the past year or so has eventually ended up unresponsive after a few months, and I couldn't understand how to recover them - it was easier to rebuild a new cluster, re-deploy, and re-map the PVCs.

I deleted all nodes except one in cluster.yaml, and did a dqlite .reconfigure. On restart, kubelite (via dqlite) was still struggling with timeouts, possibly from the extra CPU contention of the additional containers crammed onto a single node, so I added another 10 cores to the node and that seemed to fix it. It's been up for 4 hours without a kubelite crash. Just to note - total CPU per node wasn't an issue previously, since the cluster couldn't get off the floor: there were no container workloads running, leaving 5 cores doing nothing and 1 busy on dqlite, and it still wouldn't start. Even with full workloads running, each node was previously running at 20-50% CPU load (5-min average).

As noted above, I noticed the dqlite 100% CPU rolling around the cluster, one node at a time. I was hoping to reduce the total delays seen by kubelite by removing the apparent round-robin through nodes - in the best case being stuck with the 5-second dqlite 100% peg on just one node instead of three nodes sequentially, which would reduce the total latency enough to prevent the avalanche of timeouts. It turned out that dqlite stopped CPU hogging on the single node too, although it seems pretty resource-sensitive.

I haven't been brave enough to add a new node back in yet.

I've got some inspection files and perf data files from the other nodes when they were running (and block latency and network stats between all nodes afterwards) if that's useful to anyone to have a look.

doctorpangloss commented 1 year ago

every HA-enabled microk8s I've built in the past year or so has eventually ended unresponsive after a few months

I migrated to vanilla Kubernetes.

ktsakalozos commented 1 year ago

I've got some inspection files and perf data files from the other nodes when they were running (and block latency and network stats between all nodes afterwards) if that's useful to anyone to have a look.

@jgc234 would it be possible to share here the files you gathered?

jgc234 commented 1 year ago

@jgc234 would it be possible to share here the files you gathered?

@ktsakalozos - files attached. Some of the commands in the inspect failed to collect all data, since the cluster didn't respond at some points, so I've included as many files as I could gather.

As a quick overview - the cluster seemed to start failing around 23 May 01:00-03:14. The first node00 inspection file has some logs of dqlite during this period and you can see it starting to fail. The next few days afterward were an attempt to work out what was going on and trying to get the cluster working again.

node00-inspection-report-20230523_193355.tar.gz node00-inspection-report-20230525_134616.tar.gz node00-inspection-report-20230526_192251.tar.gz node00-inspection-report-20230526_223240.tar.gz node01-inspection-report-20230523_125517.tar.gz node01-inspection-report-20230527_194345.tar.gz node02-inspection-report-20230524_084453.tar.gz node02-inspection-report-20230526_134011.tar.gz node02-inspection-report-20230526_223108.tar.gz node03-inspection-report-20230523_092151.tar.gz node03-inspection-report-20230524_084725.tar.gz

Some perf.data files of dqlite. Unfortunately I didn't keep a good log of what was going on at each time point, although most captures would have been about 120 seconds, since kubelite was crashing every 2 minutes.

node00-perf-20230527_214035.data.gz node00-perf-20230528_110233.data.gz node01-perf-20230527_214058.data.gz node01-perf-20230527_221158.data.gz node02-perf-20230526_164430.data.gz node02-perf-20230527_221255.data.gz node02-perf-20230527_222055.data.gz node03-perf-20230527_214321.data.gz node03-perf-20230527_222252.data.gz

Some network stats:

I couldn't get the bcc tools working (wrong kernel version?) to get real block-latency data. They do work on the underlying hypervisor, but there it's harder to work out what's what.

Here's an fio run on a node:

root@node00:~# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --readwrite=randrw --rwmixread=75 --size=4G --filename=blah.test
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=81.5MiB/s,w=26.4MiB/s][r=20.9k,w=6758 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1028593: Sun May 28 20:47:01 2023
  read: IOPS=27.5k, BW=108MiB/s (113MB/s)(3070MiB/28533msec)
   bw (  KiB/s): min=19896, max=307344, per=100.00%, avg=112247.91, stdev=52289.71, samples=56
   iops        : min= 4974, max=76836, avg=28062.00, stdev=13072.45, samples=56
  write: IOPS=9205, BW=36.0MiB/s (37.7MB/s)(1026MiB/28533msec); 0 zone resets
   bw (  KiB/s): min= 6384, max=101760, per=100.00%, avg=37511.48, stdev=17449.84, samples=56
   iops        : min= 1596, max=25440, avg=9377.86, stdev=4362.45, samples=56
  cpu          : usr=10.10%, sys=29.40%, ctx=424845, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=3070MiB (3219MB), run=28533-28533msec
  WRITE: bw=36.0MiB/s (37.7MB/s), 36.0MiB/s-36.0MiB/s (37.7MB/s-37.7MB/s), io=1026MiB (1076MB), run=28533-28533msec

Disk stats (read/write):
  vda: ios=780732/261830, merge=0/2498, ticks=93710/1657546, in_queue=1773617, util=98.81%

iperf from all other nodes to node01:

root@node01:/var/snap/microk8s/current# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  1] local 10.232.8.71 port 5001 connected with 10.232.8.73 port 43420 (icwnd/mss/irtt=14/1448/120)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0017 sec  20.3 GBytes  17.4 Gbits/sec
[  2] local 10.232.8.71 port 5001 connected with 10.232.8.72 port 52556 (icwnd/mss/irtt=14/1448/1006)
[ ID] Interval       Transfer     Bandwidth
[  2] 0.0000-10.0003 sec  23.5 GBytes  20.2 Gbits/sec
[  3] local 10.232.8.71 port 5001 connected with 10.232.8.70 port 52682 (icwnd/mss/irtt=14/1448/131)
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0024 sec  18.0 GBytes  15.5 Gbits/sec
[  4] local 10.232.8.71 port 5001 connected with 10.232.8.23 port 48146 (icwnd/mss/irtt=14/1448/151)
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.0020 sec  20.9 GBytes  17.9 Gbits/sec
texaclou commented 1 year ago

@ktsakalozos Do you have any clues about the source of this problem? Dqlite corruption? Or is it hiding another problem?

ktsakalozos commented 1 year ago

It seems there is no single source for the reported symptoms, but there is no indication of data corruption in any of the reported cases.

Dqlite uses SQLite to store data and employs the Raft protocol to ensure consensus on what is written to the database. On each database write request, the dqlite leader node contacts the peers holding a copy of the data and pushes the write request to them. As soon as a majority of the peers have persisted the data, the leader acknowledges the write request and commits it to the "master copy" it maintains. This mechanism is sensitive to disk I/O latency, network latency between the peers, and CPU starvation of the datastore process.

The above pain points are common in Raft-based datastores and known to databases in general. To address them, the Kubernetes architecture allows the user's workloads to be isolated from the control plane so that they do not compete with the datastore for CPU/disk/network resources. In the MicroK8s distribution you can add worker-only nodes using the --worker flag of the microk8s join command, and then use taints or disable scheduling so that user workloads are not hosted on the control-plane nodes. If you want to take this a step further, you can set up an external etcd [1] so as to dedicate specific disk/network/CPU resources to the datastore. This control-plane isolation would help debugging efforts too. At a high level, what you are looking for is an indication of why the datastore could not get the resources it needed at the specific time it requested them. It is much harder to identify the root cause of a resource shortage if user workloads are in the middle.

[1] https://microk8s.io/docs/external-etcd
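
As a rough illustration of that isolation (the address, token, and node name below are placeholders - microk8s add-node prints the real join string):

microk8s add-node                                  # run on an existing control-plane node; prints a join command and token
microk8s join 10.232.8.70:25000/<token> --worker   # run on the new machine to join it as a worker-only node
microk8s kubectl taint nodes node00 dedicated=control-plane:NoSchedule   # example taint to keep user pods off a control-plane node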

jgc234 commented 1 year ago

@ktsakalozos - thanks for the info. I'll see if I can get the cluster back into that state and experiment with the load separation. Given this is a shared underlying hypervisor for all 4 nodes, I'll try to host the node elsewhere to prevent it affecting performance indirectly.

I understand this problem in a normal state, but in its broken state the cluster isn't running any workloads - it's dead quiet. Upon cluster start, barely one or two pods manage to start - overall, 28 cores mostly idle, 600GB of RAM mostly unused, 6x4TB SSDs in a pool doing almost nothing, dqlite not doing I/O other than to the UNIX socket, but still stuck consuming one core flat-out and not making many system calls.

If there were resource starvation for I/O in this state, I'd expect to see the process blocked in a system call, not pegged at 100% in the run state (unless there's some async check and spin-wait in there?).

But there is a hypervisor in the picture, which makes the concept of "resources" a bit tricky to track. I can't see anything interesting upstream yet - the whole host is pretty quiet.

There could be some weird resource synchronisation issue between nodes (e.g. Raft talks to all nodes and causes a synchronised run of a queue on all nodes, choking something underneath), but I can't see evidence of that yet either.

I'll keep looking.

jhughes2112 commented 1 year ago

We can get into the broken state at will by running a few too many workloads that consume a lot of async I/O. In particular, if we launch more than 11 pods of RedPanda (Kafka), our rather beefy cluster freezes up immediately as dqlite stops being able to communicate. This is with none of the RedPanda instances actually doing anything - just existing and presumably reserving AIO from the host (I assume; I don't know exactly how that works).

We have not upgraded to the latest MicroK8s, so this may have been addressed in the last few months.

JH


texaclou commented 1 year ago

@ktsakalozos The problem has also occurred on single-node clusters, where there is no replication latency. In our case it was every time a single node with a 10-core CPU. dqlite was stuck at 100% on only one core, so 90% of the resources remained for the workload and the other Kubernetes components. Moreover, the workload kept running, but pods could not be deleted or created, etc., and kubectl calls resulted in lots of timeouts.

If the problem was that the "datastore could not get the resources it needed at the specific time it requested them", why does the cluster never recover to a correct state, even after days with no change?

Did you succeed in reproducing it?

benbuf commented 1 year ago

Any updates on this issue? We are facing similar problems with our cluster (production environment).

JesperBerggren commented 1 year ago

We have a multi-node environment down (again) because of this issue. Thankfully not production this time.

Is there a viable workaround to get up and running again (without a complete reinstallation)? It sounds like there might be a way to recover by removing all nodes except one, doing a .reconfigure of dqlite, and then adding the nodes back in. If so, can anybody describe the steps in detail?

jhughes2112 commented 1 year ago

One thing we have done (and fully scripted at this point) is sshing into each box and doing a "microk8s stop && microk8s start" one at a time. If that doesn't work, remove all but one node, reconfigure it to be a single-node cluster, then add the others back in one by one. If it still fails, take down the node you kept once you're back to 4 nodes, so you don't lose HA. Recently our problems have been solved just by stopping and starting them all.
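
A minimal sketch of that kind of rolling restart, assuming SSH access to each node (node names are placeholders, and the real script has more error handling):

for n in node00 node01 node02 node03; do
  ssh "$n" 'microk8s stop && microk8s start'
  ssh "$n" 'microk8s status --wait-ready'   # block until this node reports ready before moving on
done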


benbuf commented 1 year ago

Stopping/starting did not solve our problems. We had to reinstall the whole cluster (with downtime!).

What I tried: removing all nodes except one, reconfiguring the database as described here: https://microk8s.io/docs/restore-quorum, then bringing the other nodes back in. This worked for 1-2 days, then the same problem came back: high CPU load on one node, so that the Kubernetes API was not working.

At the moment the cluster is working, but our team is investigating switching to Rancher's k3s.

The cluster we had problems with had been running since 2020 (version 1.19). We updated a few times, up to 1.26, and then the problems started. Could an update have caused the problems? Or an update without updating all addons?

berkayoz commented 1 year ago

Hey @benbuf

There are multiple reports on different versions, with different workloads and symptoms, so it is hard to say whether any specific addon or update could have caused the issue. We've added some optimizations in the Kine layer in 1.27 to lower the CPU utilization, as mentioned by @ktsakalozos.

@texaclou in the graph you shared there is a likeFunc that has been identified as one potential bottleneck. In the 1.27 release that came out today we have worked towards eliminating the use of this LIKE pattern-matching operation, so the performance we get from k8s-dqlite should be improved. This patch, however, is not available on pre-1.27 releases.

Could you kindly attach inspection reports if any are available? Would it also be possible for you to share what kind of workloads are running on the cluster, which addons are enabled, etc.?

Many thanks.

JesperBerggren commented 1 year ago

We never succeeded with restarts or reconfiguration of our test cluster either so we ended up with a reinstallation yet again.

We were running 1.27 without any MicroK8s addons enabled at all. We have a cluster with the official Helm charts for all of our required addons, e.g. coredns, metrics-server, istio, etc.

So I would say that 1.27 has not solved the problem.

berkayoz commented 1 year ago

Hey @JesperBerggren

The optimization in 1.27 mostly addresses general/overall CPU utilization being high, caused by the LIKE operator used for pattern matching.

Could you kindly share all of the running workloads and some more information about the cluster?

We are working on trying to replicate multiple environments and scenarios to reproduce this issue. Having detailed information on what kinds of workloads are running on the cluster, the hardware the cluster is running on, etc. is really valuable.

Many thanks.

benbuf commented 1 year ago

Hello @berkayoz. Sadly there are no inspection reports anymore because we used new server instances. This was the production environment, so I would not be able to share them anyway. For the update process we followed the update path (we did not skip a major version). The enabled addons were: dashboard, dns, ha-cluster, helm, helm3, hostpath-storage, ingress, metallb, metrics-server, rbac, storage. For the moment 1.27 is working, and the problems seem to be fixed by the complete re-installation.

texaclou commented 1 year ago

Hello. Here are all my clues, without proof.

Since April, only one of our two MicroK8s clusters has been failing, every month around the 4th/5th. Always the same one. The workloads are the same. Both clusters are single-node.

We often found the cluster had failed on a Monday morning, after 2 days with no deployments and no traffic (both clusters are test environments). Often the outage begins with multiple pending pods; some workloads keep trying to redeploy and blow up the number of pods (the way failed print jobs pile up and fill the print queue) - more than 110 pods, including pending and failed ones. But even after cleaning them up, the cluster does not return to a good state.

Since the other cluster is on a channel that no longer changes (1.25), I suspected the snap auto-updates. I recreated another cluster and tried to abuse it: overload, unexpected restarts, snap updates... of course it didn't fail.

So I recreated the failing cluster with MicroK8s updates blocked and put it into "internal production"; it's still working (happy 1st month!).

I noticed recently that the PowerEdge which hosts the cluster has very slow disk access (~30 MB/s), because of a RAID 1 on a PERC 350, which has no cache (it's not really a PERC). This sometimes causes significant I/O delays.

My guess is that a lack of I/O capacity triggers the failure, but then something keeps the load on the database looping and leaves it in a broken state.

I understand that there's not much more to say until you manage to reproduce the problem...

berkayoz commented 1 year ago

Thanks for the feedback @benbuf and @texaclou

We are investigating how resource constraints and load affect the dqlite cluster. Currently our focus is on storage: observing the characteristics of the k8s-dqlite service with a constrained number of available AIO handles, disk speed, IOPS, etc.

We are also trying to find patterns in the failing deployments - workloads, hardware, etc. - that might give us hints about the source of the problem. This is why information on the clusters that have failed is so valuable to us, and we appreciate everyone taking the time to share it and help us in this regard.

There is some progress on this, which we will share here after some more verification and testing. It might help with some of the reported cases, but keep in mind that it might not be a silver bullet, as this issue has appeared in various environments with no consistent way of reproducing it. We will continue our efforts on addressing and fixing all of the reported cases/issues.

Many thanks.

jgc234 commented 1 year ago

A little more info on my scenarios - After not being able to recover my 4-node cluster (see above), I rebuilt the cluster as single node native on the host, but still using dqlite. I managed to get microk8s into the same jammed state pretty quickly by kicking off a manual TRIM on an SSD that was hosting root and /var/snap. Unrecoverable, even with no other workloads on the host, and fiddling for a few days to try and get it to start. I rebuilt the single-node system again using etcd.

I was looking at TRIM to see how it impacts performance. On my persistent volumes (which are all ZFS), the autotrim is a bit smoother than manual trim, which is a bit lumpy and impactful. I've tried many design variations for the past year or so (nodes in VMs, native nodes on hardware, large disk pools underneath (both HDD and SSD), large hosts that are overkill for the workload, different OS versions, etc). They all died the same way, but some took a few months to degenerate. I haven't found the common factor yet.

berkayoz commented 1 year ago

Hey everyone,

We have been able to reproduce this issue by exhausting the available async I/O (AIO) requests, whose limit is determined by fs.aio-max-nr. Thanks to @jhughes2112 for pointing the workload and scenario out.

For anyone that encounters this issue, could you kindly check the fs.aio-nr and fs.aio-max-nr parameters and post them here along with the inspection report? You can check these parameters by running

sudo sysctl fs.aio-nr
sudo sysctl fs.aio-max-nr

fs.aio-nr returns the number of async I/O requests currently in use, and fs.aio-max-nr returns the maximum available.

Async I/O exhaustion leads to the CPU (a single core) spiking up to 100% and to the unresponsive state of the affected node.

Recovering the node in this situation involves the steps listed below.

  1. Check if fs.aio-nr is close to fs.aio-max-nr by running the commands below.
    sudo sysctl fs.aio-nr
    sudo sysctl fs.aio-max-nr
  2. If the number is close to/over the limit, increase the limit of available async I/O requests by running sudo sysctl fs.aio-max-nr=3145728. You can adjust the limit to be more than what is currently set on your system; the default for this limit is 65536. (A sketch for making this change persistent follows this list.)
  3. Restart the k8s-dqlite service by running sudo systemctl restart snap.microk8s.daemon-k8s-dqlite.service, which should recover the node, and dqlite should start responding again.
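
To make the increased limit survive a reboot, the setting can also be written to a sysctl drop-in file. This is only a minimal sketch, assuming a standard sysctl.d layout; the file name is a placeholder:

echo "fs.aio-max-nr = 3145728" | sudo tee /etc/sysctl.d/99-aio.conf
sudo sysctl --system          # reload all sysctl configuration files
sudo sysctl fs.aio-max-nr     # verify the new limit is active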

A good way of avoiding an issue like this would be to taint the control plane nodes and avoid running storage-heavy workloads alongside the control plane, for example as sketched below.
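
As an illustration only (the node name is a placeholder and the taint key is just the standard Kubernetes control-plane convention, not something this thread prescribes):

# Keep general workloads off the control plane node; only pods that
# explicitly tolerate this taint will be scheduled there.
microk8s kubectl taint nodes <control-plane-node> node-role.kubernetes.io/control-plane=:NoSchedule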

Keep in mind this might not be the silver bullet for everyone that encounters this issue, as there are different environments with different workloads. However, we believe we are progressing on the right path and currently focusing our efforts on simulating different storage scenarios to see what else can trigger this issue.

Many thanks.

kuroyi commented 1 year ago

I just had this occur on my 4 node microk8s HA bare metal cluster 2 days ago. It had been running fine for over 220 days. I tried all of the workarounds mentioned except removing all nodes but one. When I restart k8s-dqlite on one node, another node starts using 100% CPU. The highest fs.aio-nr on my nodes was approximately 212, but I did increase the max as suggested anyway. I also set the containerd LimitNOFILE=1048576 instead of unlimited, although I'm not certain that is a concern.

With all my services broken I upgraded from microk8s 1.25 to 1.26 and then 1.27 and the 100% k8s-dqlite cpu usage remained, although I believe the extra headroom allowed my rook-ceph-orchestrator to fix itself after a lot of trial and error and leaving it running overnight. It seemed to be running ok this morning but now I'm getting this from everything: Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request

Attached is the inspection report I just took on the node that is currently at 100%: inspection-report-20230804_141023.tar.gz

berkayoz commented 1 year ago

Hey @kuroyi,

Sorry you are also experiencing this issue. The attached inspection report has /dev/sda2 255G 229G 24G 91% /, which raises a question about available disk space on the nodes. Are other nodes also close to the disk space limit? Having inspection reports for those nodes would be helpful as well.

Many thanks.

kuroyi commented 1 year ago

Sometime between Friday night and today my cluster unborked itself. It started responding normally to microk8s and kubectl commands, and k8s-dqlite is no longer at 100% CPU usage.

On Friday my theory was that 'something' was causing k8s-dqlite to churn, so I was attempting to lighten the load before finally giving up in frustration. Here is what I recall doing while it was broken (note that nothing had previously been installed or modified for several months):

- uninstalled stackgres via helm (I had installed it to try)
- attempted to disable observability (appears to have failed, as it's currently listed as enabled)
- installed linux kernel 6.2 on one of the nodes and rebooted it

Today, after I checked and noticed that the cluster was responsive, I freed up some space on the node that was at 91% and reduced it to 80% usage. The cluster was responsive but rook-ceph was unable to Terminate or properly start pods and my services were still broken; it said something about 'connection is Unauthorized'. I noticed that the rook-ceph pods were working on the node I had upgraded to kernel 6.2, so I decided to upgrade the rest of the nodes to kernel 6.2 and rebooted them one by one. As each one came back online, rook-ceph-operator was able to Terminate and restart the pods and get my storage layer available again. Then I was able to restart everything else that relied on it.

Hopefully this information will help someone else. I can still provide the inspection reports if they would be useful.

lloydjatkinson commented 10 months ago

This is the first time I'm trying microk8s and I'm already seeing unreasonably high CPU usage on a low-spec VM with just the default setup - I haven't even set up any containers. Given microk8s is supposed to be lightweight, this isn't great. Following other advice I've seen, I disabled HA mode (I don't need it on this single-node setup), which in turn disables dqlite in favour of etcd.

[screenshot: CPU usage graph, before and after disabling HA/dqlite]

The left side is with HA and dqlite and the right is with HA disabled and etcd. This still feels like too much as it ranges anywhere from 3% to 15% doing basically nothing useful, but it is at least not as noisy as it is with HA/dqlite.

lloydjatkinson commented 10 months ago

Comparing microk8s to k3s (again idle, not doing anything) shows that microk8s uses more CPU at idle than other solutions. k3s is much less noisy.

[screenshot: CPU usage comparison between microk8s and k3s at idle]

djjudas21 commented 10 months ago

Thanks for the data points @lloydjatkinson. I'm running a MicroK8s HA cluster with dqlite and it does continue to use quite a lot of CPU. Fortunately my hardware is sufficiently powerful that it's not problematic. I know MicroK8s have discussed using etcd for HA deployments, but I don't know if they committed to going down that route. At the moment I'm too lazy to migrate, but at my next refresh I'll probably try k3s.

ole1986 commented 8 months ago

We also encountered the problem after removing all nodes from the cluster. I assume it was also there before we reduced it to a single node. But I am wondering why dqlite is still running at 100% CPU even though there is no other node involved (cluster.yaml contains only one address).

Btw, I know it's all free and open source, but wouldn't it be nice to have some more details about the dqlite database and the "kine" table, including its columns?

A ".reconfigure" did not resolve the issue.

miro-balaz commented 8 months ago

As @ole1986 mentioned, it would be good to have some more visibility into the raft log or dqlite. I just encountered this problem for the second time in a month. I was running 1.21. After upgrading and removing nodes, the one-node cluster seems to have recovered all functionality, but the CPU usage of dqlite remained high. There is 8 MB of data every five minutes in segments, which seems like too much when nothing is being done with the server.

ole1986 commented 8 months ago

@miro-balaz Maybe the LIBRAFT_TRACE=1 and LIBDQLITE_TRACE=1 (or 2, 3, 4) env variables may help you for debugging purposes, e.g. as sketched below.
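
For illustration, one possible way to wire these variables into the k8s-dqlite snap service is a systemd drop-in; this is an assumption about the setup, not an officially documented procedure:

# Create a drop-in override for the k8s-dqlite unit
sudo systemctl edit snap.microk8s.daemon-k8s-dqlite.service
# In the editor, add:
#   [Service]
#   Environment=LIBRAFT_TRACE=1
#   Environment=LIBDQLITE_TRACE=1
sudo systemctl restart snap.microk8s.daemon-k8s-dqlite.service
# Watch the extra trace output
journalctl -f -u snap.microk8s.daemon-k8s-dqlite.service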

ktsakalozos commented 8 months ago

Hi @miro-balaz, good to hear the issue you are seeing got resolved. You mention you disabled NodeExporter, dashboard-metrics-scraper and dashboard and you suspect outdated components were causing the high load. Could you please share the versions of the components you were using? We would be interested in reproducing such CPU spikes and if possible guard against them. Thanks.

ole1986 commented 8 months ago

I am almost at the stage of giving up on the cluster, so I have tried almost everything with the dqlite database.

Not sure if this is helpful to someone. But I completely removed all segments and snapshots, open-x and metadata-x files except ONE snapshot file.

Then I restarted the dqlite service and could still see all my namespaces, services, pods (in whatever state they are).

But still the CPU is 100%.

Snapshots seem to be compressed with lz4, so I am uncompressing one with unlz4 and trying to figure out what is inside.

Since the database is so busy, I assume there is little chance of successfully removing deployments, helm releases or anything related to them, which makes it unusable at this point.

miro-balaz commented 8 months ago

Hi @ktsakalozos, it was a false finding; the CPU of dqlite in htop is at 100% again now, but the cluster works with all the workload, so there is some progress. I didn't notice when leadership moved to a different node. I checked all nodes, but it was not well visible on a node with 4 CPUs. I deleted the message.

I used the versions that come with microk8s v1.20 or v1.21.

miro-balaz commented 8 months ago

I managed to open the snapshot in sqlite. It looks like the database could use some compaction; I am trying to understand how compaction works in kine.

@ktsakalozos So far I managed to find a highly non-optimal SQL statement; I tested it in sqlite3. I am talking about listSQL from generic.go. It is faster to do it using a nested query than a join.

SELECT * FROM kine AS kv WHERE id = (SELECT MAX(id) FROM kine AS kv2 WHERE kv2.name = <name>) ...; I have to think about how to make it work for ranges; probably we have to go deeper and select the distinct names from the range first.

It works identically for cases where there are not many rows with the same name, but I have one name 30000 times in there and it would not even finish within a few seconds; for example, '/registry/leases/..' is a name that appears many times.
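
For illustration, one way the nested-query idea might be extended to a range is a correlated subquery that keeps only the newest row per name. This is just a sketch against the kine schema as described in this thread, not the actual kine query:

-- Latest revision of every key under a prefix (one row per name)
SELECT *
FROM kine AS kv
WHERE kv.name LIKE '/registry/leases/%'
  AND kv.id = (SELECT MAX(kv2.id) FROM kine AS kv2 WHERE kv2.name = kv.name);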

ole1986 commented 8 months ago

@miro-balaz any chance you can give me a short guide on how you successfully opened the snapshot file with sqlite?

miro-balaz commented 8 months ago

@ole1986 unlz4 it as you suggested and then remove the first 40 bytes: tail -c +41 nodbfile > dbfile
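
Put together, the sequence might look roughly like this (the file names are placeholders, and the 40-byte header offset is taken from the comments above rather than verified independently):

# Decompress the dqlite snapshot (placeholder file names)
unlz4 snapshot-file snapshot-raw
# Strip the 40-byte header so sqlite can read the rest as a database file
tail -c +41 snapshot-raw > snapshot.db
# Inspect it with the sqlite3 shell
sqlite3 snapshot.db 'SELECT COUNT(*) FROM kine;'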

miro-balaz commented 8 months ago

It looks like I finally solved this problem. First I stopped kubelite: snap stop microk8s.daemon-kubelite. (The workloads keep running just fine, even better since kubelite is no longer shutting them down.) Then I set debug=true for k8s-dqlite and restarted it using snap restart microk8s.daemon-k8s-dqlite. Then in journalctl -f -u snap.microk8s.daemon-k8s-dqlite I saw it was deleting rows, so I assumed it was doing compaction. In my version, kine is most likely 0.4.1, which deletes row by row; this kine version is used by all newer k8s-dqlite versions except master, which includes kine in-tree. But it was stuck deleting the entries of which there are circa 30000, namely /registry/masterleases/... It is only a hypothesis that having a lot of operations on the same key is slower, or that the SQL used in compaction is non-optimal in that case; I do not have data or a test to prove it, but I imagine it can play a role. So I deleted those records manually using the dqlite client connected to the database. Then, after a few restarts and minutes, CPU is around 50% and the log also contains different messages, not just QUERY/DELETE with '/registry/masterleases/...'. Sorry for not graphing the data, but after that the compaction progress was faster as well.

This is the summary of most common rows

This was probably related to flapping that occurred when there was some infrastructure error.
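
As a rough sketch, the recovery sequence described above boils down to the commands below (how exactly the debug flag is enabled for k8s-dqlite may differ between MicroK8s versions, so that step is only a comment):

# Stop the API server components so only dqlite is doing work
sudo snap stop microk8s.daemon-kubelite
# (optionally enable debug output for k8s-dqlite here, e.g. in its args/env)
# Restart dqlite so it can work through the compaction backlog
sudo snap restart microk8s.daemon-k8s-dqlite
# Watch the logs for QUERY/DELETE compaction activity
journalctl -f -u snap.microk8s.daemon-k8s-dqlite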

miro-balaz commented 8 months ago

@ole1986 I have another suggestion: you can connect to dqlite while the cluster is running. Just use your node's IP, because it does not listen on localhost.

/snap/microk8s/current/bin/dqlite -s <ip>:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s

and check if there is too much to compact. This is an example with results from my cluster; the first value is too big, and 999 is the minimum possible value.

dqlite> select TIME(),max(id)-(SELECT crkv.prev_revision FROM kine crkv WHERE crkv.name = 'compact_rev_key') from kine;
01:13:12|105311
dqlite> select TIME(),max(id)-(SELECT crkv.prev_revision FROM kine crkv WHERE crkv.name = 'compact_rev_key') from kine;
08:45:46|999

ole1986 commented 8 months ago

@miro-balaz thank you. We already gave up on our cluster. But I could get into the database through the uncompressed snapshot file, and I also noticed redundant registry entries where often only a single byte in the value/old_value column changes for the same name.

Example for /registry/masterleases/10.251.6.106 (first and second occurrence)

[screenshots: first and second row for this key]
-- SQL Statement
SELECT name, COUNT(*) FROM kine WHERE name LIKE '/registry/masterleases/%' GROUP BY name;
-- RESULT
name                                    COUNT(*)
/registry/masterleases/10.251.6.106 3417
/registry/masterleases/10.251.6.91  1055
/registry/masterleases/10.251.6.92  172
/registry/masterleases/10.251.6.93  922
/registry/masterleases/10.251.6.94  4015
/registry/masterleases/10.251.6.95  1425
-- SQL Statement
SELECT name, COUNT(*) FROM kine WHERE name LIKE '/registry/leases/%' GROUP BY name;
-- RESULT
name                                                                    COUNT(*)
/registry/leases/default/external-attacher-leader-rbd-csi-ceph-com      842
/registry/leases/default/external-resizer-rbd-csi-ceph-com              1046
/registry/leases/default/external-snapshotter-leader-rbd-csi-ceph-com   783
/registry/leases/default/rbd-csi-ceph-com                               977
/registry/leases/default/rbd.csi.ceph.com-default                       820
/registry/leases/kube-node-lease/xxxxx01.xxxxx.local                    830
/registry/leases/kube-node-lease/xxxxx02.xxxx.local                     118
/registry/leases/kube-node-lease/xxxxx03.xxxx.local                     442
/registry/leases/kube-node-lease/xxxxx04.xxxx.local                     2955
/registry/leases/kube-node-lease/xxxxx05.xxxx.local                     954
/registry/leases/kube-node-lease/xxxxx06.xxxx.local                     2364
/registry/leases/kube-system/apiserver-4hmltwx2ygxotcrsm42ilrqwje       926
/registry/leases/kube-system/apiserver-auwbsyd25mw2gwj6qfwwlx7gja       1401
/registry/leases/kube-system/apiserver-cyvthvhws355b6yxwgf5iwn3dm       2810
/registry/leases/kube-system/apiserver-gxnzhfuf6xkdgdo54typdcalna       1030
/registry/leases/kube-system/apiserver-nmgvfj7tz3ojxk35fkkajlj7ga       176
/registry/leases/kube-system/apiserver-v43ksimizpq43irz6v7ilm5niq       4825
/registry/leases/kube-system/kube-controller-manager                    6171
/registry/leases/kube-system/kube-scheduler                             6929
/registry/leases/kube-system/nfs-csi-k8s-io                             500

BTW: how do you know which records to keep while deleting? Simply the one with the highest id (primary key)? I am also concerned about the created flag, which is sometimes set to 1 but mostly set to 0 for the same name.

IMO there should be some cleanup procedure or at least a quick guide on how to manually apply compaction or maintain the database.
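
If keeping only the row with the highest id per name is indeed the right rule (this thread suggests it but does not confirm it), a manual cleanup for one prefix might look like the sketch below. It is only an illustration of the idea, not an official compaction procedure; take a backup first, and note that kine's real compaction also tracks compact_rev_key and deletion markers:

-- Keep only the newest row per /registry/masterleases/* key (illustrative only)
DELETE FROM kine
WHERE name LIKE '/registry/masterleases/%'
  AND id NOT IN (
    SELECT MAX(id) FROM kine
    WHERE name LIKE '/registry/masterleases/%'
    GROUP BY name
  );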

snydergd commented 7 months ago

I read through this thread looking for a solution and wondered if I could/should use etcd with microk8s instead, and whether that's supported. I found that I could disable the ha-cluster addon with microk8s disable ha-cluster after backing up my config/PVs. Obviously dqlite will be preferable/needed for production/multi-node clusters, but for my homelab with legacy hardware, I think this is a good alternative.

Sharing in case others are in the same spot reading through.
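
A minimal sketch of the sequence described above, assuming a single-node cluster and that the dqlite data lives under the usual /var/snap/microk8s/current/var/kubernetes/backend path (back up your manifests and PV data separately):

# Back up the dqlite datastore before switching away from it
sudo microk8s stop
sudo tar czf ~/microk8s-backend-backup.tar.gz /var/snap/microk8s/current/var/kubernetes/backend
sudo microk8s start
# Disable HA clustering, which switches the datastore away from dqlite
microk8s disable ha-cluster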

david-streamlio commented 7 months ago

@snydergd I have come to the same conclusion myself, and have therefore started the process of using etcd with microk8s instead of k8s-dqlite. I have documented the process thus far in this repo, in order to share my experience with others who decide to go down this path as well.

miro-balaz commented 7 months ago

@david-streamlio @snydergd I think it would be enough to compile k8s-dqlite with a new version of kine, or to bump the kine version for older releases of k8s-dqlite and create new releases.

david-streamlio commented 7 months ago

@miro-balaz Thanks for the suggestion, but I am not familiar enough with the snap package manager to figure out how to make microk8s use the newly compiled version of kine. It would be a lot easier if the newer version of the k8s-dqlite was available on a snap release branch, edge perhaps?

Secondly, how do I know that the new version of the k8s-dqlite resolves this issue?

miro-balaz commented 7 months ago

I have encountered this bug more than once after running out of disk space, and in two cases I was able to examine the database and it contained overwhelmingly large gaps. The reason for the gaps might be that the sqlite autoincrement also increments on a failed insert. That means that in an out-of-disk-space condition the autoincrement can still be incremented, so when conditions recover, a record with a very high id is inserted. This causes kine to insert special records to fill all the gaps, but those records take a lot of space and inserting them is time consuming. After that the database is slower and the compaction algorithm is too slow to keep up. After a certain period of time the database is too slow, certain critical actions start to time out, and it cascades into an unusable cluster. The new kine has a faster compaction algorithm that will mitigate this scenario. I am not sure if gaps are also deleted during compaction, but they can be deleted manually.

I will try to provide a Dockerfile with k8s-dqlite compiled against a higher version of kine.
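
A quick, purely illustrative way to gauge whether the id space has the large gaps described above (assuming the kine schema discussed earlier in the thread) is to compare the row count with the id span:

-- If id_span is much larger than row_count, the table contains large gaps
SELECT COUNT(*) AS row_count,
       MAX(id) - MIN(id) + 1 AS id_span
FROM kine;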

snydergd commented 7 months ago

@miro-balaz Thank you for this great analysis! My problem is consistent with what you said because it began after my disk was nearly full (all my pods were being evicted).

I have a backup of the dqlite folder, so I will probably try this later today, or as soon as I can get k8s-dqlite built with a newer kine and installed/running.

snydergd commented 7 months ago

@miro-balaz If you happen to have time, I would be very grateful for a Dockerfile or forks of the repos with updated kine working. I think I managed to get the latest version of kine (from github.com/k3s-io/kine) working with k8s-dqlite (main challenge was that k8s-dqlite was setting up go-dqlite's "app" to create/register the driver with database/sql before starting kine, while kine was setting up its own -- and for some reason which I still don't understand, the "app" was registering "dqlite-" as the provider name instead of just "dqlite"). It was a lot harder than I thought, but in the end, running with debug was only showing some polling queries happening rather than the compaction queries from before, and cpu usage was way way lower. For some reason kubelite says it can't connect to it through the socket, though (complaints of EOF while reading, as well as some connection refused messages), so I think I may be missing something still. I'm happy to try other things or show what I've got if that's helpful -- would love to try it out and share my results here. I have a tarball of /var/snap/microk8s/current/var/kubernetes/backend/ in a broken state that I can use to try things out with.

rich0 commented 7 months ago

I ran into this issue, with my cluster becoming very unstable, and dqlite going to 100% on the leader node, and kubelite constantly crashing.

I ran the query: select TIME(),max(id)-(SELECT crkv.prev_revision FROM kine crkv WHERE crkv.name = 'compact_rev_key') from kine;

The result was 50-100k records and it was trending upwards slowly.

I did some of the dqlite recovery suggestions from the blog post (copying the files from the most current dqlite node to the others) and that didn't seem to help.

However, what did seem to help is stopping kubelite on all nodes, and then just leaving dqlite running. The uncompacted record count gradually trended downwards. The dqlite process eventually dropped below 100% CPU and the NVMe went to 100%, so it became IO-bound instead of CPU-bound. The uncompacted record count dropped to a few thousand. For some reason the memory use on that process started climbing and I had to restart the unit when it got to 28GB of RAM. Then a new leader took over, and went to 100% CPU and the record count went all the way back up to 90k or so. However, it eventually dropped and seemed stable.

At this point I restarted kubelite and for the most part it ran. API calls (eg kubectl get nodes) returned results very quickly - previously it would take maybe 10+ seconds to return (not normal in this cluster).

However, I suspect there is some kind of damage within the database as API calls return data that appears incorrect. For example, I'm running rook, and it looks like all my OSDs are running (ceph status reports this). However, a dump of the OSD pods from kubectl lists many as terminating/pending. The processes are clearly running.

I might play around with this a bit more to see if I can resurrect the cluster. If not I'll try to back up the rook state and probably will follow the rook disaster recovery and try to get my OSDs working in k3s or something, and rebuild the cluster.

rich0 commented 7 months ago

A bit of an update: after rebooting a few nodes I now have a consistent API that looks normal (pod states reflect reality). However, if I start increasing the load on the cluster (especially with monitoring - prometheus/loki/etc) the dqlite leader saturates IO on its SSD and the record count starts creeping upwards again. Stopping kubelite on all nodes allows it to catch up. However, even with a fairly minimal load on the cluster dqlite is running near 100% CPU on the leader.

My theory is that there is something wrong with the database causing it to require more resources to do the same work over time. If it saturates IO or CPU then it turns into a positive feedback loop.