canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

apiserver timeouts on SSD #4163

Open jamesstidard opened 1 year ago

jamesstidard commented 1 year ago

Summary

The cluster I'm managing has gone into a bad state where pods are not being scheduled or terminated (other than forcibly). Kubernetes provides no Events pointing me to the problem, and the nodes appear available, so I ended up digging into log files to try to find the cause.

One thing I've noticed looking in /var/log/syslog is what appear to me to be long API calls from microk8s.daemon-kubelite. Specifically, one looks like the "Raspberry Pi with low disk performance" case on your troubleshooting page. However, these nodes are desktop-grade consumer hardware with SSDs attached.

I'm seeing "Object stored in database" transition times ranging from 0.8 to 32 seconds.
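
For reference, I've been pulling these entries out of syslog with nothing more than grep, e.g.:

# Show the most recent apiserver trace lines that report the store duration
grep "Object stored in database" /var/log/syslog | tail -n 20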

Drive Performance

Node 1
node@node1:~$ sudo hdparm -Tt /dev/sdc
/dev/sdc:
 Timing cached reads:   37266 MB in  1.98 seconds = 18806.90 MB/sec
 Timing buffered disk reads: 1572 MB in  3.00 seconds = 523.46 MB/sec
node@node1:~$ dd if=/dev/zero of=/tmp/test1.img bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.779965 s, 1.4 GB/s
Node 2
node@node2:~$ sudo hdparm -Tt /dev/sdc
/dev/sdc:
 Timing cached reads:   38138 MB in  1.98 seconds = 19248.73 MB/sec
 Timing buffered disk reads: 1522 MB in  3.00 seconds = 506.86 MB/sec
node@node2:~$ dd if=/dev/zero of=/tmp/test1.img bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.784317 s, 1.4 GB/s
Node 3
node@node3:~$ sudo hdparm -Tt /dev/sdc
/dev/sdc:
 Timing cached reads:   37240 MB in  1.98 seconds = 18795.10 MB/sec
 Timing buffered disk reads: 1606 MB in  3.00 seconds = 534.94 MB/sec
node@node3:~$ dd if=/dev/zero of=/tmp/test1.img bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.782774 s, 1.4 GB/s
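
Worth noting that dd against /tmp without a sync flag mostly exercises the page cache, whereas the datastore cares about fsync latency. A rough check of synchronous write speed (the target path below is just an example on the same disk as the MicroK8s data; adjust as needed):

# Force a flush to the device on every block, which is closer to the write
# pattern of a consensus-based datastore than a buffered dd
sudo dd if=/dev/zero of=/var/snap/microk8s/common/dd-sync-test.img bs=1M count=256 oflag=dsync
sudo rm /var/snap/microk8s/common/dd-sync-test.img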

Introspection Report

The report is currently hanging on "Inspect kubernetes cluster", so I'm unable to attach a tarball. Maybe it's a longer-running task than I expect (how long should this take?).

Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

Logs

I'm happy to provide some logs, such as /var/log/syslog, if helpful, though I'd need to be sure there's no sensitive data in them before uploading.

Thanks for any help in diagnosing this issue.

ktsakalozos commented 1 year ago

Thank you for reporting this @jamesstidard .

Would it be possible to share the logs of kubelite, k8s-dqlite, and the apiserver-kicker? Please review them so they do not reveal any sensitive data.

journalctl -n 1000  -u snap.microk8s.daemon-apiserver-kicker
journalctl -n 1000  -u snap.microk8s.daemon-kubelite
journalctl -n 1000  -u snap.microk8s.daemon-k8s-dqlite

I am looking for a hint as to what the services are doing during this kind of incident.
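
If it is easier, the output can also be bounded to the incident window and written straight to files for attaching (the timestamps below are only placeholders):

journalctl -u snap.microk8s.daemon-kubelite   --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" -o short-iso > kubelite.log
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" -o short-iso > dqlite.log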

jamesstidard commented 1 year ago

Hi @ktsakalozos

Very much appreciate the response.

I have 3 manager nodes, so I've included the output from all 3 nodes.

node1-apiserver-kicker.log node1-dqlite.log node1-kubelite.log

node2-apiserver-kicker.log node2-dqlite.log node2-kubelite.log

node3-apiserver-kicker.log node3-dqlite.log node3-kubelite.log

jamesstidard commented 1 year ago

I've restarted my master nodes, and I've been able to see some events.

9m57s       Normal    Starting                  node/node3    Starting kubelet.
9m57s       Normal    Starting                  node/node3
9m57s       Warning   InvalidDiskCapacity       node/node3    invalid capacity 0 on image filesystem
9m57s       Normal    NodeAllocatableEnforced   node/node3    Updated Node Allocatable limit across pods
9m57s       Normal    NodeHasSufficientMemory   node/node3    Node node3 status is now: NodeHasSufficientMemory
9m57s       Normal    NodeHasNoDiskPressure     node/node3    Node node3 status is now: NodeHasNoDiskPressure
9m57s       Normal    NodeHasSufficientPID      node/node3    Node node3 status is now: NodeHasSufficientPID
8m14s       Normal    Starting                  node/node2    Starting kubelet.
8m13s       Normal    Starting                  node/node2
8m14s       Warning   InvalidDiskCapacity       node/node2    invalid capacity 0 on image filesystem
8m14s       Normal    NodeHasSufficientMemory   node/node2    Node node2 status is now: NodeHasSufficientMemory
8m14s       Normal    NodeHasNoDiskPressure     node/node2    Node node2 status is now: NodeHasNoDiskPressure
8m14s       Normal    NodeHasSufficientPID      node/node2    Node node2 status is now: NodeHasSufficientPID
8m14s       Normal    NodeAllocatableEnforced   node/node2    Updated Node Allocatable limit across pods
6m48s       Normal    Starting                  node/node3    Starting kubelet.
6m47s       Normal    Starting                  node/node3
6m48s       Warning   InvalidDiskCapacity       node/node3    invalid capacity 0 on image filesystem
6m47s       Normal    NodeAllocatableEnforced   node/node3    Updated Node Allocatable limit across pods
6m47s       Normal    NodeHasSufficientMemory   node/node3    Node node3 status is now: NodeHasSufficientMemory
6m47s       Normal    NodeHasNoDiskPressure     node/node3    Node node3 status is now: NodeHasNoDiskPressure
6m47s       Normal    NodeHasSufficientPID      node/node3    Node node3 status is now: NodeHasSufficientPID
5m7s        Normal    Starting                  node/node2    Starting kubelet.
5m6s        Normal    Starting                  node/node2
5m7s        Warning   InvalidDiskCapacity       node/node2    invalid capacity 0 on image filesystem
5m7s        Normal    NodeAllocatableEnforced   node/node2    Updated Node Allocatable limit across pods
5m7s        Normal    NodeHasSufficientMemory   node/node2    Node node2 status is now: NodeHasSufficientMemory
5m7s        Normal    NodeHasNoDiskPressure     node/node2    Node node2 status is now: NodeHasNoDiskPressure
5m7s        Normal    NodeHasSufficientPID      node/node2    Node node2 status is now: NodeHasSufficientPID
4m59s       Warning   MissingClusterDNS         node/node2    kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
3m42s       Normal    Starting                  node/node3    Starting kubelet.
3m41s       Normal    Starting                  node/node3
3m42s       Warning   InvalidDiskCapacity       node/node3    invalid capacity 0 on image filesystem
3m42s       Normal    NodeAllocatableEnforced   node/node3    Updated Node Allocatable limit across pods
3m42s       Normal    NodeHasSufficientMemory   node/node3    Node node3 status is now: NodeHasSufficientMemory
3m42s       Normal    NodeHasNoDiskPressure     node/node3    Node node3 status is now: NodeHasNoDiskPressure
3m42s       Normal    NodeHasSufficientPID      node/node3    Node node3 status is now: NodeHasSufficientPID
114s        Normal    Starting                  node/node2    Starting kubelet.
113s        Normal    Starting                  node/node2
114s        Warning   InvalidDiskCapacity       node/node2    invalid capacity 0 on image filesystem
114s        Normal    NodeHasSufficientMemory   node/node2    Node node2 status is now: NodeHasSufficientMemory
114s        Normal    NodeHasNoDiskPressure     node/node2    Node node2 status is now: NodeHasNoDiskPressure
114s        Normal    NodeHasSufficientPID      node/node2    Node node2 status is now: NodeHasSufficientPID
114s        Normal    NodeAllocatableEnforced   node/node2    Updated Node Allocatable limit across pods
46s         Normal    Starting                  node/node3    Starting kubelet.
45s         Normal    Starting                  node/node3
46s         Warning   InvalidDiskCapacity       node/node3    invalid capacity 0 on image filesystem

So it looks like the kubelet service cannot start and is looping on all nodes. I've googled the invalid capacity 0 warning and it initially looks like a red herring (plus it's a warning, not an error). I've looked in the kubelite logs for the node (like the ones above), but I'm not familiar enough with them to spot the root cause (or to tell whether it's even logged there).
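
In case it is useful, these are the two generic checks I've been running to confirm whether kubelite itself is restarting rather than the node just re-registering (plain systemd tooling, nothing MicroK8s-specific assumed):

# Current state of the kubelite unit (active/failed, main PID, recent log lines)
systemctl status snap.microk8s.daemon-kubelite --no-pager
# Any panics or fatal errors from kubelite since the last boot
journalctl -u snap.microk8s.daemon-kubelite -b | grep -iE "panic|fatal" | tail -n 50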

Let me know if there's any more information I can provide, and thanks again for any assistance.

ktsakalozos commented 1 year ago

Let me briefly describe how the k8s datastore works; hopefully this will allow us to pinpoint the reason for such long delays in accessing it.

The datastore in k8s is based on a consensus protocol, Raft. In an HA deployment (i.e. more than 2 nodes) a leader node is elected and two more nodes become voters. There has to be quorum among the three nodes (leader and voters) for an entry to be persisted in the datastore. When a write request reaches the datastore it is propagated to the leader, and the leader waits until the majority of the nodes have persisted the request data. When a read request hits the datastore it is forwarded to the leader, and the leader responds to it.
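
To see which node currently holds which role, you can look at the dqlite cluster configuration MicroK8s keeps on disk; the path below is the usual snap location, adjust if your layout differs:

# Lists the datastore members (address, ID, role) as dqlite sees them
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml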

The reasons the datastore is slow can usually be traced to disk, network, or CPU contention. If we have eliminated disk as the issue, we can move on to network utilization. One suggestion is to stop all storage services running on the voters and the leader. For example, I think you are running openebs; is it possible openebs transfers large blocks, causing network hiccups? Can you try stopping openebs and/or moving it to worker nodes? Another approach is to have an external etcd on dedicated hardware to hold the k8s datastore.
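
As a rough sketch of the kind of test I mean, assuming openebs is deployed in the openebs namespace and you can tolerate its workloads being down for a while:

# Temporarily scale the openebs workloads to zero while we watch the datastore
microk8s kubectl -n openebs scale deployment --all --replicas=0
microk8s kubectl -n openebs scale statefulset --all --replicas=0
# Scale them back up to their previous replica counts once done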

jamesstidard commented 1 year ago

Thanks for the response. Unfortunately I'm not able to stop or move my openebs services, as I get timeout errors. I tried appending --timeout=120s to my kubectl commands, though it doesn't seem to change the timeout duration. Is there a way to increase the timeout so I can still manage the cluster during this period and start spinning down any running openebs services?

Thanks
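
(Side note: the per-command --timeout flag, e.g. on kubectl delete, is not the client request timeout; the global --request-timeout flag is, so something like the following may behave differently, though it won't help if the apiserver never responds at all.)

# --request-timeout is kubectl's global client-side timeout for a single API request
microk8s kubectl get pods -A --request-timeout=120s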

jtcressy commented 1 year ago

Jumping onto this thread, albeit with borderline anecdotal evidence. I have reason to believe I've encountered this same issue after running a dqlite HA microk8s cluster for about 4 months as a production home-lab cluster. The problem seemed to start a couple of days ago, with nodes randomly going NotReady and pods getting stuck across the whole cluster. The API server has been so slow to respond at all that I began attempting to recover the cluster.

Ultimately, what I tried was removing dqlite members with microk8s leave and snap remove --purge microk8s, reinstalling microk8s on each node, and then joining back to the cluster in a rolling fashion. It was able to maintain quorum the whole time, since shortly after adding a third node each time it re-enabled HA mode with 3 voting members. But the problems never went away and my cluster was still broken, with workloads unable to fully restore. What I noticed when restarting dqlite masters was that whichever node became the leader consistently saw constant 100% single-core usage from the dqlite process, even when I disabled all other microk8s services like kubelite. I was able to trigger leader elections to move the leader around my cluster, even adding more powerful hardware in case this was just a raft backlog, but even on my machines with the best single-core performance the issue didn't go away. I even removed all master nodes except one to reduce it to a single-node cluster, but the constant 100% single-core usage remained.
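
For anyone hitting the same symptom, this is roughly how I'd watch the datastore process specifically (process name as shipped in the MicroK8s snap; it may differ in other setups):

# Snapshot per-process CPU for the datastore; a steady 100% here with other
# cores idle matches the single-core saturation described above
top -b -n 1 -p "$(pgrep -f k8s-dqlite | head -n 1)"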

So as it stands my cluster is effectively dead and unrecoverable. I tried restoring earlier backups but they had no effect on the issue. My cluster version was 1.26.8 on the 1.26/stable channel. RIP little guy; I'm going to move forward with building a new cluster on a different k8s distribution and restoring my volumes from offsite backup. Not that it matters, but I'm probably never going to use microk8s again.

franco-martin commented 10 months ago

I found the same issue as @jtcressy, but restoring the backup worked just fine. https://github.com/canonical/microk8s/issues/3706