canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Unresponsive Cluster with Partial Service Availability #4307

Open johnjairus10 opened 1 year ago

johnjairus10 commented 1 year ago

Summary

I am facing an issue with my four-node MicroK8s cluster, which we use as a staging server. This morning the cluster became intermittently unresponsive: some exposed services are still online while others are not accessible. A similar issue was previously resolved by resetting MicroK8s, but that is no longer an option because the database now holds critical data.

Environment

What Should Happen Instead?

The cluster should remain consistently accessible, and all services should be operational without random outages.

Reproduction Steps

n/a

Introspection Report

microk8s inspect completes after an extended period. The report indicates:

Additional Observations

Can you suggest a fix?

Given my limited experience with MicroK8s, I am seeking assistance to identify and resolve this issue. It appears to involve connectivity problems with the Kubernetes API server. Suggestions on troubleshooting steps or configuration checks would be highly appreciated. Thank you!
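
For context, a minimal set of first checks for an unresponsive MicroK8s API server might look like the following; the journal unit names assume the standard snap layout and may differ on other installs.

# overall addon/HA status and node health as reported by the apiserver
microk8s status
microk8s kubectl get nodes -o wide

# recent errors from kubelite (the combined apiserver/controller/scheduler process)
journalctl -u snap.microk8s.daemon-kubelite --since "2 hours ago" --no-pager | tail -n 100

# recent errors from the dqlite datastore
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "2 hours ago" --no-pager | tail -n 100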

Are you interested in contributing with a fix?

No, I am currently not in a position to contribute a fix.

ktsakalozos commented 12 months ago

Hi @johnjairus10, I see in the logs of the k8s-dqlite service that queries to the k8s datastore are taking too long, e.g.:

Nov 17 06:37:07 gitlabrunner microk8s.daemon-k8s-dqlite[4060539]: time="2023-11-17T06:37:07Z" level=error msg="error in txn: query (try: 0): context deadline exceeded"

One reason this can happen is an I/O-intensive workload that leaves the k8s datastore starved for disk time. I also see you have Longhorn running. Would it be possible to stop Longhorn and see if the cluster recovers? Note that you would see the same behavior with other storage solutions for k8s; we have encountered it with Ceph as well. You can also consider dedicating nodes to host the k8s control plane.
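
For anyone trying to confirm the I/O-starvation hypothesis above, a rough sketch (assuming the standard snap unit names, the sysstat package for iostat, and Longhorn in its default longhorn-system namespace; the my-app/my-db names are placeholders):

# watch dqlite for slow or failed queries
journalctl -fu snap.microk8s.daemon-k8s-dqlite | grep -i "context deadline"

# check whether the disk hosting /var/snap/microk8s is saturated
iostat -x 1

# list Longhorn components, then scale the workloads that use Longhorn volumes to zero
microk8s kubectl -n longhorn-system get pods
microk8s kubectl -n my-app scale deployment/my-db --replicas=0   # placeholder names

If the "context deadline exceeded" errors stop once the I/O-heavy pods are gone, that points to datastore starvation rather than a networking problem.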

johnjairus10 commented 12 months ago

Hi there, thanks for the insights regarding the k8s-dqlite service issues. I understand that the slow query performance could be linked to I/O intensive workloads impacting the k8s datastore. I'll go ahead and temporarily disable Longhorn to see if this alleviates the problem and improves the performance of our Kubernetes cluster.

Also, I wanted to ask for your opinion on an alternative storage solution. Considering our cluster is mainly used for local testing, would you recommend using hostpath-storage instead of a more complex storage solution like Longhorn or Ceph?

Lastly, I appreciate your advice on setting up dedicated nodes for the Kubernetes control plane. It seems like a sensible approach to enhance cluster stability and performance. I'll start planning for this configuration change to ensure better resource allocation and management within our environment.

Thanks again for your help and recommendations @ktsakalozos !
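
For reference, a sketch of both changes discussed above, assuming the current MicroK8s addon and join syntax (verify against the docs for your channel):

# node-local storage for a test cluster; provisions PVs from a host directory,
# so it is simpler than Longhorn/Ceph but unreplicated
microk8s enable hostpath-storage

# dedicated control plane: keep the existing nodes as control plane and join
# new nodes as workers only, so they do not run the apiserver or dqlite
microk8s add-node                                  # on an existing control-plane node
microk8s join 10.0.0.1:25000/<token> --worker      # on the new worker node, using the printed token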

senasi commented 12 months ago

Hello. We have been experiencing the same problems for a few days on our staging/testing cluster, which has a similar configuration. It is a 4-node local cluster with HA enabled, originally on v1.28.3 when the problems appeared; I have since upgraded it to v1.28.4, which did not help.

Same error messages from dqlite:

Nov 22 10:37:49 pink microk8s.daemon-k8s-dqlite[24524]: time="2023-11-22T10:37:49Z" level=debug msg="GET /registry/health, rev=0 => rev=0, kv=false, err=query (try: 0): context deadline exceeded"

or from kubelite:

Nov 22 10:40:39 pink microk8s.daemon-kubelite[20207]: E1122 10:40:39.473824   20207 status.go:71] apiserver received an error that is not an metav1.Status: &status.Error{s:(*status.Status)(0xc0002d4948)}: rpc error: code = Unknown desc = query (try: 0): context deadline exceeded
Nov 22 10:40:39 pink microk8s.daemon-kubelite[20207]: E1122 10:40:39.473963   20207 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout

I'm sure this is not linked to I/O-intensive workloads, as this cluster (each node is bare metal with 2 CPUs / 16 cores and SSD disks) was carrying practically no load.

I have already tried the following procedures:

I will try to examine the kine/etcd wrapper today, which is the next component in line. In the meantime, I'd appreciate any advice.
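
In case it helps with debugging the datastore layer on an otherwise idle cluster, a few low-risk checks; the paths and unit names below are the usual snap defaults, so verify them on your installation:

# confirm HA state and which nodes are dqlite voters/standbys
microk8s status
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# look for dqlite leadership churn or timeouts around the failure window
journalctl -u snap.microk8s.daemon-k8s-dqlite --since "1 hour ago" | grep -iE "leader|term|deadline"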

stale[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.