canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Nodes crashed. Kubelet keeps restarting. k8s-dqlite 100% cpu. node.kubernetes.io/unreachable:NoSchedule taint #3859

Closed: Dunge closed this issue 1 year ago

Dunge commented 1 year ago

Summary

Ubuntu 20.04.5 LTS, microk8s 1.26.1 installed via snap, 3-node HA cluster

I came back one morning and my cluster that was working fine the previous day was completely down. Unlike my previous similar experience, this time the disks still have space remaining.

microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.

At this point microk8s inspect would just freeze at the Gathering system information step after printing that the services were running.

I then launched microk8s start; nothing happened for a few minutes, then it exited with no message. status still says not running.

I tried to check the logs using sudo journalctl -u snap.microk8s.daemon-kubelite, but there's too much output and I couldn't find anything relevant.
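
In case it helps, one way to narrow the journal down (the time window and keyword filter here are just guesses at what to look for) is to restrict the range and grep for obvious failures:

    # Limit the kubelite journal to a recent window and grep for failure keywords
    sudo journalctl -u snap.microk8s.daemon-kubelite --since "30 min ago" --no-pager \
      | grep -iE "error|fatal|leaderelection"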

I then rebooted the nodes (sudo reboot) and they actually came back online with status Ready.

The conditions look good:

    Conditions:
    Type Status LastHeartbeatTime LastTransitionTime Reason Message

    NetworkUnavailable False Thu, 23 Feb 2023 04:36:12 +0000 Thu, 23 Feb 2023 04:36:12 +0000 CalicoIsUp Calico is running on this node
    MemoryPressure False Mon, 20 Mar 2023 20:38:48 +0000 Mon, 20 Mar 2023 20:10:26 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
    DiskPressure False Mon, 20 Mar 2023 20:38:48 +0000 Mon, 20 Mar 2023 20:10:26 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
    PIDPressure False Mon, 20 Mar 2023 20:38:48 +0000 Mon, 20 Mar 2023 20:10:26 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
    Ready True Mon, 20 Mar 2023 20:38:48 +0000 Mon, 20 Mar 2023 20:10:26 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled

But there is a taint preventing any workload from being scheduled on them:

    Taints: node.kubernetes.io/unreachable:NoSchedule
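
As far as I understand, this taint is added automatically by the node lifecycle controller when a node's Ready condition goes Unknown. A quick way to see which nodes still carry it, and to clear it by hand (node name below is a placeholder):

    # Show which nodes carry taints
    microk8s kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
    # Remove the stale taint manually if the node is actually reachable
    microk8s kubectl taint nodes <node-name> node.kubernetes.io/unreachable:NoSchedule-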

Events on pods:

    0/3 nodes are available: 3 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling…

The events show the kubelet service crashing in a loop:

    Warning InvalidDiskCapacity 9m38s kubelet invalid capacity 0 on image filesystem
    Normal NodeHasSufficientPID 9m38s kubelet Node jldocker-1 status is now: NodeHasSufficientPID
    Normal NodeAllocatableEnforced 9m38s kubelet Updated Node Allocatable limit across pods
    Normal NodeHasNoDiskPressure 9m38s kubelet Node jldocker-1 status is now: NodeHasNoDiskPressure
    Normal NodeHasSufficientMemory 9m38s kubelet Node jldocker-1 status is now: NodeHasSufficientMemory
    Normal Starting 9m38s kubelet Starting kubelet.
    Normal Starting 7m7s kubelet Starting kubelet.
    Warning InvalidDiskCapacity 7m7s kubelet invalid capacity 0 on image filesystem
    Normal NodeHasNoDiskPressure 7m6s kubelet Node jldocker-1 status is now: NodeHasNoDiskPressure
    Normal NodeAllocatableEnforced 7m6s kubelet Updated Node Allocatable limit across pods
    Normal NodeHasSufficientPID 7m6s kubelet Node jldocker-1 status is now: NodeHasSufficientPID
    Normal NodeHasSufficientMemory 7m6s kubelet Node jldocker-1 status is now: NodeHasSufficientMemory
    Normal Starting 2m57s kubelet Starting kubelet.
    Warning InvalidDiskCapacity 2m57s kubelet invalid capacity 0 on image filesystem
    Normal NodeHasSufficientMemory 2m57s kubelet Node jldocker-1 status is now: NodeHasSufficientMemory
    Normal NodeHasNoDiskPressure 2m57s kubelet Node jldocker-1 status is now: NodeHasNoDiskPressure
    Normal NodeHasSufficientPID 2m57s kubelet Node jldocker-1 status is now: NodeHasSufficientPID
    Normal NodeAllocatableEnforced 2m57s kubelet Updated Node Allocatable limit across pods

The "Invalid Disk Capacity" part seems to be normal, because the pods handling the disk CSI (Longhorn) are not running because of the node taint.

Any command I send to microk8s kubectl prints error messages about memcache.go:

    E0320 20:41:19.730002 50241 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E0320 20:41:19.733484 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E0320 20:41:19.735017 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E0320 20:41:19.738322 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E0320 20:41:20.827654 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
    E0320 20:41:20.830214 50241 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
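
I suspect these memcache.go errors are a side effect rather than the cause: they just mean the aggregated metrics.k8s.io API is unavailable because metrics-server isn't serving. One way to confirm (the label selector assumes the standard metrics-server manifest):

    # Shows Available=False while the backing metrics-server pods are down
    microk8s kubectl get apiservice v1beta1.metrics.k8s.io
    microk8s kubectl -n kube-system get pods -l k8s-app=metrics-server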

Any idea what is causing this? From my point of view, the nodes seem healthy: RAM, CPU, and disk are good, and networking between them is also functional.

Introspection Report

Here is my microk8s inspect tarball file hosted on my Google Drive: https://drive.google.com/file/d/1LEaCI3O2Y82OI4Q2quo-uIBD-zV8rZBA/view?usp=sharing

Dunge commented 1 year ago

Small follow up,

I did a microk8s stop on all 3 nodes because they were just crash looping with errors all the time. A few hours later I came back and tried to bring just the first node back up, and it doesn't come up.

$ microk8s status
microk8s is not running, try microk8s start
$ microk8s start
********************************** about 2-3 minutes before it responded **********************************
$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
**********************************  about 5 minutes before it responded **********************************
Inspecting dqlite
  Inspect dqlite

Building the report tarball
  Report tarball is at /var/snap/microk8s/4595/inspection-report-20230321_033014.tar.gz
$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
$ microk8s kubectl get nodes
E0321 03:32:43.569188  861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout
E0321 03:33:03.576104  861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:47630->127.0.0.1:16443: read: connection reset by peer
E0321 03:33:24.569670  861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:50388->127.0.0.1:16443: read: connection reset by peer
E0321 03:33:34.571132  861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout
E0321 03:33:45.981150  861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:54130->127.0.0.1:16443: read: connection reset by peer
error: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:54130->127.0.0.1:16443: read: connection reset by peer

I uploaded the new (much smaller) tarball here: https://drive.google.com/file/d/1AEA23MpRMeCeHkfFlh7t_jMOKLXkj4Mf/view?usp=sharing

It might be easier to diagnose. (EDIT: Do not analyze this one; it was taken when a single node was active and does not show the same issue.)

I see a lot of this error in the kubelite log, not sure what it could mean:

microk8s.daemon-kubelite[854750]: W0321 03:24:11.013071  854750 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {
microk8s.daemon-kubelite[854750]:   "Addr": "unix:///var/snap/microk8s/4595/var/kubernetes/backend/kine.sock:12379",
microk8s.daemon-kubelite[854750]:   "ServerName": "kine.sock",
microk8s.daemon-kubelite[854750]:   "Attributes": null,
microk8s.daemon-kubelite[854750]:   "BalancerAttributes": null,
microk8s.daemon-kubelite[854750]:   "Type": 0,
microk8s.daemon-kubelite[854750]:   "Metadata": null
microk8s.daemon-kubelite[854750]: }. Err: connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/4595/var/kubernetes/backend/kine.sock:12379: connect: connection refused"
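
That error means kubelite can't reach the kine socket that k8s-dqlite is supposed to expose (path taken from the log above). A quick sanity check would be:

    # Is anything listening on the kine unix socket?
    sudo ss -lxp | grep kine.sock
    # Is the k8s-dqlite daemon itself up?
    sudo systemctl status snap.microk8s.daemon-k8s-dqlite
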
Dunge commented 1 year ago

I realized it's normal that a single node doesn't work on its own (no quorum — dqlite needs a majority of the voters). As soon as I restarted node2 it reports "running" again. But I still have the taint (only on the second node?) and the memcache.go errors, and I still see Starting kubelet. events every few minutes.

Weird thing: node3 is even reported as "Ready" while microk8s is still stopped on it!

This looks extremely similar to #3204 and #3735, except I don't quite understand the conclusion other than "I'm moving to another system".

Dunge commented 1 year ago

From what I understand, the main issue is the kubelite processes restarting every few minutes. Looking at the logs, it seems to be related to leader election. Why would the cluster spend its time trying to elect a different leader and loop around that?

This is the relevant part of the log that seems to happen often. Something about a timeout updating a lock?

microk8s.daemon-kubelite[53319]: E0320 20:46:08.223079   53319 leaderelection.go:367] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=15s": context deadline exceeded
microk8s.daemon-kubelite[53319]: I0320 20:46:08.223143   53319 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
microk8s.daemon-kubelite[53319]: E0320 20:46:08.223210   53319 controllermanager.go:294] "leaderelection lost"
systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 10.
systemd[1]: Stopped Service for snap application microk8s.daemon-kubelite.
systemd[1]: Started Service for snap application microk8s.daemon-kubelite.
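
For context, the controller-manager exits on purpose when it can't renew its leader-election lease within the timeout, and systemd then restarts kubelite; the interesting question is why the API server/datastore is too slow to serve the lease update. The lease itself is an ordinary object and can be inspected:

    # Who holds the controller-manager lease and when was it last renewed?
    microk8s kubectl -n kube-system get lease kube-controller-manager -o yaml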

@MathieuBordere I see you helped a few similar situations in the past, any thoughts?

Dunge commented 1 year ago

I managed to remove the taint on node2 manually with kubectl taint nodes jldocker2 node.kubernetes.io/unreachable:NoSchedule-, didn't help.

Being absolutely out of ideas, I tried to turn the cluster back into a real single node with no HA.

At this point, microk8s status on node 2 says it is alone, without HA.

And yet, nothing works! So my problem is not because of the HA cluster?

kubelite actually stopped crash looping, but it didn't resolve anything. I can't manipulate any resource in the cluster; everything is frozen. Pods are stuck either creating or terminating and nothing I do changes their status (even those in kube-system). Every command is extremely slow to respond and most time out. There are no events. I tried to delete pods in longhorn-system and kube-system with --force to see if they would respawn, but they just remain there; even worse, some are still reported as running on nodes that don't exist anymore.

Maybe a little late, but I then realized k8s-dqlite is using 100% CPU:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1586219 root      20   0 2552.0m 974.7m  18.8m S 105.9  16.4   1:58.30 /snap/microk8s/4595/bin/k8s-dqlite --storage-dir=/var/snap/microk8s/4595/var/kubernetes/backend/ --listen=unix:///var/snap/microk8s/4595/var/kubernetes/backend/kine.sock:12379

Is that because the dqlite database is corrupted?
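
For anyone digging into this: the dqlite data lives in the backend directory shown in the log above, so its size and file count can at least be checked. I can't say whether a large or fragmented data directory alone explains the CPU usage:

    # How big is the dqlite backend directory, and how many files (raft segments, snapshots) does it hold?
    sudo du -sh /var/snap/microk8s/current/var/kubernetes/backend/
    sudo ls /var/snap/microk8s/current/var/kubernetes/backend/ | wc -l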

I tried to stop, run the "reconfigure" command from this article, and restart: https://microk8s.io/docs/restore-quorum. It didn't change anything.
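
As far as I understand that article, the procedure boils down to stopping MicroK8s on every node, editing cluster.yaml under the backend directory so that only the node(s) you trust are listed as voters, and starting again; if I remember the format right, the file looks roughly like this (addresses and IDs below are made up):

    $ cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml   # example values only
    - Address: 10.0.0.11:19001
      ID: 1234567890123456789
      Role: 0   # 0 = voter
    - Address: 10.0.0.12:19001
      ID: 9876543210987654321
      Role: 1   # 1 = stand-by (non-voter)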

PLEASE HELP!

balchua commented 1 year ago

@Dunge after turning the cluster into a single node, do you see k8s-dqlite crash looping? This process is not a pod but a systemd service, though.

Curious to know what was running in the cluster before it went down?

Dunge commented 1 year ago

No, k8s-dqlite is/was using 100% CPU, but not crash looping. What was crash looping was kubelite.

I'm currently in the process of trying to re-attach the nodes following the "Recover from lost quorum" article. I copied the files and am trying to issue the add-node/join commands now, but it takes like 10 minutes to respond. I guess I'll go with a microk8s reset soon and start from scratch, hoping my Longhorn data survives.
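
For the record, the re-attach itself is just the normal add-node/join dance (IP and token below are placeholders):

    # On a node that is already part of the cluster
    microk8s add-node
    # Then on the node being re-attached, run the printed command, e.g.:
    microk8s join 10.0.0.11:25000/<token-printed-by-add-node>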

My cluster has, as I mentioned, Longhorn as the volume CSI, and in the default namespace I have a RabbitMQ cluster, a Redis cluster (both from Bitnami Helm charts), an SFTP server, a NuGet server (liget), and 11 of my own dotnet services, none doing anything really special.

Dunge commented 1 year ago

After attempting to add a node back (it took forever), I can't even see the cluster status anymore..

$ microk8s status
Traceback (most recent call last):
  File "/snap/microk8s/4985/scripts/wrappers/status.py", line 209, in <module>
    enabled, disabled = get_status(available_addons, isReady)
  File "/snap/microk8s/4985/scripts/wrappers/common/utils.py", line 533, in get_status
    kube_output = kubectl_get("all,ingress")
  File "/snap/microk8s/4985/scripts/wrappers/common/utils.py", line 215, in kubectl_get
    return run(KUBECTL, "get", cmd, "--all-namespaces", die=False)
  File "/snap/microk8s/4985/scripts/wrappers/common/utils.py", line 77, in run
    result.check_returncode()
  File "/snap/microk8s/4985/usr/lib/python3.8/subprocess.py", line 448, in check_returncode
    raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('/snap/microk8s/4985/microk8s-kubectl.wrapper', 'get', 'all,ingress', '--all-namespaces')' returned non-zero exit status 1.

balchua commented 1 year ago

I normally do a sudo snap remove microk8s --purge followed by sudo snap install microk8s... instead of microk8s reset. The reset requires a running Kubernetes, as it interacts with the API server.
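
Concretely, something like this (the channel is just the one mentioned later in this thread):

    sudo snap remove microk8s --purge
    sudo snap install microk8s --classic --channel=1.26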

I am trying to understand what could be the reason the cluster went south. Does the node have enough memory/cpu to run your workloads plus microk8s?

Dunge commented 1 year ago

Thanks, I'll do that.

All three nodes had more than enough, and it ran fine for about 3 months. A single node does not have enough RAM for everything, but the workload didn't even start when it was running as a single node anyway (it could not mount volumes since Longhorn and CoreDNS were also down).

But since this was a long-term stability test to determine if the solution was stable enough for production, the result is obviously a failure.

Dunge commented 1 year ago

Yeah something is really wrong with my node2, I'm not even able to install it anymore!

I did a sudo snap remove microk8s --purge on all three with no problem, then installed it back successfully on nodes 1 and 3. But node2 (which currently sits at 0% CPU usage, has 6 GB of free memory and tons of free disk space, and has practically no other application installed or running on that Linux VM; it's practically empty) gives:

sudo snap install microk8s --classic --channel=1.26
error: cannot perform the following tasks:
- Start snap "microk8s" (4595) services (systemctl command [start snap.microk8s.daemon-containerd.service] failed with exit status 1: Job for snap.microk8s.daemon-containerd.service failed because a timeout was exceeded.
See "systemctl status snap.microk8s.daemon-containerd.service" and "journalctl -xeu snap.microk8s.daemon-containerd.service" for details.
)

What the hell? And of course the snap.microk8s.daemon-containerd.service log only says Failed with result 'timeout'.

Dunge commented 1 year ago

After issuing the install command 4 times, it finally went through?

I then had an issue where my CoreDNS service and Calico pods were all green, but name resolution of services inside the cluster would fail. I had to kubectl -n kube-system rollout restart deployment coredns to fix it. Weird.

Now I still have some issues spawning workloads, mostly when it comes to node2; sometimes it works, sometimes the pods just crash loop. I'm still unable to find out what is different about this node compared to the others and why it causes problems.

Dunge commented 1 year ago

It all now seems to be related to a CoreDNS issue. I tried to follow this guide to diagnose it: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

From some pods, contacting services works properly; from others it doesn't. It doesn't matter what node they run on. nslookup will time out on some while it works on others. /etc/resolv.conf points correctly to the IP of the CoreDNS service on all of them. But from some pods the request never reaches the CoreDNS pod (as seen by enabling logging in the CoreDNS ConfigMap). I can ping the pod directly.
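
The basic test from that guide, run from a throwaway pod (the busybox image/tag is just a common choice for this, nothing required by the guide):

    # Resolve an in-cluster name from an ad-hoc pod, then clean it up
    microk8s kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- \
      nslookup kubernetes.default.svc.cluster.local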

I tried to disable and re-enable the dns addon; it didn't help.

Following the guide above, I sometimes see some "SERVFAIL" logs. They mention checking ClusterRoles. The only difference I see is that the "node get" permission isn't in the coredns ClusterRole, but that's filled in by MicroK8s, so I doubt this is an issue.

I'm really stumped again. I don't know how I manage to get myself into these situations; this should "just work", but it doesn't.

balchua commented 1 year ago

Is it worth looking at the differences between the node 1 and node 3 systems and node2, knowing that both 1 & 3 are working? The fact that it took several attempts to install microk8s on node2 isn't a good sign.

Dunge commented 1 year ago

Yeah, I know, that bothers me too. They are all hosted in the same place (VMs under a VMware hypervisor), with the same specs, same OS, same CPU/RAM/disk usage, same software installed, and the same network. I did network and storage benchmarks and they are the same, and I don't see anything strange in the syslogs. I really can't understand why microk8s refused to install (timing out) after being purged, when the system was practically idle. They are not potatoes; they should have more than enough power for that.

In any case, after installation, recreating an HA cluster went fine, and I never saw the dqlite CPU issue again.

Recap:

balchua commented 1 year ago

@Dunge do you need an HA cluster? Just in case you don't, a node can join as a worker-only node by passing the --worker argument during join.

Without knowing what is bothering node2, I am afraid it may blow up the cluster again. Let's see.

Dunge commented 1 year ago

Yes, the whole sales pitch for why we spent so much development effort moving to Docker and k8s is stability and resilience, along with providing easy zero-downtime updates. Splitting the workload across multiple machines is secondary. Our application doesn't have that much processing to do or that many concurrent requests to process, but it just can't afford to go down.

Our main server is cloud hosted, so no problem there. But we still have a few customers who require an air-gapped system not exposed to the internet, with local data. So with microk8s, the goal was for them to just provide a few Linux machines, and we would handle installation and management.

The current cluster is our dev/staging environment, so it is not critical, and it was used as a test of whether microk8s is stable enough to use on customer networks. But with the results here of kubelite crashing and the dqlite CPU hog issue, I'm not confident at all.

It's pretty certain it's due to the host VM itself, but unless I have a way to prove it is flawed in some way, I can't confirm that.

And now I still have the CoreDNS issue to fix, which I don't even know where to start on. I have to get my staging environment back on track.

alexgorbatchev commented 1 year ago

I have the exact same issues. For no reason that I can tell, my 3-node cluster toppled over: same issues with dqlite CPU usage, the kube API is barely responsive, dropping connections, not finishing responses, etc. I upgraded microk8s from 1.24 to 1.25 with no observable change, and I'm still battling this.

Dunge commented 1 year ago

@alexgorbatchev glad to see I'm not alone. My experience was filled with so many different symptoms that it's hard to pinpoint whether it's all related or not, but based on your comment it seems very similar.

Personally, after restarting it all from scratch and hitting that new DNS issue mentioned above (on a practically empty cluster), I gave up and let it rest as-is while I concentrated my efforts on building my AWS EKS cluster (which works perfectly, btw). When I came back 2 weeks later I saw that it was still broken for the first week, and then it suddenly repaired itself with absolutely no manual intervention; every pod was running with no errors for the second week. So I reinstalled my own services' pods, and it is still running up to this day.

I absolutely cannot tell what caused it to go haywire, and I hope it won't happen again, but for now everything works fine again.

Dunge commented 1 year ago

Unfortunately, the problem never went away.

I feel this thread is overloaded with different symptoms of different problems and it's hard to maintain a clear picture. I believe it would be better if I open a new, more concise one.

My previous "dns not responding" issue was just the iptables reseting on one node after a reboot, not the main problem I'm having.

Currently on microk8s 1.27.1, Longhorn 1.4.1.

There are two principal symptoms. First, readiness probes fail for random services (even kube-system ones) maybe once every 6 hours (no specific time frame, pretty random) and come back okay straight afterward.

Second, one random node becomes NodeStatusUnknown about once per week and comes back a minute or two later, but this causes a lot of pods to try to terminate and migrate, makes the disk replicas move, and ends up disrupting the services.

There is absolutely no log in any pod or in the microk8s kubelite process before the event. I have tons of Grafana dashboards, including node process exporters and Kubernetes events, and I don't see any spikes or anything else weird. The only changes are one minute AFTER the event, when it tries to come back.
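
One thing worth watching while waiting for the next occurrence is the node heartbeat leases: as far as I know, the node controller marks a node Unknown when its kubelet stops renewing its lease (renewed roughly every 10 seconds by default), so a stale renewTime points at the node that missed heartbeats:

    # Per-node heartbeat leases; a stale RENEWED timestamp identifies the node that stopped reporting
    microk8s kubectl -n kube-node-lease get leases \
      -o custom-columns=NODE:.metadata.name,RENEWED:.spec.renewTime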

What are the possible reasons microk8s would mark a node as NodeStatusUnknown?

Dunge commented 1 year ago

Closed in favor of a new, more precise issue.