canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Master node unresponsive after reboot #3204

Open streamliner18 opened 2 years ago

streamliner18 commented 2 years ago

Summary

I have a MicroK8s cluster deployed on x86 (Ubuntu 20.04, snap MicroK8s v1.24.0). I configured it with two slave nodes and they had been running correctly for weeks. However, as soon as I restarted the master node one day, the server stopped responding: kubectl is completely unresponsive, and microk8s start hangs at Started.. A look at the kubelite service log reveals a suspicious pattern:

Jun 05 00:30:35 kube1 microk8s.daemon-kubelite[22988]: W0605 00:30:35.724546   22988 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {unix:///var/snap/microk8s/3272/var/kubernetes/backend/kine.sock:12379 kine.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/3272/var/kubernetes/backend/kine.sock:12379: connect: connection refused". Reconnecting...
Jun 05 00:30:38 kube1 microk8s.daemon-kubelite[22988]: W0605 00:30:38.471001   22988 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {unix:///var/snap/microk8s/3272/var/kubernetes/backend/kine.sock:12379 kine.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/3272/var/kubernetes/backend/kine.sock:12379: connect: connection refused". Reconnecting...
Jun 05 00:30:39 kube1 microk8s.daemon-kubelite[22988]: W0605 00:30:39.138531   22988 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {unix:///var/snap/microk8s/3272/var/kubernetes/backend/kine.sock:12379 kine.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/3272/var/kubernetes/backend/kine.sock:12379: connect: connection refused". Reconnecting...

Port 12379 points at the etcd-compatible datastore socket (kine, backed by dqlite), but I couldn't find anything wrong with it.
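
A few checks worth running against the datastore backend (a minimal sketch assuming the default snap layout; the paths are the same ones that appear in the log above):

# Which MicroK8s daemons are actually up (kubelite hosts the API server).
snap services microk8s

# Tail kubelite for the kine.sock dial errors shown above.
sudo journalctl -u snap.microk8s.daemon-kubelite -n 100 --no-pager

# The socket the API server is dialing lives in the dqlite backend directory.
sudo ls -l /var/snap/microk8s/current/var/kubernetes/backend/

# cluster.yaml lists the dqlite members the datastore expects to reach for quorum.
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml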

What Should Happen Instead?

Kubernetes should start.

Reproduction Steps

I did not try to repro, but here is roughly what I did:

  1. Set up the cluster.
  2. Use the cluster normally for weeks.
  3. Reboot the master node.

Attached inspect report

Notably, "Inspecting Kubernetes" took almost 5 minutes to complete.

inspection-report-20220605_003610.tar.gz

Can you suggest a fix?

no

Are you interested in contributing with a fix?

no

streamliner18 commented 2 years ago

Out of urgency, and as a last-ditch attempt, I was actually able to work around the problem and restore the cluster, but I don't understand what's happening... Here are my steps:

Attached the post-restore tarball in case it provides more clues about what happened: inspection-report-20220605_014606.tar.gz

djjudas21 commented 1 year ago

Identical behaviour seen here on Ubuntu 22.04 LTS and MicroK8s v1.26.0 after an unclean shutdown. Mine was a 4-node cluster with HA mode but kube05 (where the inspection report was run) is the only running node+master at the moment.

inspection-report-20230120_195306.tar.gz

jonathan@kube05:~$ microk8s start
jonathan@kube05:~$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
jonathan@kube05:~$ journalctl -f -u snap.microk8s.daemon-kubelite
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]: W0120 20:00:15.237644    8558 logging.go:59] [core] [Channel #4 SubChannel #6] grpc: addrConn.createTransport failed to connect to {
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]:   "Addr": "unix:///var/snap/microk8s/4390/var/kubernetes/backend/kine.sock:12379",
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]:   "ServerName": "kine.sock",
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]:   "Attributes": null,
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]:   "BalancerAttributes": null,
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]:   "Type": 0,
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]:   "Metadata": null
Jan 20 20:00:15 kube05 microk8s.daemon-kubelite[8558]: }. Err: connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/4390/var/kubernetes/backend/kine.sock:12379: connect: connection refused"

I'm going to attempt the workaround in https://github.com/canonical/microk8s/issues/3204#issuecomment-1146721394 to restore the cluster

Marahin commented 1 year ago

@djjudas21 did you manage to get it working?

djjudas21 commented 1 year ago

@Marahin. Unfortunately I was never able to repair my cluster, so in the end I destroyed it, recreated new, and restored my PVCs from backup. It worked fine for 3 weeks but yesterday it broke again (#3735) and I'm trying to figure out how to fix it without having to restore from scratch again. I'm pretty sure it's related to dqlite quorum, but it's pretty bad that it's happened twice.

jhughes2112 commented 1 year ago

This has happened to me about 10 times over the past couple of years. I have a battery-backed server rack that covers only the control nodes, and I allow my worker nodes to power off when electricity is lost. It's common for an outage to last longer than the UPS can hold out. For some reason, one or two of my control nodes will regularly fail. HA shifts to a standby worker node and I have time to fully wipe and reinstall one of my control nodes, which then takes back control.

Yes, this sucks. A lot. I don't have any idea why it breaks, but any time you hard-power-off a box there's a 30-40% chance it won't come back up again.

Marahin commented 1 year ago

@djjudas21 @jhughes2112 I wasn't able to find the culprit or restore the cluster either. It actually froze when trying to add back the node that had been reformatted. Since I couldn't find any solution to the problem, I just migrated to k3s and everything has been working smoothly so far.

jhughes2112 commented 1 year ago

I managed to rescue a cluster that lost quorum tonight, which is why I was on this thread. If you lose HA status and need to recover from kubectl and microk8s hanging, just run microk8s stop, then follow the relevant parts of https://microk8s.io/docs/restore-quorum to remove the other HA nodes from microk8s' quorum list (in cluster.yaml) and push those changes to dqlite with the script a little further down that page. Seems like MicroK8s should be able to detect a loss of HA status and self-heal like this.
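
Rough shape of that procedure, as a sketch only (take the exact reconfigure invocation from the restore-quorum doc itself, since it depends on your snap revision):

# 1. Stop MicroK8s on every surviving node first.
microk8s stop

# 2. Back up the dqlite state before touching anything.
sudo tar -cvf backup.tar /var/snap/microk8s/current/var/kubernetes/backend

# 3. Edit cluster.yaml so it only lists the nodes that are still healthy (drop the lost voters).
sudo nano /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# 4. Run the dqlite reconfigure script from https://microk8s.io/docs/restore-quorum
#    to push the trimmed membership into the datastore, then bring the node back up.
microk8s start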

K3s is being used on another project at my company. That's the only other contender I'd consider tbh.

djjudas21 commented 1 year ago

Thanks @jhughes2112, I think I'd be more confident attempting to regain quorum next time. Unfortunately in #3735 I lost quorum in a pretty bad way and then made it worse because I didn't know how to tackle it (I tried to microk8s leave rather than microk8s stop on the other nodes).

My cluster has broken like this twice now, and I have never been able to figure out a root cause. The nodes did not lose power or network.

In my situation, I actually lost data because I was using OpenEBS cStor as hyperconverged storage, i.e. replicas of volumes are stored on local block devices across your nodes. Turns out the way cStor handles its own replica placement and failover is by relying on the kube API, so when MicroK8s loses quorum, so does your storage engine 🙈

I've had to start my cluster from scratch but I haven't found a way of adopting my existing cStor replicas into a new cluster, and OpenEBS "support" has been zero help. So I bought a NAS, and have restored most of my data from backup. At least I can re-adopt PVs from a NAS into a vanilla cluster. Lesson learned.
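
For anyone wanting to do the same, re-adopting NAS data into a fresh cluster only needs a statically defined PV pointing at the existing export, plus a claim bound to it by name. A minimal sketch with placeholder server, path and names (none of these values come from my actual setup):

# Hypothetical static NFS PersistentVolume pointing at data that already exists on the NAS.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restored-data               # placeholder name
spec:
  capacity:
    storage: 50Gi
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain   # keep the NAS data even if the claim is deleted
  nfs:
    server: 192.168.1.50            # placeholder NAS address
    path: /mnt/tank/restored-data   # placeholder export path
---
# Claim that binds to the PV above by name, so the new cluster re-adopts the old data.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
  namespace: default
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 50Gi
  volumeName: restored-data
  storageClassName: ""              # empty string opts out of dynamic provisioning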

At work I use Red Hat OpenShift as the Kubernetes of choice, which is obviously way overkill for a home setup. Today they asked me to look into Rancher and it actually looks pretty good. It uses k3s underneath but adds some nice features. Will definitely consider it at home next time I have to smash everything up and start again.

jhughes2112 commented 1 year ago

@djjudas21 I had a similar problem when I first started out using an older version of Rancher and Longhorn. Longhorn uses local disks, and I blew away nodes (not knowing what I was doing). Very frustrating. I threw FreeNAS on a separate box with a bunch of disks and use a package called Democratic CSI as my StorageClass that lets me serve NFS volumes--very handy when I need to mount them from a windows or linux box from anywhere, since NFS allows file sharing. Not production-worthy performance, but very convenient. I may need to look into Rancher again, now that k3s is out. Seems like a lot of us stub our toes the same way. ;-)

djjudas21 commented 1 year ago

@jhughes2112 yes I'm using TrueNAS with Democratic CSI too. The guy who maintains Democratic has been super helpful when I've had questions etc.

I just wish I'd learnt these hard lessons at work with customer data, rather than at home with my own data 😂

doctorpangloss commented 1 year ago

I am also experiencing this issue

ballerabdude commented 1 year ago

I had to deal with this yesterday. I lost power at my house and my networking gear restarted. When that happened, my cluster lost quorum. At the time I didn't know that, since nothing indicated it. My error was the same as everyone else's.

Jul 08 00:16:53 homeserver microk8s.daemon-kubelite[1914]: W0708 00:16:53.315347    1914 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {unix:///var/snap/microk8s/5137/var/kube>

When I began to debug the issue, I did see the documentation from MicroK8s about restoring quorum (https://microk8s.io/docs/restore-quorum). I didn't think it applied to me since I only had two master nodes, so I ignored it and just performed the backup part: tar -cvf backup.tar /var/snap/microk8s/current/var/kubernetes/backend.

Here is what my cluster yaml looked like on my main master (I call it my main master since I started my cluster with it):

- Address: X.X.X.20:19001
  ID: 3297041220608546238
  Role: 0
- Address: X.X.X.10:19001
  ID: 13154991381401386068
  Role: 2
- Address: X.X.X.239:19001
  ID: 4195465845258971620
  Role: 0

X.X.X.239 is my other master and X.X.X.10 is one of my worker-only nodes. I was surprised to see that it only showed one worker node, so I made a mental note of that. I think initially it was a master and I converted it to a worker.
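
For context on the Role field, my understanding (an assumption based on dqlite's node roles, worth verifying against the restore-quorum doc for your release) is:

# Role values in cluster.yaml, per dqlite's node roles (assumed mapping):
#   Role: 0  -> voter (counts toward quorum)
#   Role: 1  -> stand-by (replicates data, can be promoted to voter)
#   Role: 2  -> spare (does not replicate the datastore), e.g. X.X.X.10 above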

After several restart attempts failed, I decided to spin up a new VM to host another cluster, hoping that I could use the backup I created to somehow recover the X.X.X.20 master node.

When I tried to restore my backup.tar onto the new working cluster, I was able to reproduce the issue. This excited me because I knew the backup worked. So, I returned to my main master node and followed the instructions from the MicroK8s website on recovering quorum. I modified the cluster.yaml file to look like this:

- Address: X.X.X.20:19001
  ID: 3297041220608546238
  Role: 0

Then, I ran the reconfigure command. I did this because I didn't care about the other nodes on that list: all of my data was safe in a Ceph cluster hosted by my broken cluster, and none of the nodes on the list were storage nodes. At that time, all my worker nodes were off except for one (it was bare metal and I didn't feel like shutting it down, mostly because I forgot), so I started MicroK8s on the master node that I performed the fix on, and I've never been happier in my life. Kubernetes started, and when I ran microk8s kubectl get nodes, I saw all my nodes — even the worker node that was still on — in a Ready state. Then I turned on the rest of my worker nodes, my cluster was back to life, and all data was safe. As for the other master node, it is still shut down; I will eventually reset MicroK8s on it and add it back in as a new master node.

My lesson learned here is that I really need to buy a battery backup for my networking gear.

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

juxuny commented 1 month ago

I am also experiencing this issue. I am running MicroK8s 1.26 on my Ubuntu 22.04 desktop with ZFS. One day, my machine shut down suddenly while I was removing a 7 TiB file. Then I got this error:

May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]: W0509 17:24:08.703709    6446 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]:   "Addr": "unix:///var/snap/microk8s/6571/var/kubernetes/backend/kine.sock:12379",
May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]:   "ServerName": "kine.sock",
May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]:   "Attributes": null,
May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]:   "BalancerAttributes": null,
May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]:   "Type": 0,
May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]:   "Metadata": null
May 09 17:24:08 yuanjie-Super-Server microk8s.daemon-kubelite[6446]: }. Err: connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/6571/var/kubernetes/backend/kine.sock:12379: connect: connection refused"

jhughes2112 commented 1 month ago

Hey all. I abandoned microk8s after two years and tried k3s and had a much worse experience (there's a continuous increase in CPU that eventually chokes your control nodes).

What both of these packages have in common is Kine for the data store. I was heavily involved in trying to help diagnose a CPU spike problem on microk8s that had to do with Kine and the way it behaves when a node is using too much async IO. It basically fails and makes the control plane unresponsive.

Kine is unfixable. Unfixable. Drop it like it's hot. Unfixable. I switched to k0s about six months ago and it's been fine. Power fluctuations, random disconnections, etc., and I have not had to rebuild my cluster. (I'm not doing HA, so maybe I'm dodging a bullet there.) If you decide to stay on microk8s (which I loved), find a way to stop using Kine as the data store on the control plane. Less easy, but it will work. Etcd is battle tested. Gl:hf
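
Untested and not a documented MicroK8s switch, but since kubelite runs a standard kube-apiserver, pointing it at an external etcd would in principle look something like this (the endpoint and cert paths are placeholders, and the bundled kine/dqlite datastore would also have to be taken out of the picture):

# Hypothetical, untested sketch -- not a documented MicroK8s feature.
# kubelite's API server reads its flags from this file; the stock config points it
# at the local kine.sock (the socket shown in the errors above), so that line would
# need to be replaced rather than duplicated.
sudo nano /var/snap/microk8s/current/args/kube-apiserver
#   e.g. swap the kine.sock endpoint for an external etcd (placeholder values):
#   --etcd-servers=https://10.0.0.5:2379
#   --etcd-cafile=/path/to/etcd/ca.crt
#   --etcd-certfile=/path/to/etcd/client.crt
#   --etcd-keyfile=/path/to/etcd/client.key

# Restart so kubelite picks up the new flags.
microk8s stop
microk8s start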

juxuny commented 1 month ago

> Hey all. I abandoned microk8s after two years and tried k3s and had a much worse experience (there's a continuous increase in CPU that eventually chokes your control nodes).
>
> What both of these packages have in common is Kine for the data store. I was heavily involved in trying to help diagnose a CPU spike problem on microk8s that had to do with Kine and the way it behaves when a node is using too much async IO. It basically fails and makes the control plane unresponsive.
>
> Kine is unfixable. Unfixable. Drop it like it's hot. Unfixable. I switched to k0s about six months ago and it's been fine. Power fluctuations, random disconnections, etc., and I have not had to rebuild my cluster. (I'm not doing HA, so maybe I'm dodging a bullet there.) If you decide to stay on microk8s (which I loved), find a way to stop using Kine as the data store on the control plane. Less easy, but it will work. Etcd is battle tested. Gl:hf

Thanks for your reply. In the end I rebuilt my cluster. I think I should switch to k0s too.