Dunge closed this issue 1 year ago.
Small follow up: I did a microk8s stop on all 3 nodes because they were just crash looping with errors all the time. A few hours later I came back and tried to bring just the first node back up, and it doesn't come up.
$ microk8s status
microk8s is not running, try microk8s start
$ microk8s start
********************************** about 2-3 minutes before it responds **********************************
$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
********************************** about 5 minutes before it responds **********************************
Inspecting dqlite
Inspect dqlite
Building the report tarball
Report tarball is at /var/snap/microk8s/4595/inspection-report-20230321_033014.tar.gz
$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
$ microk8s kubectl get nodes
E0321 03:32:43.569188 861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout
E0321 03:33:03.576104 861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:47630->127.0.0.1:16443: read: connection reset by peer
E0321 03:33:24.569670 861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:50388->127.0.0.1:16443: read: connection reset by peer
E0321 03:33:34.571132 861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout
E0321 03:33:45.981150 861346 memcache.go:238] couldn't get current server API group list: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:54130->127.0.0.1:16443: read: connection reset by peer
error: Get "https://127.0.0.1:16443/api?timeout=32s": net/http: TLS handshake timeout - error from a previous attempt: read tcp 127.0.0.1:54130->127.0.0.1:16443: read: connection reset by peer
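For anyone hitting the same wall: the handshake timeouts only say the port answers but nothing healthy sits behind it. A couple of generic checks that might narrow it down (standard tools, nothing MicroK8s-specific):
$ sudo ss -ltnp | grep 16443                                   # is anything listening on the API server port?
$ sudo systemctl status snap.microk8s.daemon-kubelite --no-pager
$ sudo journalctl -u snap.microk8s.daemon-kubelite -n 100 --no-pager   # last 100 log lines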
I uploaded the new (much smaller) tarball here: https://drive.google.com/file/d/1AEA23MpRMeCeHkfFlh7t_jMOKLXkj4Mf/view?usp=sharing
Might be easier to diagnose. (EDIT: Do not analyze this one; it was taken when only a single node was active and it does not show the same issue.)
I see a lot of this error in the kubelite log, not sure what it could mean:
microk8s.daemon-kubelite[854750]: W0321 03:24:11.013071 854750 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {
microk8s.daemon-kubelite[854750]: "Addr": "unix:///var/snap/microk8s/4595/var/kubernetes/backend/kine.sock:12379",
microk8s.daemon-kubelite[854750]: "ServerName": "kine.sock",
microk8s.daemon-kubelite[854750]: "Attributes": null,
microk8s.daemon-kubelite[854750]: "BalancerAttributes": null,
microk8s.daemon-kubelite[854750]: "Type": 0,
microk8s.daemon-kubelite[854750]: "Metadata": null
microk8s.daemon-kubelite[854750]: }. Err: connection error: desc = "transport: Error while dialing dial unix /var/snap/microk8s/4595/var/kubernetes/backend/kine.sock:12379: connect: connection refused"
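That error means the API server inside kubelite cannot reach the dqlite datastore over the kine unix socket. A quick sanity check (paths taken from the log above) could be to confirm the k8s-dqlite daemon is up and the socket actually exists:
$ sudo systemctl status snap.microk8s.daemon-k8s-dqlite --no-pager
$ ls -l /var/snap/microk8s/4595/var/kubernetes/backend/kine.sock
$ sudo ss -lxp | grep kine                                     # anything listening on that unix socket?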
I realized it's normal that a one-node system doesn't work (no quorum). As soon as I restarted node2, it reports "running" again. But I still have the taint (only on the second node?), the memcache.go errors, and "Starting kubelet" events every few minutes.
Weird thing, even node3 is reported as "Ready" when microk8s is still stopped on it!
Looks extremely similar to #3204 and #3735, except I don't quite understand the conclusion other than "I'm moving to another system".
From what I understand, the main issue is the kubelite processes restarting every few minutes. Looking at the logs, it seems to be related to leader election. Why would the cluster spend its time trying to elect a different leader and loop around that?
This is the relevant part of the log that seems to happen often. Something about a timeout updating a lock?
microk8s.daemon-kubelite[53319]: E0320 20:46:08.223079 53319 leaderelection.go:367] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=15s": context deadline exceeded
microk8s.daemon-kubelite[53319]: I0320 20:46:08.223143 53319 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
microk8s.daemon-kubelite[53319]: E0320 20:46:08.223210 53319 controllermanager.go:294] "leaderelection lost"
systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 10.
systemd[1]: Stopped Service for snap application microk8s.daemon-kubelite.
systemd[1]: Started Service for snap application microk8s.daemon-kubelite.
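For what it's worth, the lease the controller-manager keeps failing to renew can be inspected directly; if the datastore/API server is slow, the renewTime in the lease will lag behind. A minimal check, assuming the API server responds at all:
$ microk8s kubectl -n kube-system get lease                                    # kube-controller-manager and kube-scheduler leases
$ microk8s kubectl -n kube-system get lease kube-controller-manager -o yaml    # look at spec.renewTime and spec.holderIdentity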
@MathieuBordere I see you helped a few similar situations in the past, any thoughts?
I managed to remove the taint on node2 manually with kubectl taint nodes jldocker2 node.kubernetes.io/unreachable:NoSchedule-, but it didn't help.
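In case it helps someone, taints on all nodes can be listed in one shot to confirm whether the removal stuck:
$ microk8s kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
$ microk8s kubectl describe node jldocker2 | grep -i -A3 taint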
Being absolutely out of ideas, I tried to turn the cluster back to a real single node with no HA:
- microk8s leave and microk8s stop on nodes 1 & 3 (node2 was the one it said had the latest datastore, so I left that one as the main).
- /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml on node2 now has only one entry.
- kubectl delete node on 1 & 3, because they remained in the list of resources.
At this point, microk8s status on node 2 says it is alone, and without HA.
And yet, nothing works! So my problem is not because of the HA cluster?
kubelite actually stopped crash looping, but that didn't resolve anything. I can't manipulate any resource in the cluster; everything is frozen. Pods are stuck either creating or terminating, and nothing I do changes their status (even those in kube-system). Every command is extremely slow to respond and most time out. There are no events. I tried to delete pods in longhorn-system and kube-system with --force to see if they would respawn, but nope, they just remain there; even worse, some still report as running on nodes that don't exist anymore.
Maybe a little late, but I then realized k8s-dqlite is using 100% CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1586219 root 20 0 2552.0m 974.7m 18.8m S 105.9 16.4 1:58.30 /snap/microk8s/4595/bin/k8s-dqlite --storage-dir=/var/snap/microk8s/4595/var/kubernetes/backend/ --listen=unix:///var/snap/microk8s/4595/var/kubernetes/backend/kine.sock:12379
Is that because the dqlite database is corrupted?
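One thing that might be worth a look while it spins is the size of the dqlite backend directory; a very large dataset or a pile of segment files can keep k8s-dqlite busy. Just a rough indicator, not a definitive test:
$ sudo du -sh /var/snap/microk8s/4595/var/kubernetes/backend/
$ sudo ls -lh /var/snap/microk8s/4595/var/kubernetes/backend/ | tail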
I tried to stop, run the "reconfigure" command from the https://microk8s.io/docs/restore-quorum article, and restart. It didn't change anything.
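For reference, that procedure essentially comes down to stopping MicroK8s everywhere and editing cluster.yaml on the surviving node so it lists only that node, roughly in this shape (the Address and ID below are placeholders; keep the real values already in your file):
$ microk8s stop
$ cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
- Address: 10.0.0.12:19001    # placeholder: the surviving node's address
  ID: 3297041220608546238     # placeholder: keep the ID already in the file
  Role: 0                     # voter
$ microk8s start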
PLEASE HELP!
@Dunge after turning the cluster into a single node, do you see whether k8s-dqlite is crash looping? This process is not a pod but a systemd service, though.
Curious to know what was running in the cluster before it went down?
No, k8s-dqlite is/was using 100% CPU, but not crash looping. What was crash looping was kubelite.
I'm currently in the process of trying to re-attach the nodes following the "Recover from lost quorum" article. I copied the files and am trying to issue the add-node/join commands now, but they take like 10 minutes to respond. I guess I'll go with a microk8s reset soon and start from scratch, hoping my Longhorn data survives.
My cluster has, as I mentioned, Longhorn as the volume CSI, and in the default namespace I have a RabbitMQ cluster, a Redis cluster (both from Bitnami Helm charts), an SFTP server, a NuGet server (liget), and 11 of my own dotnet services, none doing anything really special.
After attempting to add a node back (it took forever), I can't even see the cluster status anymore.
$ microk8s status
Traceback (most recent call last):
File "/snap/microk8s/4985/scripts/wrappers/status.py", line 209, in <module>
enabled, disabled = get_status(available_addons, isReady)
File "/snap/microk8s/4985/scripts/wrappers/common/utils.py", line 533, in get_status
kube_output = kubectl_get("all,ingress")
File "/snap/microk8s/4985/scripts/wrappers/common/utils.py", line 215, in kubectl_get
return run(KUBECTL, "get", cmd, "--all-namespaces", die=False)
File "/snap/microk8s/4985/scripts/wrappers/common/utils.py", line 77, in run
result.check_returncode()
File "/snap/microk8s/4985/usr/lib/python3.8/subprocess.py", line 448, in check_returncode
raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('/snap/microk8s/4985/microk8s-kubectl.wrapper', 'get', 'all,ingress', '--all-namespaces')' returned non-zero exit status 1.
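The traceback is just the status wrapper giving up on the underlying kubectl call, so running that call directly shows the real error:
$ microk8s kubectl get all,ingress --all-namespaces
# or the exact wrapper the script uses:
$ /snap/microk8s/4985/microk8s-kubectl.wrapper get all,ingress --all-namespaces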
I normally do a sudo snap remove microk8s --purge followed by sudo snap install microk8s... instead of microk8s reset. The reset requires a running Kubernetes, as it interacts with the API server.
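Put together, the suggested sequence would look something like this (1.26 being the channel already in use here):
$ microk8s stop                      # if the services are still flapping
$ sudo snap remove microk8s --purge
$ sudo snap install microk8s --classic --channel=1.26
$ microk8s status --wait-ready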
I am trying to understand what could be the reason the cluster went south. Does the node have enough memory/cpu to run your workloads plus microk8s?
Thanks, I'll do that.
All three nodes had more than enough, and it ran fine for about 3 months. A single node does not have enough RAM for everything, but the workload didn't even start when it was a single node anyway (it could not mount volumes since Longhorn and CoreDNS were also down).
But since this was a long-term stability test to determine if the solution was stable enough for production, the result is obviously a failure.
Yeah, something is really wrong with my node2, I'm not even able to install it anymore!
I did a sudo snap remove microk8s --purge on all three with no problem, then installed it back successfully on 1 & 3. But on node2 (which currently sits at 0% CPU usage, 6 GB of free memory and a ton of free disk space, with practically no other application installed or running on that Linux VM; it's practically empty):
sudo snap install microk8s --classic --channel=1.26
error: cannot perform the following tasks:
- Start snap "microk8s" (4595) services (systemctl command [start snap.microk8s.daemon-containerd.service] failed with exit status 1: Job for snap.microk8s.daemon-containerd.service failed because a timeout was exceeded.
See "systemctl status snap.microk8s.daemon-containerd.service" and "journalctl -xeu snap.microk8s.daemon-containerd.service" for details.
)
What the hell? And of course the snap.microk8s.daemon-containerd.service log only says Failed with result 'timeout'.
After issuing the install command 4 times, it finally went through?
I then had an issue where my CoreDNS service and Calico pods were all green, but name resolution of services inside the cluster would fail. I had to kubectl -n kube-system rollout restart deployment coredns to fix it. Weird.
Now I still have some issues spawning workloads, mostly when it comes to node2: sometimes it works, sometimes the pods just crash loop. I'm still unable to find out what is different about this node compared to the others and why it causes problems.
Everything now seems to be related to a CoreDNS issue. I tried to follow this guide to diagnose it: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
From some pods, contacting services works properly; from others it doesn't. It doesn't matter what node they run on. nslookup will time out from some pods while working from others. /etc/resolv.conf points correctly to the IP of the CoreDNS service on all of them. But from some pods the request never reaches the CoreDNS pod (as seen by enabling logging in the CoreDNS ConfigMap). I can ping the pod directly.
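For reference, the checks from that guide look roughly like this (the dnsutils pod is the one from the linked page):
$ microk8s kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
$ microk8s kubectl exec -i -t dnsutils -- nslookup kubernetes.default
$ microk8s kubectl exec -i -t dnsutils -- cat /etc/resolv.conf
$ microk8s kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50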
I tried to disable and re-enable the dns plugin, it didn't help.
Following the guide above, I sometimes see some "SERVFAIL" logs. They mention checking ClusterRoles. The only difference I see is that the "node get" permission isn't in the CoreDNS ClusterRole, but since that's filled in by MicroK8s, I doubt this is an issue.
Really stumped again. I don't know how I manage to get myself into these situations; this should "just work", but it doesn't.
Is it worth looking at how node2 differs from the node 1 and 3 systems, knowing that both 1 & 3 are working? The fact that it took several attempts to install microk8s on node2 isn't a good sign.
Yeah I know, that bothers me too. They are all hosted in the same place (VMs under a VMware hypervisor): same specs, same OS, same CPU/RAM/disk usage, same software installed, same network. I ran network and storage benchmarks and they are the same, and I don't see anything strange in the syslogs. I really can't understand why microk8s refused to install (timing out) after being purged, when the system was practically idle. They are not potatoes; they should have enough power for that.
In any case, after installation, recreating an HA cluster went fine, and I never saw the dqlite CPU issue again.
Recap: DNS resolution of service.namespace.svc.cluster.local still fails from some pods, and I can't understand why.
@Dunge do you need an HA cluster? In case you do not, a node can join as a worker only by using the --worker argument during join.
Without knowing what is bothering node2, I am afraid it may blow up the cluster again. Let's see.
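For reference, joining a node as a worker only would look roughly like this (the IP and <token> are placeholders printed by add-node):
$ microk8s add-node                                   # run on an existing control-plane node
$ microk8s join 10.0.0.11:25000/<token> --worker      # run on the joining node, using the printed connection string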
Yes. The whole sales pitch for why we spent so much development effort moving to Docker and k8s is stability and resilience, along with providing easy zero-downtime updates. Splitting the workload across multiple machines is secondary. Our application doesn't have that much processing to do or that many concurrent requests to process, but it just can't afford to go down.
Our main server is cloud hosted, so no problem there. But we still have a few customers who require an air-gapped system not exposed to the internet, with local data. So with microk8s, the goal was for them to just provide a few Linux machines, and we would handle installation and management.
The current cluster is our dev/staging environment, so it is not critical, and it was used as a test of whether microk8s was stable enough to use on customers' networks. But with the results here (kubelite crashing and the dqlite CPU-hog issue), I'm not confident at all.
It's pretty certain it's due to the host VM itself, but unless I have a way to prove it is flawed in some way, I can't confirm that.
And now I still have the CoreDNS issue to fix, which I don't even know where to start on. I have to get my staging environment back on track.
I have the exact same issues. For no reason that I can tell, my 3-node cluster toppled over: same issues with dqlite CPU usage, the kube API is barely responsive, dropping connections, not finishing responses, etc... I upgraded microk8s from 1.24 to 1.25 with no observable change; still battling this.
@alexgorbatchev glad to see I'm not alone. My experience was filled with so many different symptoms that it's hard to pinpoint whether they are all related or not, but based on your comment it seems very similar.
Personally, after restarting it all from scratch and hitting that new DNS issue mentioned above (on a practically empty cluster), I gave up and let it rest as-is while I concentrated my efforts on building my AWS EKS cluster (which works perfectly, btw). When I came back 2 weeks later, I saw that it had still been broken for the first week, and then it suddenly repaired itself with absolutely no manual intervention, and every pod ran with no errors for the second week. So I reinstalled my own service pods, and it is still running up to this day.
I absolutely cannot tell what caused it to go haywire, and I hope it won't happen again, but for now everything works fine again.
Unfortunately, the problem never went away.
I feel this thread is overloaded with different symptoms of different problems, and it is hard to maintain a clear picture. I believe it would be better if I open a new, more concise one.
My previous "DNS not responding" issue was just iptables resetting on one node after a reboot, not the main problem I'm having.
Currently on microk8s 1.27.1, Longhorn 1.4.1.
Two principal symptoms:
- Readiness probes failing for random services (even kube-system ones), maybe once every 6 hours (no specific time frame, pretty random); they come back okay straight afterward.
- One random node becomes NodeStatusUnknown about once per week and comes back a minute or two later, but this causes a lot of pods to try to terminate and migrate, makes the disk replicas move, and ends up disrupting the services.
Absolutely no logs in any pod or in the microk8s kubelite process before the event. I have tons of Grafana dashboards, including node process exporters and Kubernetes events, and I don't see any spikes or anything else weird. The only change I see is one minute AFTER the event, when it tries to come back.
What are the possible reasons microk8s would mark a node as NodeStatusUnknown?
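As far as I know, NodeStatusUnknown simply means the node's heartbeat (its lease in kube-node-lease) stopped being renewed for longer than the grace period, so the control plane marked its conditions Unknown. These might help catch it in the act (the node name is only an example):
$ microk8s kubectl describe node jldocker2 | grep -A10 Conditions
$ microk8s kubectl get lease -n kube-node-lease
$ microk8s kubectl get events -A --field-selector involvedObject.kind=Node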
Closed in favor of a new, more precise issue.
Summary
Ubuntu 20.04.5 LTS
microk8s 1.26.1 installed via snap
3-node HA cluster
I came back one morning and my cluster that was working fine the previous day was completely down. Unlike my previous similar experience, this time the disks still have space remaining.
At this point, microk8s inspect would just freeze at the "Gathering system information" step after printing that the services were running.
I then launched microk8s start; nothing happened for a few minutes, then it exited with no message. microk8s status still says it is not running.
I tried to check the logs using sudo journalctl -u snap.microk8s.daemon-kubelite, but there's too much output and I couldn't find anything relevant.
I then rebooted the nodes with sudo reboot, and they actually came back online with status Ready.
The conditions look good:
But there is a taint preventing any workload from getting scheduled on them:
Events on pods:
The events seem to show the kubelet service crashing in a loop:
The "Invalid Disk Capacity" part seems to be normal, because the pods handling the disk CSI (Longhorn) are not running because of the node taint.
Any command I send to microk8s kubectl prints some error message about memcache.go:
Any idea what is causing this? From my point of view the nodes seem healthy: RAM, CPU and disk are good, and networking between them is also functional.
Introspection Report
Here is my microk8s inspect tarball file hosted on my Google Drive: https://drive.google.com/file/d/1LEaCI3O2Y82OI4Q2quo-uIBD-zV8rZBA/view?usp=sharing