canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

microk8s is not running - on a 4 node Rasp Pi 3 B+ cluster #4449

Open steentottrup opened 7 months ago

steentottrup commented 7 months ago

Summary

I've just installed microk8s on 4 Raspberry Pi 3 B+ boards. They were installed with the Ubuntu 22.04.4 64-bit server OS. The first 3 nodes are joined as the control plane etc., and the 4th node is just a worker. Node 1 boots off a USB HDD, the other 3 are on SD cards. When I try to get the status, all I get back is this text:

"microk8s is not running. Use microk8s inspect for a deeper inspection."

Trying to enable dns, storage etc. fails. Here is the output from `microk8s enable dns`:

```
Traceback (most recent call last):
  File "/snap/microk8s/6565/scripts/wrappers/enable.py", line 41, in <module>
    enable(prog_name="microk8s enable")
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/6565/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/6565/scripts/wrappers/enable.py", line 37, in enable
    xable("enable", addons)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 470, in xable
    protected_xable(action, addon_args)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 498, in protected_xable
    unprotected_xable(action, addon_args)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 514, in unprotected_xable
    enabled_addons_info, disabled_addons_info = get_status(available_addons_info, True)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 566, in get_status
    kube_output = kubectl_get("all,ingress")
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 248, in kubectl_get
    return run(KUBECTL, "get", cmd, "--all-namespaces", die=False)
  File "/snap/microk8s/6565/scripts/wrappers/common/utils.py", line 69, in run
    result.check_returncode()
  File "/snap/microk8s/6565/usr/lib/python3.8/subprocess.py", line 448, in check_returncode
    raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '('/snap/microk8s/6565/microk8s-kubectl.wrapper', 'get', 'all,ingress', '--all-namespaces')' returned non-zero exit status 1.
```

What Should Happen Instead?

No errors when I joined the nodes together, so I was hoping everything was working and I could start putting workloads/services in the cluster.

Reproduction Steps

I've installed microk8s a few times now, first on 2 nodes, and latest on 4, to see if the number of nodes was the issue. Same thing every time. This is what I'm doing on a freshly installed Ubuntu 22.04.4:

sudo apt update && sudo apt upgrade -y && sudo reboot

sudo nano /boot/firmware/cmdline.txt, adding `cgroup_enable=memory cgroup_memory=1` to the (single) line in that file
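For anyone scripting this step instead of using nano: a sketch of an idempotent append, shown here against a throwaway copy of the file (on a real Pi you would point it at /boot/firmware/cmdline.txt and run as root; the kernel command line must stay on one line):

```shell
# Append the cgroup options only if they are not already present.
# A temp file stands in for /boot/firmware/cmdline.txt in this sketch.
OPTS="cgroup_enable=memory cgroup_memory=1"
CMDLINE=$(mktemp)
printf '%s\n' 'console=serial0,115200 console=tty1 root=LABEL=writable rootfstype=ext4 rootwait' > "$CMDLINE"

# Running this twice must not duplicate the options, hence the grep guard.
grep -q 'cgroup_enable=memory' "$CMDLINE" || sed -i "s/\$/ $OPTS/" "$CMDLINE"
grep -q 'cgroup_enable=memory' "$CMDLINE" || sed -i "s/\$/ $OPTS/" "$CMDLINE"   # second run is a no-op
cat "$CMDLINE"
```

A reboot is still required afterwards for the cgroup options to take effect.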

sudo apt install linux-modules-extra-raspi

sudo snap install microk8s --classic

sudo usermod -a -G microk8s rasppi

sudo chown -f -R rasppi ~/.kube

microk8s status --wait-ready

The last command seems to never return/end.

Introspection Report

```
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

Building the report tarball
  Report tarball is at /var/snap/microk8s/6565/inspection-report-20240303_075153.tar.gz
```

inspection-report-20240303_075153.tar.gz

ktsakalozos commented 6 months ago

Hi @steentottrup,

The error in the logs causing k8s to crashloop is:

```
Mar 03 07:50:01 pimk8s01 microk8s.daemon-kubelite[2282]: F0303 07:50:01.502923    2282 daemon.go:46] Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory
```

I think you are missing `sudo apt install linux-modules-extra-raspi`. Have a look at this docs page: https://microk8s.io/docs/install-raspberry-pi
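A quick, read-only way to check on each node whether the module that error complains about is actually loaded (standard Linux paths; the exact output will vary per node):

```shell
# kube-proxy fails when this sysctl file is missing, which means the
# nf_conntrack kernel module is not loaded.
if [ -e /proc/sys/net/netfilter/nf_conntrack_max ]; then
  state="loaded, nf_conntrack_max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
else
  state="not loaded"
fi
echo "nf_conntrack: $state"
```

If it reports "not loaded" on any node, that node will hit the crashloop above.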

steentottrup commented 6 months ago

Thank you for getting back to me. I'm using a "playbook" to get them all installed properly, and was pretty sure I had already installed the raspi extras.

Just to make sure I ran it again on all 4 nodes.

microk8s-3 microk8s microk8s-1 microk8s-2

It doesn't seem to be the problem. I'll dig around now that you have located the issue for me.

bartecargo commented 5 months ago

@steentottrup did you ever get to the bottom of this?

steentottrup commented 5 months ago

No, I'm no closer to a solution. I'm not really a Linux/Ubuntu expert, so I've looked at the logs, but haven't found the actual problem (or solution) yet.

bartecargo commented 5 months ago

I'm experiencing the same problem, but on Ubuntu 22.04. Someone else also appears to have encountered it with a clean install of the same operating system:

It appears that I've been able to temporarily get the node back up by running the following:

```
modprobe nf_conntrack
```
nickbrennan1 commented 5 months ago

@bartecargo that's a great spot, thanks. I'd been running stable on Ubuntu 20.04.5 LTS for ~18 months, and took microk8s up to v1.28 ~6 months ago without issue. Took v1.30 last week and saw the same error stack ~4 days after upgrading:

"Proxy exited open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"

Rebuilt microk8s on Monday @ v1.30, and it just happened again. Bart's modprobe resolved it for me.

matpen commented 3 months ago

I can confirm the above after upgrading to microk8s 1.30/stable. Switching to 1.30/edge as suggested in https://github.com/canonical/microk8s/issues/4361 does not help.

The modprobe command posted in https://github.com/canonical/microk8s/issues/4449#issuecomment-2048313991, followed by `microk8s start`, will instead fix the problem. To make the change permanent, follow the instructions in this SO answer.

The same is also outlined in this blog post and appears to be a microk8s shortcoming. If someone on the dev team sees this, they might want to investigate.
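For anyone landing here: a module loaded with `modprobe` does not survive a reboot, so "permanent" means listing it for loading at boot. A sketch of the usual systemd `modules-load.d` approach (the file name is my choice, and this may not match the linked answer exactly; a temp directory stands in for `/etc/modules-load.d`, which you would write to as root):

```shell
# Persist the module load via systemd-modules-load: one module name per line
# in a *.conf file. On a real node the target would be
# /etc/modules-load.d/nf_conntrack.conf (written as root).
confdir=$(mktemp -d)
conf="$confdir/nf_conntrack.conf"
echo 'nf_conntrack' > "$conf"
cat "$conf"
```

systemd reads every `*.conf` file in that directory early in boot, so the module is present before the microk8s services start.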

neoaggelos commented 2 months ago

Hi @matpen

So, MicroK8s should load br_netfilter before the services start in https://github.com/canonical/microk8s/blob/5403f433324281517e95231c61f2baa0e3b2573b/microk8s-resources/wrappers/run-kubelite-with-args#L214-L226

Would you mind sharing some logs from your machine, after the reboot? Can you check if there are any log lines like the ones shown? An inspection report would also do wonders to see what might be up.

For example, I wonder if this code is running early in the boot process, then br_netfilter fails to load and the code just proceeds

matpen commented 2 months ago

Hi @neoaggelos,

Thank you for following up on this.

Would you mind sharing some logs from your machine, after the reboot? Can you check if there are any log lines like the ones shown?

Here is a grep for br_netfilter. The second set of logs on July 6th is related to the reboot for which I wrote my comment above.

Filtered logs

`sudo grep br_netfilter /var/log/syslog.1`

```
Jul  2 17:26:36 kube02 microk8s.daemon-kubelite[2706]: + /sbin/modprobe br_netfilter
Jul  2 17:26:37 kube02 kernel: [  205.575330] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jul  2 17:26:37 kube02 microk8s.daemon-kubelite[2706]: + echo 'Successfully loaded br_netfilter module.'
Jul  2 17:26:37 kube02 microk8s.daemon-kubelite[2706]: Successfully loaded br_netfilter module.
Jul  6 11:50:54 kube02 microk8s.daemon-kubelite[3267]: + /sbin/modprobe br_netfilter
Jul  6 11:50:54 kube02 kernel: [  231.365695] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jul  6 11:50:54 kube02 microk8s.daemon-kubelite[3267]: + echo 'Successfully loaded br_netfilter module.'
Jul  6 11:50:54 kube02 microk8s.daemon-kubelite[3267]: Successfully loaded br_netfilter module.
```

This being a production machine, I am hesitant to share more info on the open channel, but I have a slice around the time where microk8s starts which might be useful. In any case, it looks like the module is loaded properly.

Unfiltered logs

`sudo grep 'Jul 6 11:50' /var/log/syslog.1`

```
Jul  6 11:50:21 kube02 systemd[1]: Created slice User Slice of UID 10001.
Jul  6 11:50:21 kube02 systemd[1]: Starting User Runtime Directory /run/user/10001...
Jul  6 11:50:21 kube02 systemd[1]: Finished User Runtime Directory /run/user/10001.
Jul  6 11:50:21 kube02 systemd[1]: Starting User Manager for UID 10001...
Jul  6 11:50:22 kube02 systemd[2693]: Queued start job for default target Main User Target.
Jul  6 11:50:22 kube02 systemd[2693]: Created slice User Application Slice.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Paths.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Timers.
Jul  6 11:50:22 kube02 systemd[2693]: Starting D-Bus User Message Bus Socket...
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG network certificate management daemon.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jul  6 11:50:22 kube02 systemd[2693]: Listening on GnuPG cryptographic agent and passphrase cache.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on debconf communication socket.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on REST API socket for snapd user session agent.
Jul  6 11:50:22 kube02 systemd[2693]: Listening on D-Bus User Message Bus Socket.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Sockets.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Basic System.
Jul  6 11:50:22 kube02 systemd[2693]: Reached target Main User Target.
Jul  6 11:50:22 kube02 systemd[2693]: Startup finished in 206ms.
Jul  6 11:50:22 kube02 systemd[1]: Started User Manager for UID 10001.
Jul  6 11:50:22 kube02 systemd[1]: Started Session 1 of User ansible.
Jul  6 11:50:36 kube02 systemd[2693]: Started D-Bus User Message Bus.
Jul  6 11:50:36 kube02 dbus-daemon[2816]: [session uid=10001 pid=2816] AppArmor D-Bus mediation is enabled
Jul  6 11:50:36 kube02 systemd[2693]: Started snap.microk8s.microk8s-e468b3be-a472-49f4-bc7a-632f1224bdfd.scope.
Jul  6 11:50:40 kube02 systemd[2693]: Started snap.microk8s.microk8s-49e263f6-23d1-4b0d-ae2c-c71f4d48ad98.scope.
Jul  6 11:50:40 kube02 dbus-daemon[1963]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.11' (uid=0 pid=1975 comm="/usr/lib/snapd/snapd " label="unconfined")
Jul  6 11:50:40 kube02 systemd[1]: Starting Time & Date Service...
Jul  6 11:50:41 kube02 dbus-daemon[1963]: [system] Successfully activated service 'org.freedesktop.timedate1'
Jul  6 11:50:41 kube02 systemd[1]: Started Time & Date Service.
Jul  6 11:50:41 kube02 systemd[1]: Reloading.
Jul  6 11:50:41 kube02 systemd[1]: Configuration file /run/systemd/system/netplan-ovs-cleanup.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jul  6 11:50:41 kube02 systemd[1]: Started Service for snap application microk8s.daemon-apiserver-kicker.
Jul  6 11:50:41 kube02 systemd[1]: Started Service for snap application microk8s.daemon-apiserver-proxy.
Jul  6 11:50:41 kube02 systemd[1]: Started Service for snap application microk8s.daemon-cluster-agent.
Jul  6 11:50:41 kube02 systemd[1]: Starting Service for snap application microk8s.daemon-containerd...
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + source /snap/microk8s/6876/actions/common/utils.sh
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: ++ [[ /snap/microk8s/6876/run-containerd-with-args == \/\s\n\a\p\/\m\i\c\r\o\k\8\s\/\6\8\7\6\/\a\c\t\i\o\n\s\/\c\o\m\m\o\n\/\u\t\i\l\s\.\s\h ]]
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + use_snap_env
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + export PATH=/snap/microk8s/6876/usr/bin:/snap/microk8s/6876/bin:/snap/microk8s/6876/usr/sbin:/snap/microk8s/6876/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Jul  6 11:50:41 kube02 microk8s.daemon-containerd[2932]: + PATH=/snap/microk8s/6876/usr/bin:/snap/microk8s/6876/bin:/snap/microk8s/6876/usr/sbin:/snap/microk8s/6876/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
```

From the above, it looks to me like microk8s correctly loads the module. However, I am also quite confident of what I reported in https://github.com/canonical/microk8s/issues/4449#issuecomment-2211755202. The situation was as follows:

So there is a slight chance that the combination "upgrade to edge + modprobe" somehow fixed the problem.

swagfin commented 2 weeks ago

You can set this up to be done automatically during boot. The command also checks whether the config entry already exists:

```
sudo modprobe nf_conntrack
grep -qxF 'nf_conntrack' /etc/modules || echo 'nf_conntrack' | sudo tee -a /etc/modules
```

(Split into two lines so that a failed `modprobe` does not also trigger the append, which a single `a && b || c` chain would do.)