canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE #4536

Open · nobuto-m opened this issue 1 month ago

nobuto-m commented 1 month ago

Summary

During the Sunbeam deployment described in https://microstack.run/docs/multi-node-maas, MicroK8s restarted, which caused Juju charm hook failures because the K8s API endpoint was unavailable.

More context is available at: https://bugs.launchpad.net/snap-openstack/+bug/2067451

$ snap list microk8s
Name      Version  Rev   Tracking            Publisher   Notes
microk8s  v1.28.7  6532  1.28-strict/stable  canonical✓  -

May 29 06:08:00 machine-1 microk8s.daemon-kubelite[13121]: E0529 06:08:00.604719   13121 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=15s": context deadline exceeded
May 29 06:08:00 machine-1 microk8s.daemon-kubelite[13121]: I0529 06:08:00.604859   13121 leaderelection.go:285] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
May 29 06:08:00 machine-1 microk8s.daemon-kubelite[13121]: E0529 06:08:00.604948   13121 controllermanager.go:302] "leaderelection lost"
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 13min 7.868s CPU time.
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 1.
May 29 06:08:01 machine-1 systemd[1]: Stopped Service for snap application microk8s.daemon-kubelite.
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 13min 7.868s CPU time.
May 29 06:08:01 machine-1 systemd[1]: Started Service for snap application microk8s.daemon-kubelite.
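For context on the failure chain above: kube-controller-manager maintains a leader-election Lease in the kube-system namespace and must renew it within the renew deadline (the upstream defaults are --leader-elect-lease-duration=15s, --leader-elect-renew-deadline=10s, --leader-elect-retry-period=2s). When the node's CPU is saturated, the renewal PUT times out, leadership is lost, and the controller-manager exits by design. Because kubelite runs all control-plane components in a single process, that exit takes the API server down with it until systemd restarts the service. A quick way to watch the lease while reproducing (the object and namespace names are the upstream defaults, visible in the log URL above):

$ microk8s kubectl -n kube-system get lease kube-controller-manager -o yaml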

What Should Happen Instead?

MicroK8s shouldn't restart once the initial k8s cluster deployment is done.

Reproduction Steps

  1. Follow the steps in: https://microstack.run/docs/multi-node-maas
  2. You may see random Juju k8s charm hook failures caused by an inaccessible k8s endpoint, e.g.:

ops.model.ModelError: ERROR cannot ensure service account "unit-keystone-mysql-2": Post "https://192.168.151.102:16443/api/v1/namespaces/openstack/serviceaccounts": read tcp 192.168.151.101:33354->192.168.151.102:16443: read: connection reset by peer
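One way to confirm the endpoint is actually flapping during hook execution is to poll the API server's health endpoint from another machine. A rough sketch, assuming the address from the traceback above; /readyz is the standard kube-apiserver health endpoint, which may answer 401 without credentials, but any HTTP status still proves the endpoint is reachable:

# prints an HTTP status once per second; a failed connection shows
# as "000" followed by "unreachable"
while true; do
  curl -sk -o /dev/null -w '%{http_code}\n' https://192.168.151.102:16443/readyz \
    || echo unreachable
  sleep 1
done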

FWIW, 30 charms are deployed on top of a single MicroK8s node, and the CPU was heavily used by charm hook executions running at the same time as host processes such as MicroK8s itself, including dqlite.

[Screenshot: Node Exporter Full dashboard, Grafana, 2024-05-29 18:32:34. 06:08:01 corresponds to 15:08:01 in this graph, where the CPU became saturated.]

cinder-2 cinder-ceph-2 cinder-ceph-mysql-router-2 cinder-mysql-2 cinder-mysql-router-1 glance-2 glance-mysql-1 glance-mysql-router-2 horizon-1 horizon-mysql-1 horizon-mysql-router-2 keystone-2 keystone-mysql-2 keystone-mysql-router-2 neutron-1 neutron-mysql-1 neutron-mysql-router-1 nova-2 nova-api-mysql-router-1 nova-cell-mysql-router-1 nova-mysql-2 nova-mysql-router-1 ovn-central-2 ovn-relay-1 placement-2 placement-mysql-2 placement-mysql-router-1 rabbitmq-1 traefik-1 traefik-public-1

Introspection Report

inspection-report-20240529_091424.tar.gz

sunbeam-inspection-report-20240529_071507.tar.gz

nobuto-m commented 1 month ago

Ah, there is no option to inject a custom value for --leader-elect-lease-duration through the microk8s charm to see whether it would mitigate the issue: https://microk8s.io/docs/ref-launch-config
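For anyone who wants to test that theory by hand rather than through the charm, a rough, untested sketch: MicroK8s keeps per-component arguments under /var/snap/microk8s/current/args/, so the leader-election timings could be raised directly on each node (the flags are upstream kube-controller-manager flags; the values here are arbitrary):

# append the flags to the controller-manager args file and restart kubelite
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
sudo snap restart microk8s.daemon-kubelite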