canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE #4536

Open · nobuto-m opened this issue 1 month ago

nobuto-m commented 1 month ago

Summary

During the Sunbeam deployment described in https://microstack.run/docs/multi-node-maas, MicroK8s restarted, which caused Juju charm hook failures because the K8s API endpoint was unavailable.

More context is available at: https://bugs.launchpad.net/snap-openstack/+bug/2067451

$ snap list microk8s
Name      Version  Rev   Tracking            Publisher   Notes
microk8s  v1.28.7  6532  1.28-strict/stable  canonical✓  -

May 29 06:08:00 machine-1 microk8s.daemon-kubelite[13121]: E0529 06:08:00.604719   13121 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=15s": context deadline exceeded
May 29 06:08:00 machine-1 microk8s.daemon-kubelite[13121]: I0529 06:08:00.604859   13121 leaderelection.go:285] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
May 29 06:08:00 machine-1 microk8s.daemon-kubelite[13121]: E0529 06:08:00.604948   13121 controllermanager.go:302] "leaderelection lost"
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 13min 7.868s CPU time.
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 1.
May 29 06:08:01 machine-1 systemd[1]: Stopped Service for snap application microk8s.daemon-kubelite.
May 29 06:08:01 machine-1 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 13min 7.868s CPU time.
May 29 06:08:01 machine-1 systemd[1]: Started Service for snap application microk8s.daemon-kubelite.
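For context on the failure chain above: kube-controller-manager maintains a leader-election Lease in the kube-system namespace and must renew it within the renew deadline (the upstream defaults are --leader-elect-lease-duration=15s, --leader-elect-renew-deadline=10s, --leader-elect-retry-period=2s). When the node's CPU is saturated, the renewal PUT times out, leadership is lost, and the controller-manager exits by design. Because kubelite runs all control-plane components in a single process, that exit takes the API server down with it until systemd restarts the service. A quick way to watch the lease while reproducing (the object and namespace names are the upstream defaults, visible in the log URL above):

$ microk8s kubectl -n kube-system get lease kube-controller-manager -o yaml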

What Should Happen Instead?

MicroK8s shouldn't restart once the initial k8s cluster deployment is done.

Reproduction Steps

  1. Follow the steps in: https://microstack.run/docs/multi-node-maas
  2. You may see random Juju k8s charm hook failures caused by an inaccessible k8s endpoint, e.g.:

ops.model.ModelError: ERROR cannot ensure service account "unit-keystone-mysql-2": Post "https://192.168.151.102:16443/api/v1/namespaces/openstack/serviceaccounts": read tcp 192.168.151.101:33354->192.168.151.102:16443: read: connection reset by peer
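One way to confirm the endpoint is actually flapping during hook execution is to poll the API server's health endpoint from another machine. A rough sketch, assuming the address from the traceback above; /readyz is the standard kube-apiserver health endpoint, which may answer 401 without credentials, but any HTTP status still proves the endpoint is reachable:

# prints an HTTP status once per second; a failed connection shows
# as "000" followed by "unreachable"
while true; do
  curl -sk -o /dev/null -w '%{http_code}\n' https://192.168.151.102:16443/readyz \
    || echo unreachable
  sleep 1
done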

FWIW, 30 charms are deployed on top of a single MicroK8s node, and the CPU was heavily used by charm hook executions running at the same time as host processes such as MicroK8s itself, including dqlite.

[Screenshot: Node Exporter Full dashboard, Grafana, 2024-05-29 18:32:34. 06:08:01 corresponds to 15:08:01 in this graph, where the CPU became saturated.]

cinder-2 cinder-ceph-2 cinder-ceph-mysql-router-2 cinder-mysql-2 cinder-mysql-router-1 glance-2 glance-mysql-1 glance-mysql-router-2 horizon-1 horizon-mysql-1 horizon-mysql-router-2 keystone-2 keystone-mysql-2 keystone-mysql-router-2 neutron-1 neutron-mysql-1 neutron-mysql-router-1 nova-2 nova-api-mysql-router-1 nova-cell-mysql-router-1 nova-mysql-2 nova-mysql-router-1 ovn-central-2 ovn-relay-1 placement-2 placement-mysql-2 placement-mysql-router-1 rabbitmq-1 traefik-1 traefik-public-1

Introspection Report

inspection-report-20240529_091424.tar.gz

sunbeam-inspection-report-20240529_071507.tar.gz

nobuto-m commented 1 month ago

Ah, there is no option to inject a custom value for --leader-elect-lease-duration through the microk8s charm to see whether it would mitigate the issue: https://microk8s.io/docs/ref-launch-config
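For anyone who wants to test that theory by hand rather than through the charm, a rough, untested sketch: MicroK8s keeps per-component arguments under /var/snap/microk8s/current/args/, so the leader-election timings could be raised directly on each node (the flags are upstream kube-controller-manager flags; the values here are arbitrary):

# append the flags to the controller-manager args file and restart kubelite
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
sudo snap restart microk8s.daemon-kubelite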