canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

microk8s.daemon-kubelite produces tons of error logs on all nodes #4681

Open · PRNDA opened 1 month ago

PRNDA commented 1 month ago

Summary

We have a 4-node MicroK8s HA cluster that has been running for 2 years. Recently we found that the "microk8s.daemon-kubelite" service on all nodes is producing a flood of error logs like this:

Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.421789  205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.668481  205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.691204  205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.900976  205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
Sep 24 16:27:01 svr02 microk8s.daemon-kubelite[205845]: E0924 16:27:01.949677  205845 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
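
(For reference, the rate of these errors on a node can be gauged with something like the following; this assumes the default MicroK8s snap unit name and that the logs are in journald:)

# Count how often the expired-token error appeared in the last hour on this node
journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep -c "service account token has expired"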

I tried restarting the service with systemctl restart snap.microk8s.daemon-kubelite, but it did not help. I also searched for this error message around the web but did not find anything useful.

All pods seem to be running fine, and I am still able to update our deployments (although updates are much slower than before).

Can someone help me resolve this problem?

Cluster status:

root@svr02:~# microk8s.status
microk8s is running
high-availability: yes
  datastore master nodes: 172.16.40.232:19001 172.16.40.231:19001 172.16.40.233:19001
  datastore standby nodes: 172.16.218.180:19001
addons:
  enabled:
    dns                  # CoreDNS
    ha-cluster           # Configure high availability on the current node
    ingress              # Ingress controller for external access
    metrics-server       # K8s Metrics Server for API access to service metrics
    prometheus           # Prometheus operator for monitoring and logging
    rbac                 # Role-Based Access Control for authorisation
    storage              # Storage class; allocates storage from host directory

microk8s inspect:

root@svr02:~# microk8s inspect
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting juju
  Inspect Juju
Inspecting kubeflow
  Inspect Kubeflow
Inspecting dqlite
  Inspect dqlite

Building the report tarball
  Report tarball is at /var/snap/microk8s/4916/inspection-report-20240924_162747.tar.gz

louiseschmidtgen commented 1 month ago

Hello @PRNDA,

thank you for reporting this issue to us.

Could you please upload the inspection report you created at /var/snap/microk8s/4916/inspection-report-20240924_162747.tar.gz? With this information we can better assist you in resolving the issue.

Thank you!

PRNDA commented 1 month ago

I created this inspection report yesterday, but I found some sensitive information in the logs, so I decided not to upload it here. Is there a way I can send it to you privately?

louiseschmidtgen commented 1 month ago

Hi @PRNDA,

how would you prefer to share it? Would you be able to upload the inspection report somewhere we could pull it from?

PRNDA commented 1 month ago

Hi @louiseschmidtgen ,

I created a private repo here and uploaded the inspection file to it. Could you please accept my repo invitation first and then download the inspection file?

Sorry for the inconvenience.

louiseschmidtgen commented 1 month ago

Hello @PRNDA ,

I have received your invitation and have access to the logs.

Thank you for sharing the inspection report, I will be having a look shortly.

louiseschmidtgen commented 1 month ago

Linking this issue as possibly related: https://github.com/canonical/microk8s/issues/4293

louiseschmidtgen commented 1 month ago

Hello @PRNDA,

are you able to reproduce this issue on a more recent MicroK8s snap? You are currently running v1.23, which is out of support.

With kind regards, Louise
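
(For reference, an in-place MicroK8s upgrade is typically done node by node by refreshing the snap channel one minor version at a time; the channel and node name below are illustrative:)

# Optional: move workloads off the node before upgrading it
microk8s kubectl drain <node-name> --ignore-daemonsets
# Step the snap forward one minor version at a time, e.g. 1.23 -> 1.24
sudo snap refresh microk8s --channel=1.24/stable
microk8s status --wait-ready
microk8s kubectl uncordon <node-name>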

PRNDA commented 1 month ago

I'm afraid I cannot; this is a production system, and I'm not allowed to upgrade it.

ClaudZen commented 1 month ago

Have you tried deleting Calico-Node pods?
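
(For reference, on a stock MicroK8s install the calico-node pods carry the k8s-app=calico-node label; worth verifying on your cluster first. They can then be deleted one at a time, and the DaemonSet recreates each one:)

# List the calico-node pods (one per node)
microk8s kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
# Delete a single pod; the DaemonSet brings it back on the same node
microk8s kubectl -n kube-system delete pod <one-calico-node-pod-name>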

PRNDA commented 1 month ago

Will this interrupt the running pods?

ClaudZen commented 1 month ago

Deleting the Calico-Node pods should not interrupt the execution of other pods, as Kubernetes will automatically re-schedule new Calico-Node pods to maintain network connectivity. However, there might be a temporary disruption in pod networking while the new Calico pods start.
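
(A lower-impact option, assuming the DaemonSet is named calico-node in kube-system as in a stock MicroK8s install, is a rolling restart, which recreates the pods one node at a time:)

microk8s kubectl -n kube-system rollout restart daemonset/calico-node
microk8s kubectl -n kube-system rollout status daemonset/calico-node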

PRNDA commented 1 month ago

"There might be a temporary disruption in pod networking"

That's what I'm worried about. This cluster is running several online systems, and I don't want them to be affected.