canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
Apache License 2.0

microk8s enable nvidia is not complete #4557

Open ACodingfreak opened 1 week ago

ACodingfreak commented 1 week ago


I have a 3-node cluster running MicroK8s 1.29.4, with an NVIDIA RTX 3060 in the gpu01 node.

$ microk8s.kubectl get nodes
gpu01   Ready    <none>   47m   v1.29.4
mm321   Ready    <none>   57m   v1.29.4
mm322   Ready    <none>   48m   v1.29.4

After running microk8s enable nvidia on the master node (mm321), several GPU operator pods are stuck in the Init state, and nvidia-container-toolkit-daemonset is in Init:CrashLoopBackOff:

mm321:~$ microk8s.kubectl get po -A
NAMESPACE                NAME                                                         READY   STATUS                  RESTARTS      AGE
gpu-operator-resources   gpu-feature-discovery-9kgs4                                  0/1     Init:0/1                0             11m
gpu-operator-resources   gpu-operator-999cc8dcc-qjkf2                                 1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-gc-7cc7ccfff8-rhsjx      1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-master-d8597d549-s2dpl   1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-czp56             1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-gth9b             1/1     Running                 0             12m
gpu-operator-resources   gpu-operator-node-feature-discovery-worker-vz46v             1/1     Running                 0             12m
gpu-operator-resources   nvidia-container-toolkit-daemonset-44fxd                     0/1     Init:CrashLoopBackOff   7 (58s ago)   11m
gpu-operator-resources   nvidia-dcgm-exporter-7cc72                                   0/1     Init:0/1                0             11m
gpu-operator-resources   nvidia-device-plugin-daemonset-jjvnr                         0/1     Init:0/1                0             11m
gpu-operator-resources   nvidia-operator-validator-mcxj2                              0/1     Init:0/4                0             11m
ingress                  nginx-ingress-microk8s-controller-7q8jn                      1/1     Running                 0             49m
ingress                  nginx-ingress-microk8s-controller-pj44d                      1/1     Running                 0             54m
ingress                  nginx-ingress-microk8s-controller-wm9b9                      1/1     Running                 0             48m
kube-system              calico-kube-controllers-77bd7c5b-ksrq6                       1/1     Running                 0             58m
kube-system              calico-node-b8fql                                            1/1     Running                 0             48m
kube-system              calico-node-fm9qz                                            1/1     Running                 0             49m
kube-system              calico-node-j82ml                                            1/1     Running                 0             49m
kube-system              coredns-864597b5fd-8tzxm                                     1/1     Running                 0             58m
kube-system              hostpath-provisioner-756cd956bc-t78f9                        1/1     Running                 1 (49m ago)   54m
metallb-system           controller-5f7bb57799-gs4vm                                  1/1     Running                 0             54m
metallb-system           speaker-5g865                                                1/1     Running                 0             49m
metallb-system           speaker-ld7cc                                                1/1     Running                 0             54m
metallb-system           speaker-sv2nd                                                1/1     Running                 0             48m
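To narrow a listing like the one above down to the affected pods, a small filter can help. This is a minimal sketch, not part of MicroK8s: the function name `stuck_init_pods` is hypothetical, and it assumes the six-column layout printed by `get po -A` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: read `kubectl get po -A` output on stdin and print
# "namespace/name status" for every pod whose STATUS column starts with "Init".
# Assumes the header is on the first line and STATUS is the fourth column.
stuck_init_pods() {
  awk 'NR > 1 && $4 ~ /^Init/ { print $1 "/" $2, $4 }'
}
```

It can be used as `microk8s.kubectl get po -A | stuck_init_pods`. For each pod it reports, `microk8s.kubectl -n gpu-operator-resources describe pod <name>` shows the init container state and recent events, which is usually the next place to look.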

What Should Happen Instead?

The GPU operator pods should not be stuck in the Init state.

Reproduction Steps

  1. Install the microk8s 1.29.4 snap on all nodes
  2. Add nodes mm322 and gpu01 to mm321
  3. On mm321, run microk8s enable nvidia

Introspection Report

$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6809/var/kubernetes/backend/localnode.yaml': No such file or directory

WARNING:  Maximum number of inotify user watches is less than the recommended value of 1048576.
          Increase the limit with:
                 echo fs.inotify.max_user_watches=1048576 | sudo tee -a /etc/sysctl.conf
                 sudo sysctl --system
Building the report tarball
  Report tarball is at /var/snap/microk8s/6809/inspection-report-20240630_000040.tar.gz


Can you suggest a fix?


Are you interested in contributing with a fix?


ACodingfreak commented 1 week ago

Workaround: I was using the NVIDIA 545 driver on the gpu01 node. After downgrading to the 535 driver, the issue seems to be resolved.
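To check which driver branch a GPU node is on before or after applying this workaround, something like the following may be useful. It is only a sketch: the helper names `driver_major` and `is_535_branch` are hypothetical, and the version strings in the comments are illustrative rather than taken from this report.

```shell
# Hypothetical helper: print the major branch of an NVIDIA driver version
# string, i.e. everything before the first dot ("545.29.06" -> "545").
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed only if the version string belongs to the
# 535 branch named in the workaround.
is_535_branch() {
  [ "$(driver_major "$1")" = "535" ]
}
```

On gpu01, the installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `driver_major`. How the downgrade itself is done depends on the distribution and on how the driver was installed, so it is not shown here.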