Can't enable gpu on Nvidia DGX A100

This is a follow-up to the issue in https://github.com/ubuntu/microk8s/issues/2115

As described in that issue, version 1.21/beta of microk8s seems to work better to enable gpu. However, the instructions that work on Ubuntu 20.04 on a g3.4xlarge instance on AWS don't work on an Nvidia DGX A100 machine.

sudo snap install microk8s --channel=1.21/beta --classic microk8s enable gpu

I get the following pod in Init:CrashLoopBackOff:

ubuntu@blanka:~$ microk8s kubectl get pods -A
NAMESPACE                NAME                                                         READY   STATUS                  RESTARTS   AGE
kube-system              calico-node-m76km                                            1/1     Running                 0          16h
kube-system              coredns-86f78bb79c-bhl86                                     1/1     Running                 0          16h
kube-system              calico-kube-controllers-847c8c99d-fc48p                      1/1     Running                 0          16h
default                  gpu-operator-65d474cc8-g8gdp                                 1/1     Running                 0          16h
default                  gpu-operator-node-feature-discovery-worker-777t6             1/1     Running                 0          15h
default                  gpu-operator-node-feature-discovery-master-dcf999dc8-n5fk2   1/1     Running                 0          15h
gpu-operator-resources   nvidia-driver-daemonset-ndlds                                1/1     Running                 0          15h
gpu-operator-resources   nvidia-container-toolkit-daemonset-xwlbn                     1/1     Running                 0          15h
gpu-operator-resources   nvidia-device-plugin-daemonset-lx5j4                         0/1     Init:CrashLoopBackOff   186        15h

I haven't been able to find useful information yet. Here's a kubectl describe and kubectl logs (with no logs):

ubuntu@blanka:~$ microk8s kubectl describe pod nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources
Name:         nvidia-device-plugin-daemonset-lx5j4
Namespace:    gpu-operator-resources
Priority:     0
Node:         blanka/10.229.66.23
Start Time:   Mon, 22 Mar 2021 21:16:59 +0000
Labels:       app=nvidia-device-plugin-daemonset
              controller-revision-hash=b479cc95
              pod-template-generation=1
Annotations:  cni.projectcalico.org/podIP: 10.1.234.10/32
              cni.projectcalico.org/podIPs: 10.1.234.10/32
              scheduler.alpha.kubernetes.io/critical-pod: 
Status:       Pending
IP:           10.1.234.10
IPs:
  IP:           10.1.234.10
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
  toolkit-validation:
    Container ID:  containerd://580626297564adebebda2a69cc4172fcff6edab3e734ca5ac2134c48798bc88b
    Image:         nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Image ID:      nvcr.io/nvidia/k8s/cuda-sample@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      /tmp/vectorAdd
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 23 Mar 2021 12:07:43 +0000
      Finished:     Tue, 23 Mar 2021 12:07:43 +0000
    Ready:          False
    Restart Count:  179
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h7b7h (ro)
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.8.2-ubi8
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --mig-strategy=single
      --pass-device-specs=true
      --fail-on-init-error=true
      --device-list-strategy=envvar
      --nvidia-driver-root=/run/nvidia/driver
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h7b7h (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-h7b7h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.present=true
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Pulled   34m (x173 over 14h)    kubelet  Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
  Warning  BackOff  4m6s (x4093 over 14h)  kubelet  Back-off restarting failed container

ubuntu@blanka:~$ microk8s kubectl logs nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources
Error from server (BadRequest): container "nvidia-device-plugin-ctr" in pod "nvidia-device-plugin-daemonset-lx5j4" is waiting to start: PodInitializing

There are no nvidia drivers or cuda packages installed on the machine and never were (fresh MAAS deployment):

ubuntu@ip-172-31-14-39:~$ dpkg -l | grep -i -e nvidia -e cuda
ubuntu@ip-172-31-14-39:~$

I installed the nvidia drivers and utils just to run nvidia-smi to get more information:

ubuntu@blanka:~$ nvidia-smi
Tue Mar 23 13:06:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:07:00.0 Off |                    0 |
| N/A   26C    P0    43W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   25C    P0    44W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      Off  | 00000000:47:00.0 Off |                    0 |
| N/A   27C    P0    44W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      Off  | 00000000:4E:00.0 Off |                    0 |
| N/A   26C    P0    40W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      Off  | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0    42W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      Off  | 00000000:90:00.0 Off |                    0 |
| N/A   29C    P0    45W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      Off  | 00000000:B7:00.0 Off |                    0 |
| N/A   29C    P0    42W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      Off  | 00000000:BD:00.0 Off |                    0 |
| N/A   30C    P0    45W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      4916      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

I found a recipe that works for me. I'm using microk8s on Ubuntu 20.04, here's what I had to do to make this work on both a Nvidia DGX A100 as well as a ProLiant DL380 Gen10 machine with T4 GPU.

Make sure to start on a fresh Ubuntu 20.04 with no nvidia drivers (or apt purge the nvidia packages if you can't)
blacklist the nouveau driver (add modprobe.blacklist=nouveau nouveau.modeset=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run sudo update-grub and reboot)
[for the A100 only] Install the package nvidia-fabricmanager-460 from the cuda repos (you won't be able to enable the systemd service until the K8s GPU operator has loaded the drivers)
Install microk8s with the 1.21/beta channel (I have version v1.21.0-beta.1)
microk8s enable dns (tip: make sure your DNS is working by launching a test pod and resolving internal and external hostnames)
microk8s enable gpu
[for the A100 only] enable nvidia-fabricmanager (sudo systemctl --now enable nvidia-fabricmanager)

FYI, I did a write up on my adventures with microk8s and MIG on the A100 https://discuss.kubernetes.io/t/my-adventures-with-microk8s-to-enable-gpu-and-use-mig-on-a-dgx-a100/15366

I've followed the recipy and got issues with enabling nvidia-fabricmanager: fabric manager NVIDIA GPU driver interface version 460.91.03 don't match with driver version 460.73.01. Please update with matching NVIDIA driver package. It looks like microk8s enable gpu always loads nvidia drivers 460.73.01. Do you guys know how to harmonize the version of fabric manager with the driver version?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

canonical / microk8s

Can't enable gpu on Nvidia DGX A100 #2119