canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io

Unable to use H100 gpu with microk8s #4556

Closed: lolwww closed this issue 21 hours ago

lolwww commented 3 days ago

Summary

I am running MicroK8s v1.28.9 on an Azure VM with an H100 GPU, with NVIDIA gpu-operator v23.9.1. nvidia-smi shows the GPU is OK:

ubuntu@cgpu-test:~$ nvidia-smi
Tue Jun 25 13:46:41 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000001:00:00.0 Off |                    0 |
| N/A   35C    P0             61W /  400W |       1MiB /  95830MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

microk8s enable gpu also completes OK. I have also tried disabling and re-enabling the gpu addon and rebooting the host, with no result.

microk8s enable gpu
...
NAME: gpu-operator
LAST DEPLOYED: Tue Jun 25 10:35:04 2024
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
NVIDIA is enabled
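
For reference, one way to check whether the device plugin actually registered the GPU with the kubelet is to inspect the node's allocatable resources (a sketch; this assumes a single-node cluster, so the first node is the VM itself):

microk8s kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'

On a healthy setup this prints 1; here it stays empty, which matches the Insufficient nvidia.com/gpu scheduling error shown below.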

However, if I run a simple GPU test, it fails (see below).

What Should Happen Instead?

GPU should work as expected.

Reproduction Steps

1. Have an H100 VM on Azure with NVIDIA drivers installed
2. sudo snap install kubectl --classic
3. mkdir -p ~/.kube
4. sudo snap install microk8s --classic --channel=1.28/stable
5. sudo usermod -a -G microk8s ubuntu
6. sudo chown -R ubuntu ~/.kube
7. newgrp microk8s # or restart the console
8. microk8s enable dns hostpath-storage ingress metallb:10.64.140.43-10.64.140.49 rbac
9. microk8s config > ~/.kube/config
10. microk8s enable gpu
11. kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
12. kubectl describe po/cuda-vector-add
Status:           Pending
  Warning  FailedScheduling  4m26s (x13 over 54m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

13. kubectl logs pod/nvidia-container-toolkit-daemonset-sp4zd -n gpu-operator-resources
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
Error from server (BadRequest): container "nvidia-container-toolkit-ctr" in pod "nvidia-container-toolkit-daemonset-sp4zd" is waiting to start: PodInitializing

14. kubectl logs pod/nvidia-container-toolkit-daemonset-sp4zd -c driver-validation -n gpu-operator-resources
time="2024-06-25T13:55:10Z" level=info msg="version: 8072420d"
time="2024-06-25T13:55:10Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Tue Jun 25 13:55:10 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000001:00:00.0 Off |                    0 |
| N/A   35C    P0             61W /  400W |       1MiB /  95830MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
time="2024-06-25T13:55:10Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2024-06-25T13:55:10Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

Introspection Report

inspection-report-20240625_140304.tar.gz

Can you suggest a fix?

The error suggests it has to do with https://github.com/NVIDIA/gpu-operator/issues/430. That issue is still open, but it suggests a manual symlink workaround. However, I have not been able to figure out exactly which symlinks to create to make it work. Workaround suggestions are welcome, thank you.
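
From that gpu-operator issue, the manual workaround appears to be creating the /dev/char symlinks on the host with the nvidia-ctk CLI shipped in nvidia-container-toolkit, though I have not verified it here:

sudo nvidia-ctk system create-dev-char-symlinks --create-all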

gustavosr98 commented 3 days ago

@lolwww Can you try the following

kubectl set env -n gpu-operator-resources ds nvidia-operator-validator -c nvidia-operator-validator DISABLE_DEV_CHAR_SYMLINK_CREATION=true
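
Alternatively, the same variable can be set through the ClusterPolicy, which is what the error message itself suggests (a sketch; cluster-policy is the name the operator normally gives the resource):

kubectl patch clusterpolicy/cluster-policy --type merge -p \
  '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'
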
lolwww commented 3 days ago

@gustavosr98 that didn't help, Gustavo. Same result after executing it.
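
For what it's worth, this is one way to double-check that the variable actually landed on the daemonset (a sketch):

kubectl -n gpu-operator-resources get ds nvidia-operator-validator -o yaml | grep -A1 DISABLE_DEV_CHAR_SYMLINK_CREATION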

gustavosr98 commented 2 days ago

When checking the logs inside the VM, this looks interesting to me:

[..] level=info msg="Error: error validating driver installation: error creating symlink creator: failed to create NVIDIA device nodes: failed to create device node nvidiactl: failed to determine major: invalid device node\n\n

It feels like the operator is trying to create /dev/nvidiactl, but when checking I saw it was already on the machine.
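
One way to compare what is on the host against what the validator expects (a sketch; the values in the comments are the typical NVIDIA majors, not captured from this VM):

ls -l /dev/nvidia*         # e.g. crw-rw-rw- 1 root root 195, 255 ... /dev/nvidiactl
grep nvidia /proc/devices  # e.g. 195 nvidia-frontend

The "failed to determine major" part of the error suggests the validator cannot map nvidiactl to a major number even though the device node exists.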

lolwww commented 21 hours ago

After trying many things, I got it working with the latest version of the NVIDIA gpu-operator, since the default one does not support the H100 and nvidia-driver-550.54.14:

microk8s enable gpu --version v24.3.0

and a newer testing vectoradd image:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1 # To request physical full GPUs
        # nvidia.com/mig-1g.5gb: 1 # To request MIG GPUs
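
The manifest is applied the same way as the earlier test (a sketch; the filename is assumed):

kubectl apply -f cuda-vectoradd.yaml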

kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Test PASSED