1. Quick Debug Information

2. Issue or feature description

I installed gpu-operator with Helm, disabling the driver and toolkit components since they already exist on the nodes and are tested to be working. The installation was mainly for monitoring metrics.

The operator-validator and container-toolkit-daemonset pods are in an error state. The operator-validator fails with the following log:
```
Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-modeset: exit status 1; output=modprobe: ERROR: could not insert 'nvidia_modeset': No such device

Failed to create symlinks under /dev/char that point to all possible NVIDIA character devices.
The existence of these symlinks is required to address the following bug:

This bug impacts container runtimes configured with systemd cgroup management enabled.
To disable the symlink creation, set the following envvar in ClusterPolicy:

validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"
```
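For anyone else hitting this, the workaround from the log can presumably also be applied to the live ClusterPolicy with a merge patch; a sketch, assuming the default resource name `cluster-policy` created by the chart:

```shell
# Sketch: merge-patch to set the env var on an existing ClusterPolicy.
# The resource name "cluster-policy" is the chart default; adjust if renamed.
PATCH='
spec:
  validator:
    driver:
      env:
        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
          value: "true"'
echo "$PATCH"
# Apply with (requires cluster access):
#   kubectl patch clusterpolicy cluster-policy --type merge -p "$PATCH"
```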
It seems odd that this is required to address an empty bug reference, and nvidia-modeset should only be relevant for display drivers. The cluster is an H100 cluster without display support.
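As a quick sanity check of which NVIDIA kernel modules are actually loaded on the node, one can filter `/proc/modules`; a minimal sketch (the piped sample input here is hypothetical, standing in for a headless node where nvidia_modeset is absent):

```shell
# Sketch: print loaded NVIDIA kernel module names from /proc/modules content.
# On a real node, run: nvidia_mods < /proc/modules
nvidia_mods() { awk '$1 ~ /^nvidia(_|$)/ {print $1}'; }
# Demo on sample content (hypothetical node state, no nvidia_modeset loaded):
printf 'nvidia_uvm 1536000 0\nnvidia 56000000 2\next4 94208 1\n' | nvidia_mods
```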
3. Steps to reproduce the issue
I am unsure. I cannot tear down and rebuild the whole cluster from scratch, as other workloads are running on it.
4. Information to attach (optional if deemed irrelevant)
kubectl get ds -n OPERATOR_NAMESPACE
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3575 G /usr/lib/xorg/Xorg 60MiB |
| 0 N/A N/A 3866 G /usr/bin/gnome-shell 79MiB |
+---------------------------------------------------------------------------------------+
The containerd logs are too big; I cannot attach them here.
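A filtered excerpt might be enough in the meantime; a sketch (the function name and file names are placeholders, not part of any tool):

```shell
# Sketch: keep only NVIDIA-related lines, capped at the last 500 matches.
# On the node: filter_nvidia < containerd.log > containerd-nvidia.log
filter_nvidia() { grep -i 'nvidia' | tail -n 500; }
# Demo on sample log lines (hypothetical):
printf 'level=info msg=ok\nlevel=error msg="nvidia toolkit failed"\n' | filter_nvidia
```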