NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

GPU driver validator errors unable to load kernel module nvidia-modeset. #671

Open eliphatfs opened 8 months ago

eliphatfs commented 8 months ago

1. Quick Debug Information

2. Issue or feature description

I installed gpu-operator with Helm, disabling the driver and toolkit components because they already exist and are tested to be working. The installation was mainly for monitoring metrics.

The operator-validator and container-toolkit-daemonset are in an error state. The operator-validator fails with the following log:

Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-modeset: exit status 1; output=modprobe: ERROR: could not insert 'nvidia_modeset': No such device

Failed to create symlinks under /dev/char that point to all possible NVIDIA character devices.
The existence of these symlinks is required to address the following bug:

This bug impacts container runtimes configured with systemd cgroup management enabled.
To disable the symlink creation, set the following envvar in ClusterPolicy:

    validator:
      driver:
        env:
        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
          value: "true"
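
For anyone who wants to apply the setting from the message above without editing the ClusterPolicy by hand, here is a minimal sketch that builds the same fragment as a JSON merge patch. It assumes the `validator.driver.env` fragment nests under `spec` in the ClusterPolicy resource; the helper names `build_validator_patch` and `patch_as_json` are hypothetical and not part of gpu-operator.

```python
import json

# Env var name and value are copied from the validator's message above.
ENV_NAME = "DISABLE_DEV_CHAR_SYMLINK_CREATION"


def build_validator_patch(env_name: str = ENV_NAME, value: str = "true") -> dict:
    """Return a merge-patch body that sets one env var on validator.driver."""
    # Assumption: the fragment in the message lives under `spec` in ClusterPolicy.
    return {
        "spec": {
            "validator": {
                "driver": {
                    "env": [{"name": env_name, "value": value}],
                }
            }
        }
    }


def patch_as_json(patch: dict) -> str:
    """Serialize the patch compactly so it can be passed on a command line."""
    return json.dumps(patch, separators=(",", ":"))
```

The resulting string could be passed to `kubectl patch clusterpolicy <name> --type merge -p '<json>'`, where `<name>` is whatever the ClusterPolicy is called in the cluster. A JSON merge patch replaces the whole `env` list, so any entries already set there would need to be included in the patch as well.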

It seems odd that the symlinks are required to address a bug when the message leaves the bug reference empty, and nvidia-modeset should only be relevant for display drivers. The cluster is an H100 cluster without display support.
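
As context for the modprobe failure, one way to see which NVIDIA kernel modules the host has actually loaded is to read `/proc/modules`, where module names appear with underscores (for example `nvidia_modeset`). The following is a minimal sketch only; the helper names are hypothetical and it does not reproduce what the validator itself does.

```python
from pathlib import Path


def loaded_nvidia_modules(proc_modules_text: str) -> set:
    """Return the names of loaded kernel modules that start with 'nvidia'."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field on each line is the module name.
        if fields and fields[0].startswith("nvidia"):
            names.add(fields[0])
    return names


def read_proc_modules(path: str = "/proc/modules") -> str:
    """Read the kernel's module list; return '' if the file is absent."""
    p = Path(path)
    return p.read_text() if p.exists() else ""
```

Calling `loaded_nvidia_modules(read_proc_modules())` on a GPU node and checking whether `nvidia_modeset` is in the result would show whether the module the validator tries to load is present at all.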

3. Steps to reproduce the issue

I am unsure. I cannot stop and redo the whole cluster from scratch as there is other stuff running.

4. Information to attach (optional if deemed irrelevant)

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    0   N/A  N/A      3575      G   /usr/lib/xorg/Xorg                           60MiB |
    |    0   N/A  N/A      3866      G   /usr/bin/gnome-shell                         79MiB |
    +---------------------------------------------------------------------------------------+



The containerd logs are too large to attach here.

cdesiniotis commented 7 months ago

cc @shivamerla @elezar