Closed: khaykingleb closed this issue 1 month ago
Can you SSH into your host and run the following?
sudo dpkg -l | grep nvidia
Please ensure that there are no NVIDIA driver packages from a different version. If any are present, please clean them up and try again.
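For context, a minimal sketch of what to look for in that output; the package names and versions in the comments are illustrative assumptions, not output from this node:

```sh
# List every installed NVIDIA package with its version.
sudo dpkg -l | grep nvidia

# In a clean install, all driver packages belong to the same driver branch, e.g.
#   ii  nvidia-driver-535   535.183.01-0ubuntu1 ...
#   ii  nvidia-utils-535    535.183.01-0ubuntu1 ...
# A stray package from another branch (for example a leftover 550 package next
# to 535 ones) is the kind of mismatch that produces the
# "driver/library version mismatch" error.

# Cross-check the loaded kernel module against the userspace library;
# nvidia-smi itself fails with the same mismatch message when they disagree.
cat /proc/driver/nvidia/version
nvidia-smi
```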
I see, the node does indeed have some packages from a different version, even after running `sudo nvidia-uninstall -s`. After deleting the leftover packages and rebooting the node, everything is working as expected. Thank you for your help!
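For anyone who lands here with the same symptoms, the cleanup roughly amounted to the sketch below; the purge target is a placeholder for whatever leftover packages `dpkg -l | grep nvidia` reports on your node, and the `gpu-operator` namespace is the usual default rather than something confirmed in this thread:

```sh
# Purge the leftover driver packages that nvidia-uninstall did not remove
# (replace the placeholder with the actual package names from dpkg -l).
sudo apt-get purge '<leftover-nvidia-package>'
sudo apt-get autoremove

# Reboot so the stale kernel module is unloaded.
sudo reboot

# If the operator pods are still stuck in Init:CrashLoopBackOff afterwards,
# delete them so they are recreated against the correct driver.
kubectl -n gpu-operator get pods --no-headers \
  | grep -E 'nvidia-operator-validator|gpu-feature-discovery|nvidia-dcgm-exporter|nvidia-device-plugin-daemonset' \
  | awk '{print $1}' \
  | xargs kubectl -n gpu-operator delete pod
```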
1. Quick Debug Information
2. Issue or feature description
The `nvidia-operator-validator-.*` pod does not start correctly and enters an `Init:CrashLoopBackOff` state with the `nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown` error message. This issue persists until the problematic pods (`nvidia-operator-validator-.*`, `gpu-feature-discovery-.*`, `nvidia-dcgm-exporter-.*`, `nvidia-device-plugin-daemonset-.*`) are deleted and recreated.
3. Steps to reproduce the issue
Deploy the Helm chart with the following values:
migManager:
  # The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed.
  # By default, the MIG manager only runs on nodes with GPUs that support MIG (for e.g. A100).
  enabled: false
  # Controls the strategy to be used with MIG on supported NVIDIA GPUs.
  # Options are either mixed or single.
  strategy: single

toolkit:
  # By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a
  # container on the system. Set this value to false when using the Operator on systems with
  # pre-installed NVIDIA runtimes.
  enabled: true
  # Version of the NVIDIA Container Toolkit supported by the Operator.
  version: v1.16.1-ubuntu20.04
  # Environment variables for configuring the NVIDIA Container Toolkit.
  # NOTE: https://www.virtualthoughts.co.uk/2022/11/21/installing-using-the-nvidia-gpu-operator-in-k3s-with-rancher
  env:
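For reference, deploying the chart with a values file like the one above would look roughly like this; the release name, namespace, and values file name are conventions/assumptions rather than details taken from this issue:

```sh
# Install or upgrade the GPU Operator with the custom values shown above.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --values values.yaml
```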