NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown #898

Closed khaykingleb closed 1 month ago

khaykingleb commented 1 month ago

1. Quick Debug Information

2. Issue or feature description

The nvidia-operator-validator-.* pod does not start correctly and enters an Init:CrashLoopBackOff state with the error message nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown. This issue persists until the problematic pods (nvidia-operator-validator-.*, gpu-feature-discovery-.*, nvidia-dcgm-exporter-.*, nvidia-device-plugin-daemonset-.*) are deleted and recreated.
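As a temporary workaround, deleting the affected pods lets their owning DaemonSets recreate them. A minimal sketch, assuming the Operator is installed in the gpu-operator namespace (the namespace is an assumption; adjust -n to your release):

    # Delete the stuck pods so their DaemonSets recreate them; this clears the
    # nvml mismatch error until it recurs. Namespace is assumed to be gpu-operator.
    kubectl get pods -n gpu-operator --no-headers \
      | grep -E 'nvidia-operator-validator|gpu-feature-discovery|nvidia-dcgm-exporter|nvidia-device-plugin-daemonset' \
      | awk '{print $1}' \
      | xargs -r kubectl delete pod -n gpu-operator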

3. Steps to reproduce the issue

  1. Deploy the Helm chart with the following values (a sketch of the corresponding install command is included after the values block):

    
    driver:
      # By default, the Operator deploys NVIDIA drivers as a container on the system.
      # Set this value to false when using the Operator on systems with pre-installed drivers.
      enabled: true

      # Version of the NVIDIA datacenter driver supported by the Operator.
      version: 550.90.07

      upgradePolicy:
        # Global switch for the automatic upgrade feature.
        # If set to false, all other options are ignored.
        autoUpgrade: true
        # How many nodes can be upgraded in parallel.
        # 0 means no limit; all nodes will be upgraded in parallel.
        maxParallelUpgrades: 1

    migManager:
      # The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed.
      # By default, the MIG manager only runs on nodes with GPUs that support MIG (e.g. A100).
      enabled: false
      # Controls the strategy to be used with MIG on supported NVIDIA GPUs.
      # Options are either mixed or single.
      strategy: single

    toolkit:
      # By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a
      # container on the system. Set this value to false when using the Operator on systems with
      # pre-installed NVIDIA runtimes.
      enabled: true

      # Version of the NVIDIA Container Toolkit supported by the Operator.
      version: v1.16.1-ubuntu20.04

      # Environment variables for configuring the NVIDIA Container Toolkit.
      # NOTE: https://www.virtualthoughts.co.uk/2022/11/21/installing-using-the-nvidia-gpu-operator-in-k3s-with-rancher
      env:
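For context, a values file like the one above is applied when installing or upgrading the chart. The command below is a sketch that assumes the standard NVIDIA Helm repository, a gpu-operator namespace, and that the values are saved as values.yaml; adjust the names to your setup:

    # Add NVIDIA's Helm repository (skip if already added).
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    # Install or upgrade the GPU Operator with the values shown above.
    # Namespace and values file name are assumptions; adjust to your environment.
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator --create-namespace \
      -f values.yaml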

tariq1890 commented 1 month ago

Can you SSH into your host and run the following?

    sudo dpkg -l | grep nvidia

Please ensure that there are no NVIDIA driver packages from a different version. If any are present, please clean them up and try again.
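For reference, a minimal cleanup sketch on an Ubuntu host. The 535 pattern below is only an illustration of a stale driver branch; replace it with whatever leftover version the listing actually shows:

    # List installed NVIDIA packages and note any whose version differs from the
    # driver the Operator deploys (550.90.07 in this case).
    sudo dpkg -l | grep nvidia
    # Purge the stale packages. The 535 pattern is illustrative only; substitute
    # the leftover version found above.
    sudo apt-get purge 'nvidia-*535*' 'libnvidia-*535*'
    sudo apt-get autoremove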

khaykingleb commented 1 month ago

I see, it does indeed have some packages from a different version, even after running sudo nvidia-uninstall -s on the node. After deleting the leftover packages and rebooting the node, everything is working as expected. Thank you for your help.
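For anyone landing here with the same symptom: after purging the leftover packages and rebooting, the kernel module and the user-space libraries should report the same driver version again. A quick verification sketch (the gpu-operator namespace is assumed):

    # The driver version reported by the user-space tools and the kernel module
    # should now match.
    nvidia-smi
    cat /proc/driver/nvidia/version
    # The validator and plugin pods should reach Running without the init error.
    kubectl get pods -n gpu-operator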