NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Nvidia-driver always installing the driver when the pod restarts #831

Open uselessidbr opened 4 months ago

uselessidbr commented 4 months ago


1. Quick Debug Information

2. Issue or feature description

Every time the nvidia-driver-daemonset pod restarts, it installs the driver all over again, even when the kernel modules are already loaded.

3. Steps to reproduce the issue

Just kill the “nvidia-driver-daemonset” pod and it will trigger a driver reinstall.
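
For reference, a minimal way to reproduce this from the command line. This is only a sketch: it assumes a default install in the gpu-operator namespace and the app=nvidia-driver-daemonset pod label, so adjust for your setup.

# Find the driver pod (namespace and label are assumptions for a default install)
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset

# Delete it; the replacement pod runs the full driver install again
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset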

4. Information to attach (optional if deemed irrelevant)

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                                    Usage |
|=========================================================================================|
|    0   N/A  N/A    2087110      C   python3                                     7112MiB |
|    0   N/A  N/A    2088265      C   python3                                     7112MiB |
+-----------------------------------------------------------------------------------------+

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

cdesiniotis commented 4 months ago

@uselessidbr this is the current limitation with our driver containers -- it will always re-install the driver on a container restart. There is some work ongoing to avoid re-installations on container restarts, but there is no timeline for when that would make it into a release.
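
A quick way to see this in practice (a sketch, assuming a default install in the gpu-operator namespace with the app=nvidia-driver-daemonset pod label): the kernel modules stay loaded on the host while the replacement container re-runs the installer.

# On the GPU node: the nvidia kernel modules remain loaded across the pod restart
lsmod | grep nvidia

# Follow the new driver pod; its log shows the driver install running again
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset -f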

uselessidbr commented 4 months ago

@cdesiniotis Thanks very much for your reply!

This behaviour triggers another problem, causing this error in the initContainer:

Could not unload NVIDIA driver kernel modules, driver is in use

To work around it, I had to set the following (see the consolidated sketch at the end of this comment for where these fields sit):

  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"

Also:

  upgradePolicy:
    autoUpgrade: false

The latter is needed to work around this error:

Auto eviction of GPU pods on node NODENAME is disabled by the upgrade policy

This error is triggered even with these settings:

    gpuPodDeletion:
      force: true
      timeoutSeconds: 300
      deleteEmptyDir: true

Any reason why we’re getting this error?
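
For readers hitting the same thing, the settings above combine into a single driver section of the Helm values / ClusterPolicy. This is only a sketch: it assumes the driver.manager.env and driver.upgradePolicy fields used by current chart versions, so check the field names against the chart you are running.

driver:
  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
  upgradePolicy:
    autoUpgrade: false
    gpuPodDeletion:
      force: true
      timeoutSeconds: 300
      deleteEmptyDir: true

This could be applied with, for example, helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator -f values.yaml (release name, repo alias, and namespace are assumptions).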