NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Installing nvidia/tao/tao-getting-started:4.0.0 (TAO Toolkit API) fails with the error message "Back-off restarting failed container" #502

Closed: ShangWeiKuo closed this issue 1 year ago

ShangWeiKuo commented 1 year ago


1. Quick Debug Checklist

The logs of the pod stuck in "Init:CrashLoopBackOff" status are shown below:

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia driver module is already loaded with refcount 262
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/admin-ops01 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-xvc9v condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_uvm           1163264  0
nvidia_drm             61440  8
nvidia_modeset       1142784  2 nvidia_drm
nvidia              40796160  262 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   491520  12 drm_kms_helper,nvidia,nvidia_drm
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node admin-ops01...
node/admin-ops01 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-gxwn9, kube-system/kube-proxy-sqgld, nvidia-gpu-operator/gpu-operator-1679224169-node-feature-discovery-worker-2j6bf, nvidia-gpu-operator/nvidia-driver-daemonset-22g6f
evicting pod nvidia-gpu-operator/gpu-operator-7bfc5f55-qr7tv
evicting pod kube-system/coredns-669f87ff6-lr76c
evicting pod nvidia-gpu-operator/gpu-operator-1679224169-node-feature-discovery-master-68b9kl8bl
evicting pod kube-system/calico-kube-controllers-7f76d48f74-gfp6v
evicting pod kube-system/coredns-669f87ff6-4jnxm
pod/gpu-operator-1679224169-node-feature-discovery-master-68b9kl8bl evicted
pod/gpu-operator-7bfc5f55-qr7tv evicted
pod/calico-kube-controllers-7f76d48f74-gfp6v evicted
pod/coredns-669f87ff6-4jnxm evicted
pod/coredns-669f87ff6-lr76c evicted
node/admin-ops01 drained
Unloading NVIDIA driver kernel modules...
nvidia_uvm           1163264  0
nvidia_drm             61440  8
nvidia_modeset       1142784  2 nvidia_drm
nvidia              40796160  262 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   491520  12 drm_kms_helper,nvidia,nvidia_drm
Could not unload NVIDIA driver kernel modules, driver is in use
Uncordoning node admin-ops01...
node/admin-ops01 uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/admin-ops01 unlabeled
Error from server (BadRequest): container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-22g6f" is waiting to start: PodInitializing
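
Logs like these can be collected with roughly the following commands; the pod name and namespace are taken from this cluster, and the init-container name k8s-driver-manager is an assumption based on recent GPU Operator releases, so adjust them to your deployment:

```shell
# Fetch the init-container logs of the crashing driver daemonset pod.
# The init-container name "k8s-driver-manager" is assumed here.
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-22g6f -c k8s-driver-manager

# Describe the pod to see why the main container is stuck in PodInitializing.
kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-22g6f
```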

2. Steps to reproduce the issue

First, download the resource with the NGC CLI (ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0") and edit the hosts file.

After configuring the hosts file, run bash setup.sh install, as sketched below.
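
For reference, the reproduction steps as shell commands; the directory name and path are assumptions based on the TAO getting-started package layout:

```shell
# Download the TAO getting-started package with the NGC CLI.
ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"

# Change into the bare-metal quickstart directory (path assumed from the
# package layout) and edit the "hosts" inventory for the target node(s).
cd tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal
# vi hosts

# Run the installer, which deploys the GPU Operator and the TAO Toolkit API.
bash setup.sh install
```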

3. Information to attach (optional if deemed irrelevant)

shivamerla commented 1 year ago

@ShangWeiKuo it looks like the node already has NVIDIA drivers installed. Please disable the driver container in the GPU Operator by passing --set driver.enabled=false. From v1.11.x onwards this is handled automatically, so I suggest using the latest version of the GPU Operator, which detects pre-installed drivers and disables the driver container on its own.
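
A minimal sketch of that install, assuming the operator is deployed via the Helm chart (release name and namespace below are illustrative):

```shell
# Skip the containerized driver because the NVIDIA driver is already
# installed on the host (the kernel modules are loaded, as the logs show).
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false
```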