NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Installing nvidia/tao/tao-getting-started:4.0.0 (TAO Toolkit API) fails with the error message "Back-off restarting failed container" #502

Closed: ShangWeiKuo closed this issue 1 year ago

ShangWeiKuo commented 1 year ago


1. Quick Debug Checklist

The logs of the pod stuck in "Init:CrashLoopBackOff" status are shown below:

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia driver module is already loaded with refcount 262
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/admin-ops01 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-xvc9v condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_uvm           1163264  0
nvidia_drm             61440  8
nvidia_modeset       1142784  2 nvidia_drm
nvidia              40796160  262 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   491520  12 drm_kms_helper,nvidia,nvidia_drm
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node admin-ops01...
node/admin-ops01 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-gxwn9, kube-system/kube-proxy-sqgld, nvidia-gpu-operator/gpu-operator-1679224169-node-feature-discovery-worker-2j6bf, nvidia-gpu-operator/nvidia-driver-daemonset-22g6f
evicting pod nvidia-gpu-operator/gpu-operator-7bfc5f55-qr7tv
evicting pod kube-system/coredns-669f87ff6-lr76c
evicting pod nvidia-gpu-operator/gpu-operator-1679224169-node-feature-discovery-master-68b9kl8bl
evicting pod kube-system/calico-kube-controllers-7f76d48f74-gfp6v
evicting pod kube-system/coredns-669f87ff6-4jnxm
pod/gpu-operator-1679224169-node-feature-discovery-master-68b9kl8bl evicted
pod/gpu-operator-7bfc5f55-qr7tv evicted
pod/calico-kube-controllers-7f76d48f74-gfp6v evicted
pod/coredns-669f87ff6-4jnxm evicted
pod/coredns-669f87ff6-lr76c evicted
node/admin-ops01 drained
Unloading NVIDIA driver kernel modules...
nvidia_uvm           1163264  0
nvidia_drm             61440  8
nvidia_modeset       1142784  2 nvidia_drm
nvidia              40796160  262 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   491520  12 drm_kms_helper,nvidia,nvidia_drm
Could not unload NVIDIA driver kernel modules, driver is in use
Uncordoning node admin-ops01...
node/admin-ops01 uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/admin-ops01 unlabeled
Error from server (BadRequest): container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-22g6f" is waiting to start: PodInitializing
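
Logs like these can be collected with roughly the following commands; the pod name and namespace are taken from this cluster, and the init-container name k8s-driver-manager is an assumption based on recent GPU Operator releases, so adjust them to your deployment:

```shell
# Fetch the init-container logs of the crashing driver daemonset pod.
# The init-container name "k8s-driver-manager" is assumed here.
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-22g6f -c k8s-driver-manager

# Describe the pod to see why the main container is stuck in PodInitializing.
kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-22g6f
```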

2. Steps to reproduce the issue

First, download the resource with the NGC CLI (ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0") and edit the hosts file.

After configuring the hosts file, run bash setup.sh install, as sketched below.
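
For reference, the reproduction steps as shell commands; the directory name and path are assumptions based on the TAO getting-started package layout:

```shell
# Download the TAO getting-started package with the NGC CLI.
ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"

# Change into the bare-metal quickstart directory (path assumed from the
# package layout) and edit the "hosts" inventory for the target node(s).
cd tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal
# vi hosts

# Run the installer, which deploys the GPU Operator and the TAO Toolkit API.
bash setup.sh install
```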

3. Information to attach (optional if deemed irrelevant)

shivamerla commented 1 year ago

@ShangWeiKuo it looks like the node already has NVIDIA drivers installed. Please disable the driver container in the GPU Operator by passing --set driver.enabled=false. From v1.11.x onwards this is handled automatically, so I suggest using the latest version of the GPU Operator, which detects pre-installed drivers and disables the driver container on its own.
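
A minimal sketch of that install, assuming the operator is deployed via the Helm chart (release name and namespace below are illustrative):

```shell
# Skip the containerized driver because the NVIDIA driver is already
# installed on the host (the kernel modules are loaded, as the logs show).
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false
```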