NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

GPU operator validator fails to create host device symlinks #539

Open adamancini opened 1 year ago

adamancini commented 1 year ago


1. Quick Debug Checklist

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:13:27Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2+k0s", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-23T11:38:50Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"linux/amd64"}
`kubectl describe clusterpolicies --all-namespaces` ``` Name: cluster-policy Namespace: Labels: app.kubernetes.io/component=gpu-operator app.kubernetes.io/instance=nvidia-gpu-operator app.kubernetes.io/managed-by=Helm app.kubernetes.io/name=gpu-operator app.kubernetes.io/version=v23.3.2 helm.sh/chart=gpu-operator-v23.3.2 Annotations: meta.helm.sh/release-name: nvidia-gpu-operator meta.helm.sh/release-namespace: gpu-operator API Version: nvidia.com/v1 Kind: ClusterPolicy Metadata: Creation Timestamp: 2023-06-12T19:13:48Z Generation: 2 Resource Version: 243638 UID: 2f934cb6-33ac-4190-93aa-62b4e6668843 Spec: Cdi: Default: false Enabled: false Daemonsets: Labels: app.kubernetes.io/managed-by: gpu-operator helm.sh/chart: gpu-operator-v23.3.2 Priority Class Name: system-node-critical Rolling Update: Max Unavailable: 1 Tolerations: Effect: NoSchedule Key: nvidia.com/gpu Operator: Exists Update Strategy: RollingUpdate Dcgm: Enabled: false Host Port: 5555 Image: dcgm Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: 3.1.7-1-ubuntu20.04 Dcgm Exporter: Enabled: true Env: Name: DCGM_EXPORTER_LISTEN Value: :9400 Name: DCGM_EXPORTER_KUBERNETES Value: true Name: DCGM_EXPORTER_COLLECTORS Value: /etc/dcgm-exporter/dcp-metrics-included.csv Image: dcgm-exporter Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/k8s Service Monitor: Additional Labels: Enabled: false Honor Labels: false Interval: 15s Version: 3.1.7-3.1.4-ubuntu20.04 Device Plugin: Enabled: true Env: Name: PASS_DEVICE_SPECS Value: true Name: FAIL_ON_INIT_ERROR Value: true Name: DEVICE_LIST_STRATEGY Value: envvar Name: DEVICE_ID_STRATEGY Value: uuid Name: NVIDIA_VISIBLE_DEVICES Value: all Name: NVIDIA_DRIVER_CAPABILITIES Value: all Image: k8s-device-plugin Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v0.14.0-ubi8 Driver: Cert Config: Name: Enabled: false Image: driver Image Pull Policy: IfNotPresent Kernel Module Config: Name: Licensing Config: Config Map Name: Nls Enabled: false Manager: Env: Name: ENABLE_GPU_POD_EVICTION Value: true Name: ENABLE_AUTO_DRAIN Value: false Name: DRAIN_USE_FORCE Value: false Name: DRAIN_POD_SELECTOR_LABEL Value: Name: DRAIN_TIMEOUT_SECONDS Value: 0s Name: DRAIN_DELETE_EMPTYDIR_DATA Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.6.1 Rdma: Enabled: false Use Host Mofed: false Repo Config: Config Map Name: Repository: nvcr.io/nvidia Startup Probe: Failure Threshold: 120 Initial Delay Seconds: 60 Period Seconds: 10 Timeout Seconds: 60 Upgrade Policy: Auto Upgrade: true Drain: Delete Empty Dir: false Enable: false Force: false Timeout Seconds: 300 Max Parallel Upgrades: 1 Max Unavailable: 25% Pod Deletion: Delete Empty Dir: false Force: false Timeout Seconds: 300 Wait For Completion: Timeout Seconds: 0 Use Precompiled: false Version: 525.105.17 Virtual Topology: Config: Gfd: Enabled: true Env: Name: GFD_SLEEP_INTERVAL Value: 60s Name: GFD_FAIL_ON_INIT_ERROR Value: true Image: gpu-feature-discovery Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v0.8.0-ubi8 Mig: Strategy: single Mig Manager: Config: Default: all-disabled Name: default-mig-parted-config Enabled: true Env: Name: WITH_REBOOT Value: false Gpu Clients Config: Name: Image: k8s-mig-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.5.2-ubuntu20.04 Node Status Exporter: Enabled: false Image: gpu-operator-validator Image Pull Policy: IfNotPresent Repository: 
nvcr.io/nvidia/cloud-native Version: v23.3.2 Operator: Default Runtime: docker Init Container: Image: cuda Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: 12.1.1-base-ubi8 Runtime Class: nvidia Psp: Enabled: false Sandbox Device Plugin: Enabled: true Image: kubevirt-gpu-device-plugin Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v1.2.1 Sandbox Workloads: Default Workload: container Enabled: false Toolkit: Enabled: true Env: Name: CONTAINERD_CONFIG Value: /etc/k0s/containerd.toml Name: CONTAINERD_SOCKET Value: /var/run/k0s/containerd.sock Image: container-toolkit Image Pull Policy: IfNotPresent Install Dir: /usr/local/nvidia Repository: nvcr.io/nvidia/k8s Version: v1.13.0-ubuntu20.04 Validator: Image: gpu-operator-validator Image Pull Policy: IfNotPresent Plugin: Env: Name: WITH_WORKLOAD Value: true Repository: nvcr.io/nvidia/cloud-native Version: v23.3.2 Vfio Manager: Driver Manager: Env: Name: ENABLE_GPU_POD_EVICTION Value: false Name: ENABLE_AUTO_DRAIN Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.6.1 Enabled: true Image: cuda Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: 12.1.1-base-ubi8 Vgpu Device Manager: Config: Default: default Name: Enabled: true Image: vgpu-device-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.2.1 Vgpu Manager: Driver Manager: Env: Name: ENABLE_GPU_POD_EVICTION Value: false Name: ENABLE_AUTO_DRAIN Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.6.1 Enabled: false Image: vgpu-manager Image Pull Policy: IfNotPresent Status: Namespace: gpu-operator State: notReady Events: ```

2. Issue or feature description

After installing the gpu-operator via Helm, the nvidia-operator-validator pod goes into CrashLoopBackOff, reporting that it is unable to create symlinks for host devices:

nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:136 => /dev/nvidia-caps/nvidia-cap136"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap136 /host-dev-char/235:136: file exists"
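That warning refers to a device symlink that already exists on the host. A quick way to see what the validator is colliding with (a hedged sketch, not output captured from this node):

```sh
# list the MAJOR:MINOR symlinks under /dev/char that point at NVIDIA devices
ls -l /dev/char | grep nvidia
# and the nvidia-caps character devices they target
ls -l /dev/nvidia-caps/
```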

3. Steps to reproduce the issue

Installed the host from an Arch Linux installation. Installed the NVIDIA drivers and the nvidia-container-toolkit package from the AUR.
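Roughly, the host preparation looks like the following sketch (package names and the AUR helper are assumptions, not an exact record):

```sh
# Arch host: driver from the official repos, container toolkit from the AUR (names assumed)
sudo pacman -S nvidia nvidia-utils
yay -S nvidia-container-toolkit
# sanity-check the driver before bootstrapping the cluster
nvidia-smi
```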

Set up Kubernetes using k0sctl with the following manifest, which automatically installs the listed Helm charts during bootstrapping:

k0sctl manifest ``` apiVersion: k0sctl.k0sproject.io/v1beta1 kind: Cluster metadata: name: k0s-cluster spec: hosts: - ssh: address: 10.144.84.45 user: root port: 22 keyPath: /Users/ada/.ssh/id_ecdsa role: controller+worker noTaints: true uploadBinary: false installFlags: - --profile gpu-enabled files: - name: containerd-config src: bootstrap/containerd/containerd.toml dstDir: /etc/k0s/ perm: "0755" dirPerm: null k0s: version: 1.27.2+k0s.0 dynamicConfig: false config: spec: workerProfiles: - name: gpu-enabled values: cgroupDriver: systemd network: provider: calico extensions: helm: concurrencyLevel: 5 repositories: - name: stable url: https://charts.helm.sh/stable - name: prometheus-community url: https://prometheus-community.github.io/helm-charts - name: netdata url: https://netdata.github.io/helmchart/ - name: sealed-secrets url: https://bitnami-labs.github.io/sealed-secrets - name: gitlab url: https://charts.gitlab.io/ - name: codimd url: https://helm.codimd.dev - name: bitnami url: https://charts.bitnami.com/bitnami - name: traefik url: https://traefik.github.io/charts - name: intel url: https://intel.github.io/helm-charts - name: nvidia url: https://helm.ngc.nvidia.com/nvidia - name: argo url: https://argoproj.github.io/argo-helm charts: - name: argo-cd chartname: argo/argo-cd namespace: argocd values: | configs: knownHosts: data: ssh_known_hosts: | github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg= github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl - name: prometheus-stack chartname: prometheus-community/prometheus version: "14.6.1" # timeout: 20m # order: 1 values: | alertmanager: persistentVolume: enabled: false server: persistentVolume: enabled: false namespace: monitoring - name: traefik chartname: traefik/traefik version: "23.1.0" # timeout: 10m # order: 2 values: | ingressClass: enabled: true isDefaultClass: true ingressRoute: dashboard: enabled: true providers: kubernetesCRD: enabled: true kubernetesIngress: enabled: true logs: general: level: DEBUG additionalArguments: - "--entrypoints.websecure.http.tls" - "--entrypoints.plex-pms.Address=:32400" - "--entrypoints.gitlab-ssh.Address=:2222/tcp" - "--providers.kubernetesIngress.ingressClass=traefik" - "--ping" - "--metrics.prometheus" - "--log.level=DEBUG" ports: traefik: port: 9000 web: port: 8000 expose: true exposedPort: 80 protocol: TCP websecure: port: 8443 expose: true exposedPort: 443 protocol: TCP tls: enabled: true metrics: port: 9100 service: enabled: true type: NodePort namespace: traefik - name: sealed-secrets chartname: sealed-secrets/sealed-secrets namespace: sealed-secrets # order: 2 # timeout: 10m - name: nvidia-gpu-operator chartname: nvidia/gpu-operator version: "v23.3.2" # timeout: 10m # order: 1 namespace: gpu-operator values: | driver: enabled: false toolkit: env: - name: CONTAINERD_CONFIG value: /etc/k0s/containerd.toml - name: CONTAINERD_SOCKET value: /var/run/k0s/containerd.sock enabled: false ```
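The manifest above is applied with the usual k0sctl flow (a sketch; the config file name is an assumption):

```sh
# bootstrap/reconcile the cluster from the manifest above (file name assumed)
k0sctl apply --config k0sctl.yaml
```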


4. Information to attach (optional if deemed irrelevant)


 - [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`

NAMESPACE      NAME                                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-operator   gpu-feature-discovery                               1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   11m
gpu-operator   nvidia-container-toolkit-daemonset                  1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       11m
gpu-operator   nvidia-dcgm-exporter                                1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           11m
gpu-operator   nvidia-device-plugin-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           11m
gpu-operator   nvidia-gpu-operator-node-feature-discovery-worker   1         1         1       1            1                                                              11m
gpu-operator   nvidia-mig-manager                                  0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             11m
gpu-operator   nvidia-operator-validator                           1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      11m
kube-system    calico-node                                         1         1         1       1            1           kubernetes.io/os=linux                             27h
kube-system    konnectivity-agent                                  1         1         1       1            1           kubernetes.io/os=linux                             27h
kube-system    kube-proxy                                          1         1         1       1            1           kubernetes.io/os=linux                             27h
monitoring     prometheus-stack-node-exporter                      1         1         1       1            1                                                              27h

 - [ ] If a pod/ds is in an error state or pending state `kubectl describe pod -n NAMESPACE POD_NAME`

Name: nvidia-operator-validator-hlbdw Namespace: gpu-operator Priority: 2000001000 Priority Class Name: system-node-critical Runtime Class Name: nvidia Service Account: nvidia-operator-validator Node: eve.annarchy.net/10.0.0.190 Start Time: Tue, 13 Jun 2023 17:17:56 -0400 Labels: app=nvidia-operator-validator app.kubernetes.io/managed-by=gpu-operator app.kubernetes.io/part-of=gpu-operator controller-revision-hash=594474b5cc helm.sh/chart=gpu-operator-v23.3.2 pod-template-generation=1 Annotations: cni.projectcalico.org/containerID: 8f8a5e4262d6e4b6d614b44ea10dca6009f3f16bd2eb9e28cc80c7682e5a883b cni.projectcalico.org/podIP: 10.244.109.115/32 cni.projectcalico.org/podIPs: 10.244.109.115/32 Status: Pending IP: 10.244.109.115 IPs: IP: 10.244.109.115 Controlled By: DaemonSet/nvidia-operator-validator Init Containers: driver-validation: Container ID: containerd://eb6e4ff67d265a56026b04d6dfd5c4d73c97fae7910bf3ee0a4bec825bdd9c1d Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:21dfc9c56b5f8bce73e60361d6e83759c3fa14dc6afc2d5ebdf1b891a936daf6 Port: Host Port: Command: sh -c Args: nvidia-validator State: Terminated Reason: Completed Exit Code: 0 Started: Tue, 13 Jun 2023 17:18:10 -0400 Finished: Tue, 13 Jun 2023 17:18:10 -0400 Ready: True Restart Count: 0 Environment: WITH_WAIT: true COMPONENT: driver Mounts: /host from host-root (ro) /host-dev-char from host-dev-char (rw) /run/nvidia/driver from driver-install-path (rw) /run/nvidia/validations from run-nvidia-validations (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro) toolkit-validation: Container ID: containerd://f6afbd9cf3178db48d0f2eb14ccc6a3277cf5982c5bec4eac07877ccec5bf7fe Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2 Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:21dfc9c56b5f8bce73e60361d6e83759c3fa14dc6afc2d5ebdf1b891a936daf6 Port: Host Port: Command: sh -c Args: nvidia-validator State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: StartError Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown Exit Code: 128 Started: Wed, 31 Dec 1969 19:00:00 -0500 Finished: Tue, 13 Jun 2023 17:29:14 -0400 Ready: False Restart Count: 7 Environment: NVIDIA_VISIBLE_DEVICES: all WITH_WAIT: false COMPONENT: toolkit Mounts: /run/nvidia/validations from run-nvidia-validations (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro) cuda-validation: Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2 Image ID:
Port: Host Port: Command: sh -c Args: nvidia-validator State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Environment: WITH_WAIT: false COMPONENT: cuda NODE_NAME: (v1:spec.nodeName) OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace) VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2 VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent VALIDATOR_RUNTIME_CLASS: nvidia Mounts: /run/nvidia/validations from run-nvidia-validations (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro) plugin-validation: Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2 Image ID:
Port: Host Port: Command: sh -c Args: nvidia-validator State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Environment: COMPONENT: plugin WITH_WAIT: false WITH_WORKLOAD: true MIG_STRATEGY: single NODE_NAME: (v1:spec.nodeName) OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace) VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2 VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent VALIDATOR_RUNTIME_CLASS: nvidia Mounts: /run/nvidia/validations from run-nvidia-validations (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro) Containers: nvidia-operator-validator: Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2 Image ID:
Port: Host Port: Command: sh -c Args: echo all validations are successful; sleep infinity State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Environment: Mounts: /run/nvidia/validations from run-nvidia-validations (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro) Conditions: Type Status Initialized False Ready False ContainersReady False PodScheduled True Volumes: run-nvidia-validations: Type: HostPath (bare host directory volume) Path: /run/nvidia/validations HostPathType: DirectoryOrCreate driver-install-path: Type: HostPath (bare host directory volume) Path: /run/nvidia/driver HostPathType:
host-root: Type: HostPath (bare host directory volume) Path: / HostPathType:
host-dev-char: Type: HostPath (bare host directory volume) Path: /dev/char HostPathType:
kube-api-access-kvx28: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: DownwardAPI: true QoS Class: BestEffort Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists node.kubernetes.io/pid-pressure:NoSchedule op=Exists node.kubernetes.io/unreachable:NoExecute op=Exists node.kubernetes.io/unschedulable:NoSchedule op=Exists nvidia.com/gpu:NoSchedule op=Exists Events: Type Reason Age From Message


Normal   Scheduled               11m                  default-scheduler  Successfully assigned gpu-operator/nvidia-operator-validator-hlbdw to eve.annarchy.net
Warning  FailedCreatePodSandBox  11m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Normal   Pulled                  11m                  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2" already present on machine
Normal   Created                 11m                  kubelet            Created container driver-validation
Normal   Started                 11m                  kubelet            Started container driver-validation
Normal   Created                 10m (x4 over 11m)    kubelet            Created container toolkit-validation
Warning  Failed                  10m (x4 over 11m)    kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown
Normal   Pulled                  9m58s (x5 over 11m)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2" already present on machine
Warning  BackOff                 101s (x47 over 11m)  kubelet            Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-hlbdw_gpu-operator(f498a2ff-8344-426e-83bd-ca27ec548856)


 - [ ] If a pod/ds is in an error state or pending state `kubectl logs -n NAMESPACE POD_NAME`

nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Detected pre-installed driver on the host"
nvidia-operator-validator-g5lnf driver-validation running command chroot with args [/host nvidia-smi]
nvidia-operator-validator-g5lnf driver-validation Tue Jun 13 21:30:56 2023
nvidia-operator-validator-g5lnf driver-validation +---------------------------------------------------------------------------------------+
nvidia-operator-validator-g5lnf driver-validation | NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1      |
nvidia-operator-validator-g5lnf driver-validation |-----------------------------------------+----------------------+----------------------+
nvidia-operator-validator-g5lnf driver-validation | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
nvidia-operator-validator-g5lnf driver-validation | Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
nvidia-operator-validator-g5lnf driver-validation |                                         |                      |               MIG M. |
nvidia-operator-validator-g5lnf driver-validation |=========================================+======================+======================|
nvidia-operator-validator-g5lnf driver-validation |   0  NVIDIA GeForce GTX 1070       Off | 00000000:01:00.0 Off |                  N/A |
nvidia-operator-validator-g5lnf driver-validation | 15%   44C    P5              17W / 166W |      0MiB /  8192MiB |      2%      Default |
nvidia-operator-validator-g5lnf driver-validation |                                         |                      |                  N/A |
nvidia-operator-validator-g5lnf driver-validation +-----------------------------------------+----------------------+----------------------+
nvidia-operator-validator-g5lnf driver-validation
nvidia-operator-validator-g5lnf driver-validation +---------------------------------------------------------------------------------------+
nvidia-operator-validator-g5lnf driver-validation | Processes:                                                                            |
nvidia-operator-validator-g5lnf driver-validation |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
nvidia-operator-validator-g5lnf driver-validation |        ID   ID                                                             Usage      |
nvidia-operator-validator-g5lnf driver-validation |=======================================================================================|
nvidia-operator-validator-g5lnf driver-validation |  No running processes found                                                           |
nvidia-operator-validator-g5lnf driver-validation +---------------------------------------------------------------------------------------+
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/195:254 => /dev/nvidia-modeset"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-modeset /host-dev-char/195:254: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/195:255 => /dev/nvidiactl"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidiactl /host-dev-char/195:255: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/510:0 => /dev/nvidia-uvm"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-uvm /host-dev-char/510:0: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/510:1 => /dev/nvidia-uvm-tools"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-uvm-tools /host-dev-char/510:1: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:1 => /dev/nvidia-caps/nvidia-cap1"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap1 /host-dev-char/235:1: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:2 => /dev/nvidia-caps/nvidia-cap2"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap2 /host-dev-char/235:2: file exists"
...
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:133 => /dev/nvidia-caps/nvidia-cap133"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap133 /host-dev-char/235:133: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:134 => /dev/nvidia-caps/nvidia-cap134"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap134 /host-dev-char/235:134: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:135 => /dev/nvidia-caps/nvidia-cap135"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap135 /host-dev-char/235:135: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:136 => /dev/nvidia-caps/nvidia-cap136"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap136 /host-dev-char/235:136: file exists"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:137 => /dev/nvidia-caps/nvidia-cap137"
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap137 /host-dev-char/235:137: file exists"


 - [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
 - [ ] Containerd configuration file: `cat /etc/k0s/containerd.toml`

[ada@eve ~]$ cat /etc/k0s/containerd.toml
version = 2

[plugins]

[plugins."io.containerd.grpc.v1.cri"]

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
        BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
        BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
        BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"

[plugins."io.containerd.runtime.v1.linux"] runtime = "nvidia" shim = "containerd-shim"

 - [ ] NVIDIA shared directory: `ls -la /run/nvidia`

[ada@eve ~]$ ls -la /run/nvidia
total 4
drwxr-xr-x  4 root root 100 Jun 13 21:18 .
drwxr-xr-x 29 root root 700 Jun 12 19:14 ..
drwxr-xr-x  2 root root  40 Jun 11 07:21 driver
-rw-r--r--  1 root root   8 Jun 13 21:18 toolkit.pid
drwxr-xr-x  2 root root  60 Jun 13 21:30 validations

 - [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`

[ada@eve ~]$ ls -la /usr/local/nvidia/toolkit
total 24288
drwxr-xr-x 3 root root    4096 Jun 13 21:18 .
drwxr-xr-x 3 root root    4096 Jun 13 21:18 ..
drwxr-xr-x 3 root root    4096 Jun 13 21:18 .config
lrwxrwxrwx 1 root root      32 Jun 13 21:18 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.13.0
-rw-r--r-- 1 root root 2959416 Jun 13 21:18 libnvidia-container-go.so.1.13.0
lrwxrwxrwx 1 root root      29 Jun 13 21:18 libnvidia-container.so.1 -> libnvidia-container.so.1.13.0
-rwxr-xr-x 1 root root  195856 Jun 13 21:18 libnvidia-container.so.1.13.0
-rwxr-xr-x 1 root root     154 Jun 13 21:18 nvidia-container-cli
-rwxr-xr-x 1 root root   47472 Jun 13 21:18 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Jun 13 21:18 nvidia-container-runtime
-rwxr-xr-x 1 root root     346 Jun 13 21:18 nvidia-container-runtime.cdi
-rwxr-xr-x 1 root root 3061448 Jun 13 21:18 nvidia-container-runtime.cdi.real
-rwxr-xr-x 1 root root     355 Jun 13 21:18 nvidia-container-runtime.experimental
-rwxr-xr-x 1 root root 3700568 Jun 13 21:18 nvidia-container-runtime.experimental.real
-rwxr-xr-x 1 root root     203 Jun 13 21:18 nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 2302152 Jun 13 21:18 nvidia-container-runtime-hook.real
-rwxr-xr-x 1 root root     349 Jun 13 21:18 nvidia-container-runtime.legacy
-rwxr-xr-x 1 root root 3061448 Jun 13 21:18 nvidia-container-runtime.legacy.real
-rwxr-xr-x 1 root root 3061448 Jun 13 21:18 nvidia-container-runtime.real
lrwxrwxrwx 1 root root      29 Jun 13 21:18 nvidia-container-toolkit -> nvidia-container-runtime-hook
-rwxr-xr-x 1 root root     100 Jun 13 21:18 nvidia-ctk
-rwxr-xr-x 1 root root 6421520 Jun 13 21:18 nvidia-ctk.real

 - [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`

[ada@eve ~]$ ls -la /run/nvidia/driver
total 0
drwxr-xr-x 2 root root  40 Jun 11 07:21 .
drwxr-xr-x 4 root root 100 Jun 13 21:18 ..


 - [ ] kubelet logs `journalctl -u kubelet > kubelet.logs`
I can't find where k0s configures the kubelet to emit logs at the moment, but I'll keep looking.
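(For what it's worth, k0s runs the kubelet inside its own supervised process, so its output should land in the journal of the k0s systemd unit; a hedged sketch assuming the default unit names:)

```sh
# controller+worker installs run everything under the k0scontroller unit,
# worker-only installs under k0sworker (unit names assumed from k0s defaults)
sudo journalctl -u k0scontroller | grep kubelet
sudo journalctl -u k0sworker | grep kubelet
```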

Looks similar to https://github.com/NVIDIA/gpu-operator/issues/531
shivamerla commented 1 year ago

@adamancini Those are warning messages indicating that the symlinks already exist. The container actually failing is toolkit-validation within that pod, with the error below:

      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown

We don't support the GPU Operator on Arch Linux. @elezar, do you know of any known issues with the container toolkit on Arch Linux? This is with the v1.13.0 toolkit.
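For context, the `ldcache error: open failed: /sbin/ldconfig.real` message suggests the container-toolkit config still points at the Debian/Ubuntu-style `ldconfig.real` binary, which does not exist on Arch. A hedged sketch of where to check (the config path is an assumption, depending on whether the host package or the toolkit container wrote it):

```sh
# toolkit-container install (path assumed from the install dir shown above in this issue);
# a host-package install would use /etc/nvidia-container-runtime/config.toml instead
grep ldconfig /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
# on Arch the host binary is /sbin/ldconfig, so the expected entry would be:
#   ldconfig = "@/sbin/ldconfig"
```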

shivamerla commented 1 year ago

@adamancini You also mention that the nvidia-container-toolkit is pre-installed on the node and that the toolkit container is disabled in the ArgoCD config, but I still see the toolkit container deployed and the containerd configuration set up accordingly (see the listing below and the install sketch after it).

gpu-operator     gpu-feature-discovery-gfh5j                                       0/1     Init:0/1                0               10m
gpu-operator     gpu-operator-6b8db67bfb-xvltr                                     1/1     Running                 0               10m
gpu-operator     nvidia-container-toolkit-daemonset-vpzbn                          1/1     Running                 0               10m. <-----
gpu-operator     nvidia-dcgm-exporter-sjmgn                                        0/1     Init:0/1                0               10m
gpu-operator     nvidia-device-plugin-daemonset-g8f54                              0/1     Init:0/1                0               10m
gpu-operator     nvidia-gpu-operator-node-feature-discovery-master-6fb7d946lk8gf   1/1     Running                 0               10m
gpu-operator     nvidia-gpu-operator-node-feature-discovery-worker-hpmkg           1/1     Running                 0               10m
gpu-operator     nvidia-operator-validator-hlbdw                                   0/1     Init:CrashLoopBackOff   6 (4m22s ago)   10m
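For comparison, a hedged sketch of installing the chart directly with both components disabled (release and namespace names reused from the report, not a command taken from this thread):

```sh
helm upgrade --install nvidia-gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.3.2 \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```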