adamancini opened this issue 1 year ago
@adamancini Those are warning messages indicating that the symlinks already exist. The container that is actually failing is toolkit-validation within that pod, with the error below:
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown
We don't support GPU Operator on Arch Linux. @elezar, do you know of any known issues with the container toolkit on Arch Linux? This is with the v1.13.0 toolkit.
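For what it's worth, this particular ldcache failure usually comes down to the `ldconfig` path that `nvidia-container-cli` is configured with: the toolkit's default of `/sbin/ldconfig.real` exists on Ubuntu/Debian but not on Arch, where the binary is just `/sbin/ldconfig`. A minimal sketch of the relevant config section, assuming the host config at `/etc/nvidia-container-runtime/config.toml` (or the copy the toolkit container manages under `/usr/local/nvidia/toolkit/.config/`) is the one in effect:

```toml
# Sketch of the nvidia-container-runtime config (verify the actual path and
# contents on the node; this is not the exact file from this cluster).
[nvidia-container-cli]
# The leading "@" means the path is resolved on the host. Ubuntu/Debian ship
# /sbin/ldconfig.real, Arch only has /sbin/ldconfig, hence the
# "no such file or directory" error above.
ldconfig = "@/sbin/ldconfig"
```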
@adamancini you also mention that nvidia-container-toolkit is pre-installed on the node and that the toolkit container is disabled in the ArgoCD config, but I still see the toolkit container deployed and the containerd configuration set up accordingly (pod listing below, followed by a values sketch for reference):
gpu-operator gpu-feature-discovery-gfh5j 0/1 Init:0/1 0 10m
gpu-operator gpu-operator-6b8db67bfb-xvltr 1/1 Running 0 10m
gpu-operator nvidia-container-toolkit-daemonset-vpzbn 1/1 Running 0 10m <-----
gpu-operator nvidia-dcgm-exporter-sjmgn 0/1 Init:0/1 0 10m
gpu-operator nvidia-device-plugin-daemonset-g8f54 0/1 Init:0/1 0 10m
gpu-operator nvidia-gpu-operator-node-feature-discovery-master-6fb7d946lk8gf 1/1 Running 0 10m
gpu-operator nvidia-gpu-operator-node-feature-discovery-worker-hpmkg 1/1 Running 0 10m
gpu-operator nvidia-operator-validator-hlbdw 0/1 Init:CrashLoopBackOff 6 (4m22s ago) 10m
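For comparison, this is roughly how the toolkit (and the pre-installed driver) would be disabled in the chart values. This is only a sketch assuming the standard `gpu-operator` chart keys, not the exact ArgoCD config used here; the rendered ClusterPolicy below still shows `Toolkit: Enabled: true`, so it's worth confirming the override actually reaches the release.

```yaml
# Sketch of gpu-operator Helm values: "enabled" must sit directly under
# "toolkit", as a sibling of "env", for the toolkit daemonset to be skipped.
driver:
  enabled: false
toolkit:
  enabled: false
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/k0s/containerd.toml
    - name: CONTAINERD_SOCKET
      value: /var/run/k0s/containerd.sock
```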
1. Quick Debug Checklist

- Are i2c_core and ipmi_msghandler loaded on the nodes? no
- Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
``` Name: cluster-policy Namespace: Labels: app.kubernetes.io/component=gpu-operator app.kubernetes.io/instance=nvidia-gpu-operator app.kubernetes.io/managed-by=Helm app.kubernetes.io/name=gpu-operator app.kubernetes.io/version=v23.3.2 helm.sh/chart=gpu-operator-v23.3.2 Annotations: meta.helm.sh/release-name: nvidia-gpu-operator meta.helm.sh/release-namespace: gpu-operator API Version: nvidia.com/v1 Kind: ClusterPolicy Metadata: Creation Timestamp: 2023-06-12T19:13:48Z Generation: 2 Resource Version: 243638 UID: 2f934cb6-33ac-4190-93aa-62b4e6668843 Spec: Cdi: Default: false Enabled: false Daemonsets: Labels: app.kubernetes.io/managed-by: gpu-operator helm.sh/chart: gpu-operator-v23.3.2 Priority Class Name: system-node-critical Rolling Update: Max Unavailable: 1 Tolerations: Effect: NoSchedule Key: nvidia.com/gpu Operator: Exists Update Strategy: RollingUpdate Dcgm: Enabled: false Host Port: 5555 Image: dcgm Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: 3.1.7-1-ubuntu20.04 Dcgm Exporter: Enabled: true Env: Name: DCGM_EXPORTER_LISTEN Value: :9400 Name: DCGM_EXPORTER_KUBERNETES Value: true Name: DCGM_EXPORTER_COLLECTORS Value: /etc/dcgm-exporter/dcp-metrics-included.csv Image: dcgm-exporter Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/k8s Service Monitor: Additional Labels: Enabled: false Honor Labels: false Interval: 15s Version: 3.1.7-3.1.4-ubuntu20.04 Device Plugin: Enabled: true Env: Name: PASS_DEVICE_SPECS Value: true Name: FAIL_ON_INIT_ERROR Value: true Name: DEVICE_LIST_STRATEGY Value: envvar Name: DEVICE_ID_STRATEGY Value: uuid Name: NVIDIA_VISIBLE_DEVICES Value: all Name: NVIDIA_DRIVER_CAPABILITIES Value: all Image: k8s-device-plugin Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v0.14.0-ubi8 Driver: Cert Config: Name: Enabled: false Image: driver Image Pull Policy: IfNotPresent Kernel Module Config: Name: Licensing Config: Config Map Name: Nls Enabled: false Manager: Env: Name: ENABLE_GPU_POD_EVICTION Value: true Name: ENABLE_AUTO_DRAIN Value: false Name: DRAIN_USE_FORCE Value: false Name: DRAIN_POD_SELECTOR_LABEL Value: Name: DRAIN_TIMEOUT_SECONDS Value: 0s Name: DRAIN_DELETE_EMPTYDIR_DATA Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.6.1 Rdma: Enabled: false Use Host Mofed: false Repo Config: Config Map Name: Repository: nvcr.io/nvidia Startup Probe: Failure Threshold: 120 Initial Delay Seconds: 60 Period Seconds: 10 Timeout Seconds: 60 Upgrade Policy: Auto Upgrade: true Drain: Delete Empty Dir: false Enable: false Force: false Timeout Seconds: 300 Max Parallel Upgrades: 1 Max Unavailable: 25% Pod Deletion: Delete Empty Dir: false Force: false Timeout Seconds: 300 Wait For Completion: Timeout Seconds: 0 Use Precompiled: false Version: 525.105.17 Virtual Topology: Config: Gfd: Enabled: true Env: Name: GFD_SLEEP_INTERVAL Value: 60s Name: GFD_FAIL_ON_INIT_ERROR Value: true Image: gpu-feature-discovery Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v0.8.0-ubi8 Mig: Strategy: single Mig Manager: Config: Default: all-disabled Name: default-mig-parted-config Enabled: true Env: Name: WITH_REBOOT Value: false Gpu Clients Config: Name: Image: k8s-mig-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.5.2-ubuntu20.04 Node Status Exporter: Enabled: false Image: gpu-operator-validator Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v23.3.2 Operator: 
Default Runtime: docker Init Container: Image: cuda Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: 12.1.1-base-ubi8 Runtime Class: nvidia Psp: Enabled: false Sandbox Device Plugin: Enabled: true Image: kubevirt-gpu-device-plugin Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: v1.2.1 Sandbox Workloads: Default Workload: container Enabled: false Toolkit: Enabled: true Env: Name: CONTAINERD_CONFIG Value: /etc/k0s/containerd.toml Name: CONTAINERD_SOCKET Value: /var/run/k0s/containerd.sock Image: container-toolkit Image Pull Policy: IfNotPresent Install Dir: /usr/local/nvidia Repository: nvcr.io/nvidia/k8s Version: v1.13.0-ubuntu20.04 Validator: Image: gpu-operator-validator Image Pull Policy: IfNotPresent Plugin: Env: Name: WITH_WORKLOAD Value: true Repository: nvcr.io/nvidia/cloud-native Version: v23.3.2 Vfio Manager: Driver Manager: Env: Name: ENABLE_GPU_POD_EVICTION Value: false Name: ENABLE_AUTO_DRAIN Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.6.1 Enabled: true Image: cuda Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia Version: 12.1.1-base-ubi8 Vgpu Device Manager: Config: Default: default Name: Enabled: true Image: vgpu-device-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.2.1 Vgpu Manager: Driver Manager: Env: Name: ENABLE_GPU_POD_EVICTION Value: false Name: ENABLE_AUTO_DRAIN Value: false Image: k8s-driver-manager Image Pull Policy: IfNotPresent Repository: nvcr.io/nvidia/cloud-native Version: v0.6.1 Enabled: false Image: vgpu-manager Image Pull Policy: IfNotPresent Status: Namespace: gpu-operator State: notReady Events:
```

1. Issue or feature description
After installing the gpu-operator from Helm, the `nvidia-operator-validator` pod goes into a CrashLoopBackOff, reporting that it's unable to symlink host devices.

2. Steps to reproduce the issue
Installed the system from an Arch Linux installation. Installed the NVIDIA drivers and the `nvidia-container-toolkit` package from the AUR. Set up Kubernetes using k0sctl with the following manifest, which automatically installs the listed Helm charts during bootstrapping:
k0sctl manifest
``` apiVersion: k0sctl.k0sproject.io/v1beta1 kind: Cluster metadata: name: k0s-cluster spec: hosts: - ssh: address: 10.144.84.45 user: root port: 22 keyPath: /Users/ada/.ssh/id_ecdsa role: controller+worker noTaints: true uploadBinary: false installFlags: - --profile gpu-enabled files: - name: containerd-config src: bootstrap/containerd/containerd.toml dstDir: /etc/k0s/ perm: "0755" dirPerm: null k0s: version: 1.27.2+k0s.0 dynamicConfig: false config: spec: workerProfiles: - name: gpu-enabled values: cgroupDriver: systemd network: provider: calico extensions: helm: concurrencyLevel: 5 repositories: - name: stable url: https://charts.helm.sh/stable - name: prometheus-community url: https://prometheus-community.github.io/helm-charts - name: netdata url: https://netdata.github.io/helmchart/ - name: sealed-secrets url: https://bitnami-labs.github.io/sealed-secrets - name: gitlab url: https://charts.gitlab.io/ - name: codimd url: https://helm.codimd.dev - name: bitnami url: https://charts.bitnami.com/bitnami - name: traefik url: https://traefik.github.io/charts - name: intel url: https://intel.github.io/helm-charts - name: nvidia url: https://helm.ngc.nvidia.com/nvidia - name: argo url: https://argoproj.github.io/argo-helm charts: - name: argo-cd chartname: argo/argo-cd namespace: argocd values: | configs: knownHosts: data: ssh_known_hosts: | github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg= github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl - name: prometheus-stack chartname: prometheus-community/prometheus version: "14.6.1" # timeout: 20m # order: 1 values: | alertmanager: persistentVolume: enabled: false server: persistentVolume: enabled: false namespace: monitoring - name: traefik chartname: traefik/traefik version: "23.1.0" # timeout: 10m # order: 2 values: | ingressClass: enabled: true isDefaultClass: true ingressRoute: dashboard: enabled: true providers: kubernetesCRD: enabled: true kubernetesIngress: enabled: true logs: general: level: DEBUG additionalArguments: - "--entrypoints.websecure.http.tls" - "--entrypoints.plex-pms.Address=:32400" - "--entrypoints.gitlab-ssh.Address=:2222/tcp" - "--providers.kubernetesIngress.ingressClass=traefik" - "--ping" - "--metrics.prometheus" - "--log.level=DEBUG" ports: traefik: port: 9000 web: port: 8000 expose: true exposedPort: 80 protocol: TCP websecure: port: 8443 expose: true exposedPort: 443 protocol: TCP tls: enabled: true metrics: port: 9100 service: enabled: true type: NodePort namespace: traefik - name: sealed-secrets chartname: sealed-secrets/sealed-secrets namespace: sealed-secrets # order: 2 # timeout: 10m - name: nvidia-gpu-operator chartname: nvidia/gpu-operator version: "v23.3.2" # timeout: 10m # order: 1 namespace: gpu-operator values: | driver: enabled: false toolkit: env: - name: CONTAINERD_CONFIG value: /etc/k0s/containerd.toml - name: CONTAINERD_SOCKET value: /var/run/k0s/containerd.sock enabled: false ```the GPU0
3. Information to attach (optional if deemed irrelevant)
kubectl get ds --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 11m
gpu-operator nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 11m
gpu-operator nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 11m
gpu-operator nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 11m
gpu-operator nvidia-gpu-operator-node-feature-discovery-worker 1 1 1 1 1 11m
gpu-operator nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 11m
gpu-operator nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 11m
kube-system calico-node 1 1 1 1 1 kubernetes.io/os=linux 27h
kube-system konnectivity-agent 1 1 1 1 1 kubernetes.io/os=linux 27h
kube-system kube-proxy 1 1 1 1 1 kubernetes.io/os=linux 27h
monitoring prometheus-stack-node-exporter 1 1 1 1 1 27h
Name: nvidia-operator-validator-hlbdw
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-operator-validator
Node: eve.annarchy.net/10.0.0.190
Start Time: Tue, 13 Jun 2023 17:17:56 -0400
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=594474b5cc
helm.sh/chart=gpu-operator-v23.3.2
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 8f8a5e4262d6e4b6d614b44ea10dca6009f3f16bd2eb9e28cc80c7682e5a883b
cni.projectcalico.org/podIP: 10.244.109.115/32
cni.projectcalico.org/podIPs: 10.244.109.115/32
Status: Pending
IP: 10.244.109.115
IPs:
IP: 10.244.109.115
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: containerd://eb6e4ff67d265a56026b04d6dfd5c4d73c97fae7910bf3ee0a4bec825bdd9c1d
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:21dfc9c56b5f8bce73e60361d6e83759c3fa14dc6afc2d5ebdf1b891a936daf6
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 13 Jun 2023 17:18:10 -0400
Finished: Tue, 13 Jun 2023 17:18:10 -0400
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro)
toolkit-validation:
Container ID: containerd://f6afbd9cf3178db48d0f2eb14ccc6a3277cf5982c5bec4eac07877ccec5bf7fe
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:21dfc9c56b5f8bce73e60361d6e83759c3fa14dc6afc2d5ebdf1b891a936daf6
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown
Exit Code: 128
Started: Wed, 31 Dec 1969 19:00:00 -0500
Finished: Tue, 13 Jun 2023 17:29:14 -0400
Ready: False
Restart Count: 7
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2
Image ID:
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2
Image ID:
Port:
Host Port:
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: true
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2
Image ID:
Port:
Host Port:
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kvx28 (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-kvx28:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
Normal Scheduled 11m default-scheduler Successfully assigned gpu-operator/nvidia-operator-validator-hlbdw to eve.annarchy.net
Warning FailedCreatePodSandBox 11m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Normal Pulled 11m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2" already present on machine
Normal Created 11m kubelet Created container driver-validation
Normal Started 11m kubelet Started container driver-validation
Normal Created 10m (x4 over 11m) kubelet Created container toolkit-validation
Warning Failed 10m (x4 over 11m) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown
Normal Pulled 9m58s (x5 over 11m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.2" already present on machine
Warning BackOff 101s (x47 over 11m) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-hlbdw_gpu-operator(f498a2ff-8344-426e-83bd-ca27ec548856)
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Detected pre-installed driver on the host"
nvidia-operator-validator-g5lnf driver-validation running command chroot with args [/host nvidia-smi]
nvidia-operator-validator-g5lnf driver-validation Tue Jun 13 21:30:56 2023
nvidia-operator-validator-g5lnf driver-validation +---------------------------------------------------------------------------------------+ nvidia-operator-validator-g5lnf driver-validation | NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 | nvidia-operator-validator-g5lnf driver-validation |-----------------------------------------+----------------------+----------------------+ nvidia-operator-validator-g5lnf driver-validation | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | nvidia-operator-validator-g5lnf driver-validation | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | nvidia-operator-validator-g5lnf driver-validation | | | MIG M. | nvidia-operator-validator-g5lnf driver-validation |=========================================+======================+======================| nvidia-operator-validator-g5lnf driver-validation | 0 NVIDIA GeForce GTX 1070 Off| 00000000:01:00.0 Off | N/A | nvidia-operator-validator-g5lnf driver-validation | 15% 44C P5 17W / 166W| 0MiB / 8192MiB | 2% Default | nvidia-operator-validator-g5lnf driver-validation | | | N/A | nvidia-operator-validator-g5lnf driver-validation +-----------------------------------------+----------------------+----------------------+ nvidia-operator-validator-g5lnf driver-validation
nvidia-operator-validator-g5lnf driver-validation +---------------------------------------------------------------------------------------+ nvidia-operator-validator-g5lnf driver-validation | Processes: | nvidia-operator-validator-g5lnf driver-validation | GPU GI CI PID Type Process name GPU Memory | nvidia-operator-validator-g5lnf driver-validation | ID ID Usage | nvidia-operator-validator-g5lnf driver-validation |=======================================================================================| nvidia-operator-validator-g5lnf driver-validation | No running processes found | nvidia-operator-validator-g5lnf driver-validation +---------------------------------------------------------------------------------------+ nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/195:254 => /dev/nvidia-modeset" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-modeset /host-dev-char/195:254: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/195:255 => /dev/nvidiactl" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidiactl /host-dev-char/195:255: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/510:0 => /dev/nvidia-uvm" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-uvm /host-dev-char/510:0: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/510:1 => /dev/nvidia-uvm-tools" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-uvm-tools /host-dev-char/510:1: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:1 => /dev/nvidia-caps/nvidia-cap1" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap1 /host-dev-char/235:1: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:2 => /dev/nvidia-caps/nvidia-cap2" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap2 /host-dev-char/235:2: file exists" ... 
nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:133 => /dev/nvidia-caps/nvidia-cap133" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap133 /host-dev-char/235:133: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:134 => /dev/nvidia-caps/nvidia-cap134" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap134 /host-dev-char/235:134: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:135 => /dev/nvidia-caps/nvidia-cap135" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap135 /host-dev-char/235:135: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:136 => /dev/nvidia-caps/nvidia-cap136" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap136 /host-dev-char/235:136: file exists" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=info msg="Creating link /host-dev-char/235:137 => /dev/nvidia-caps/nvidia-cap137" nvidia-operator-validator-g5lnf driver-validation time="2023-06-13T21:30:56Z" level=warning msg="Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap137 /host-dev-char/235:137: file exists"
[ada@eve ~]$ cat /etc/k0s/containerd.toml
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.runtime.v1.linux"]
runtime = "nvidia"
shim = "containerd-shim"
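For context, the `[plugins."io.containerd.runtime.v1.linux"]` table above is the legacy v1 runtime plugin; with `version = 2`, CRI normally picks up the runtime from the `cri` plugin section instead. Below is a rough sketch of the kind of runtime registration the toolkit container typically generates; the `BinaryName` path is an assumption based on the `/usr/local/nvidia/toolkit` install dir shown in the listing that follows, not a copy of this node's actual config.

```toml
# Sketch of a containerd v2-style nvidia runtime registration (assumed values)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```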
[ada@eve ~]$ ls -la /run/nvidia
total 4
drwxr-xr-x 4 root root 100 Jun 13 21:18 .
drwxr-xr-x 29 root root 700 Jun 12 19:14 ..
drwxr-xr-x 2 root root 40 Jun 11 07:21 driver
-rw-r--r-- 1 root root 8 Jun 13 21:18 toolkit.pid
drwxr-xr-x 2 root root 60 Jun 13 21:30 validations
[ada@eve ~]$ ls -la /usr/local/nvidia/toolkit
total 24288
drwxr-xr-x 3 root root 4096 Jun 13 21:18 .
drwxr-xr-x 3 root root 4096 Jun 13 21:18 ..
drwxr-xr-x 3 root root 4096 Jun 13 21:18 .config
lrwxrwxrwx 1 root root 32 Jun 13 21:18 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.13.0
-rw-r--r-- 1 root root 2959416 Jun 13 21:18 libnvidia-container-go.so.1.13.0
lrwxrwxrwx 1 root root 29 Jun 13 21:18 libnvidia-container.so.1 -> libnvidia-container.so.1.13.0
-rwxr-xr-x 1 root root 195856 Jun 13 21:18 libnvidia-container.so.1.13.0
-rwxr-xr-x 1 root root 154 Jun 13 21:18 nvidia-container-cli
-rwxr-xr-x 1 root root 47472 Jun 13 21:18 nvidia-container-cli.real
-rwxr-xr-x 1 root root 342 Jun 13 21:18 nvidia-container-runtime
-rwxr-xr-x 1 root root 346 Jun 13 21:18 nvidia-container-runtime.cdi
-rwxr-xr-x 1 root root 3061448 Jun 13 21:18 nvidia-container-runtime.cdi.real
-rwxr-xr-x 1 root root 355 Jun 13 21:18 nvidia-container-runtime.experimental
-rwxr-xr-x 1 root root 3700568 Jun 13 21:18 nvidia-container-runtime.experimental.real
-rwxr-xr-x 1 root root 203 Jun 13 21:18 nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 2302152 Jun 13 21:18 nvidia-container-runtime-hook.real
-rwxr-xr-x 1 root root 349 Jun 13 21:18 nvidia-container-runtime.legacy
-rwxr-xr-x 1 root root 3061448 Jun 13 21:18 nvidia-container-runtime.legacy.real
-rwxr-xr-x 1 root root 3061448 Jun 13 21:18 nvidia-container-runtime.real
lrwxrwxrwx 1 root root 29 Jun 13 21:18 nvidia-container-toolkit -> nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 100 Jun 13 21:18 nvidia-ctk
-rwxr-xr-x 1 root root 6421520 Jun 13 21:18 nvidia-ctk.real
[ada@eve ~]$ ls -la /run/nvidia/driver
total 0
drwxr-xr-x 2 root root 40 Jun 11 07:21 .
drwxr-xr-x 4 root root 100 Jun 13 21:18 ..