karanveersingh5623 opened 1 year ago
@shivamerla below is the log from the init container:
[root@master88 ~]# kubectl -n gpu-operator logs nvidia-operator-validator-l2p97 toolkit-validation
Inconsistency detected by ld.so: dl-call-libc-early-init.c: 37: _dl_call_libc_early_init: Assertion `sym != NULL' failed!
@shivamerla here is another GPU node that is working:
[root@master88 ~]# kubectl -n gpu-operator logs pod/nvidia-operator-validator-pzftd toolkit-validation
Tue Sep 12 07:06:57 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:17:00.0 Off | 0 |
| N/A 36C P0 46W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:65:00.0 Off | 0 |
| N/A 37C P0 44W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... On | 00000000:CA:00.0 Off | 0 |
| N/A 40C P0 45W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... On | 00000000:E3:00.0 Off | 0 |
| N/A 38C P0 44W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
@shivamerla, I sent the must-gather logs by email, please check.
nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
nvcr.io/nvidia/k8s/container-toolkit:v1.13.0-ubuntu20.04
registry.k8s.io/nfd/node-feature-discovery:v0.12.1
nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
nvcr.io/nvidia/k8s-device-plugin:v0.14.0-ubi8
nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.2-ubuntu20.04
nvcr.io/nvidia/gpu-operator:v23.3.1
nvcr.io/nvidia/gpu-operator:devel-ubi8
nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.1
nvcr.io/nvidia/cloud-native/nvidia-fs:2.15.1
Hi Karan,
You mentioned this is a BCM cluster. One thing that could explain the issue you are seeing is a corrupt /var/lib/containerd directory. Is it possible that a grabimage was performed on the software image of the malfunctioning node (likely from a node other than node002 in your case)?
In the software image, please make sure that /var/lib/containerd/ is empty (the full path is something like /cm/images/<image_name>/var/lib/containerd). If it is not, please empty it and reprovision the broken node(s).
The reason I'm suggesting this is that the last time I saw that error message (Inconsistency detected by ld.so: dl-call-libc-early-init.c: 37: _dl_call_libc_early_init: Assertion `sym != NULL' failed!), this was the issue.
Ray
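(A minimal sketch of the check and cleanup Ray describes, assuming BCM's default /cm/images layout; <image_name> is a placeholder for your software image:)
# Check whether the image's containerd state directory is empty
ls -al /cm/images/<image_name>/var/lib/containerd
# If it is not, clear its contents (keep the directory itself), then reprovision the node
rm -rf /cm/images/<image_name>/var/lib/containerd/*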
@rayburgemeestre
I tried using an image that was working before, but I don't know what is happening. Please check the trace below.
[root@master88 gpu-operator-23.3.1]# kubectl get all -n gpu-operator -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/gpu-feature-discovery-2svjf 0/1 Init:0/1 0 1s <none> node003 <none> <none>
pod/gpu-feature-discovery-hjwsf 0/1 Pending 0 1s <none> node002 <none> <none>
pod/gpu-feature-discovery-j96kv 0/1 Pending 0 1s <none> node004 <none> <none>
pod/gpu-operator-8c64dfbbc-hfd89 1/1 Running 0 20s 172.29.107.152 node004 <none> <none>
pod/gpu-operator-node-feature-discovery-master-7db9bfdd5b-fztf4 1/1 Running 0 20s 172.29.107.142 node004 <none> <none>
pod/gpu-operator-node-feature-discovery-worker-6wx75 1/1 Running 0 20s 172.29.67.250 node003 <none> <none>
pod/gpu-operator-node-feature-discovery-worker-qqbfj 0/1 ContainerCreating 0 20s <none> node002 <none> <none>
pod/gpu-operator-node-feature-discovery-worker-tp2p7 1/1 Running 0 20s 172.29.107.167 node004 <none> <none>
pod/nvidia-container-toolkit-daemonset-rj5p8 0/1 Init:0/1 0 2s <none> node002 <none> <none>
pod/nvidia-container-toolkit-daemonset-z8lzx 0/1 Init:0/1 0 2s <none> node003 <none> <none>
pod/nvidia-container-toolkit-daemonset-zqb2q 0/1 Init:0/1 0 2s <none> node004 <none> <none>
pod/nvidia-dcgm-exporter-24wkh 0/1 Init:0/1 0 2s <none> node003 <none> <none>
pod/nvidia-dcgm-exporter-7ppw6 0/1 Init:0/1 0 2s <none> node002 <none> <none>
pod/nvidia-dcgm-exporter-gpz2s 0/1 Init:0/1 0 2s <none> node004 <none> <none>
pod/nvidia-device-plugin-daemonset-2bqgf 0/1 Init:0/1 0 2s <none> node003 <none> <none>
pod/nvidia-device-plugin-daemonset-5npqz 0/1 Init:0/1 0 2s <none> node002 <none> <none>
pod/nvidia-device-plugin-daemonset-s2jml 0/1 Init:0/1 0 2s <none> node004 <none> <none>
pod/nvidia-driver-daemonset-6hrbg 0/1 Init:0/1 0 2s <none> node002 <none> <none>
pod/nvidia-mig-manager-24ln2 0/1 Pending 0 1s <none> node004 <none> <none>
pod/nvidia-mig-manager-j8wtr 0/1 Pending 0 1s <none> node002 <none> <none>
pod/nvidia-mig-manager-x7nht 0/1 Pending 0 1s <none> node003 <none> <none>
pod/nvidia-operator-validator-2qllx 0/1 Init:0/4 0 2s <none> node002 <none> <none>
pod/nvidia-operator-validator-n72vf 0/1 Init:0/4 0 2s <none> node003 <none> <none>
pod/nvidia-operator-validator-szmvg 0/1 Init:0/4 0 2s <none> node004 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/gpu-operator ClusterIP 10.150.116.59 <none> 8080/TCP 2s app=gpu-operator
service/gpu-operator-node-feature-discovery-master ClusterIP 10.150.24.30 <none> 8080/TCP 20s app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master
service/nvidia-dcgm-exporter ClusterIP 10.150.22.18 <none> 9400/TCP 2s app=nvidia-dcgm-exporter
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
daemonset.apps/gpu-feature-discovery 3 3 0 3 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 2s gpu-feature-discovery nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8 app=gpu-feature-discovery,app.kubernetes.io/part-of=nvidia-gpu
daemonset.apps/gpu-operator-node-feature-discovery-worker 3 3 2 3 2 <none> 20s worker registry.k8s.io/nfd/node-feature-discovery:v0.12.1 app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=worker
daemonset.apps/nvidia-container-toolkit-daemonset 3 3 0 3 0 nvidia.com/gpu.deploy.container-toolkit=true 2s nvidia-container-toolkit-ctr nvcr.io/nvidia/k8s/container-toolkit:v1.13.0-ubuntu20.04 app=nvidia-container-toolkit-daemonset
daemonset.apps/nvidia-dcgm-exporter 3 3 0 3 0 nvidia.com/gpu.deploy.dcgm-exporter=true 2s nvidia-dcgm-exporter nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04 app=nvidia-dcgm-exporter
daemonset.apps/nvidia-device-plugin-daemonset 3 3 0 3 0 nvidia.com/gpu.deploy.device-plugin=true 2s nvidia-device-plugin nvcr.io/nvidia/k8s-device-plugin:v0.14.0-ubi8 app=nvidia-device-plugin-daemonset
daemonset.apps/nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 2s nvidia-driver-ctr nvcr.io/nvidia/driver:525.105.17-rocky8.6 app=nvidia-driver-daemonset
daemonset.apps/nvidia-mig-manager 3 3 0 3 0 nvidia.com/gpu.deploy.mig-manager=true 2s nvidia-mig-manager nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.2-ubuntu20.04 app=nvidia-mig-manager
daemonset.apps/nvidia-operator-validator 3 3 0 3 0 nvidia.com/gpu.deploy.operator-validator=true 2s nvidia-operator-validator nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1 app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/gpu-operator 1/1 1 1 20s gpu-operator nvcr.io/nvidia/gpu-operator:v23.3.1 app=gpu-operator,app.kubernetes.io/component=gpu-operator
deployment.apps/gpu-operator-node-feature-discovery-master 1/1 1 1 20s master registry.k8s.io/nfd/node-feature-discovery:v0.12.1 app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master
NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
replicaset.apps/gpu-operator-8c64dfbbc 1 1 1 20s gpu-operator nvcr.io/nvidia/gpu-operator:v23.3.1 app=gpu-operator,app.kubernetes.io/component=gpu-operator,pod-template-hash=8c64dfbbc
replicaset.apps/gpu-operator-node-feature-discovery-master-7db9bfdd5b 1 1 1 20s master registry.k8s.io/nfd/node-feature-discovery:v0.12.1 app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=7db9bfdd5b,role=master
[root@master88 gpu-operator-23.3.1]# kubectl describe pod gpu-operator-node-feature-discovery-worker-qqbfj -n gpu-operator
Name: gpu-operator-node-feature-discovery-worker-qqbfj
Namespace: gpu-operator
Priority: 0
Node: node002/192.168.61.92
Start Time: Mon, 18 Sep 2023 11:46:31 +0900
Labels: app.kubernetes.io/instance=gpu-operator
app.kubernetes.io/name=node-feature-discovery
controller-revision-hash=67b4854db8
pod-template-generation=1
role=worker
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/gpu-operator-node-feature-discovery-worker
Containers:
worker:
Container ID:
Image: registry.k8s.io/nfd/node-feature-discovery:v0.12.1
Image ID:
Port: <none>
Host Port: <none>
Command:
nfd-worker
Args:
--server=gpu-operator-node-feature-discovery-master:8080
-enable-nodefeature-api
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/kubernetes/node-feature-discovery from nfd-worker-conf (ro)
/etc/kubernetes/node-feature-discovery/features.d/ from features-d (ro)
/etc/kubernetes/node-feature-discovery/source.d/ from source-d (ro)
/host-boot from host-boot (ro)
/host-etc/os-release from host-os-release (ro)
/host-sys from host-sys (ro)
/host-usr/lib from host-usr-lib (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lj6vg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
host-boot:
Type: HostPath (bare host directory volume)
Path: /boot
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
host-usr-lib:
Type: HostPath (bare host directory volume)
Path: /usr/lib
HostPathType:
source-d:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/node-feature-discovery/source.d/
HostPathType:
features-d:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/node-feature-discovery/features.d/
HostPathType:
nfd-worker-conf:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gpu-operator-node-feature-discovery-worker-conf
Optional: false
kube-api-access-lj6vg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 53s default-scheduler Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-qqbfj to node002
Warning FailedCreatePodSandBox 32s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "registry.k8s.io/pause:3.6": failed to pull image "registry.k8s.io/pause:3.6": failed to pull and unpack image "registry.k8s.io/pause:3.6": failed to resolve reference "registry.k8s.io/pause:3.6": failed to do request: Head "https://registry.k8s.io/v2/pause/manifests/3.6": dial tcp: lookup registry.k8s.io on 192.168.61.88:53: read udp 192.168.61.92:55766->192.168.61.88:53: i/o timeout
Warning FailedCreatePodSandBox 12s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "registry.k8s.io/pause:3.6": failed to pull image "registry.k8s.io/pause:3.6": failed to pull and unpack image "registry.k8s.io/pause:3.6": failed to resolve reference "registry.k8s.io/pause:3.6": failed to do request: Head "https://registry.k8s.io/v2/pause/manifests/3.6": dial tcp: lookup registry.k8s.io on 192.168.61.88:53: server misbehaving
Warning DNSConfigForming 1s (x3 over 52s) kubelet Search Line limits were exceeded, some search paths have been omitted, the applied search line is: gpu-operator.svc.cluster.local svc.cluster.local cluster.local cm.cluster brightcomputing.com idrac.cluster
I tried using other nodes' images, but it is still not working. How is that possible?
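(As an aside, the FailedCreatePodSandBox events above show DNS lookups for registry.k8s.io failing against 192.168.61.88, so the sandbox image cannot be pulled on node002. A quick way to confirm from that node, assuming nslookup and containerd's ctr are available:)
# Test DNS resolution against the configured server
nslookup registry.k8s.io 192.168.61.88
# Try pulling the pause image directly through containerd
ctr -n k8s.io images pull registry.k8s.io/pause:3.6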
Hi Karan, could you provide the output of ls -al /cm/images/<image_name>/var/lib/containerd (for the relevant software image), to rule out the problem I was suspecting? You can look up the software image for the device you are focusing on (e.g., node002 in your case) with cmsh -c 'device; use node002; get softwareimage'.
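(Put together, a sketch of that lookup; the image name to use in the path is whatever the cmsh query returns:)
# Find the software image used by node002
cmsh -c 'device; use node002; get softwareimage'
# Inspect the containerd state directory inside that image
ls -al /cm/images/<image_name>/var/lib/containerd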
1. Quick Debug Information
OS/Version (e.g., RHEL 8.6, Ubuntu 22.04): Rocky Linux 8.6
Kernel Version: 4.18.0-372.9.1.el8.x86_64
Container Runtime Type/Version (e.g., Containerd, CRI-O, Docker): Containerd
K8s Flavor/Version (e.g., K8s, OCP, Rancher, GKE, EKS): K8s
[root@master88 ~]# kubectl get nodes
NAME       STATUS   ROLES                  AGE    VERSION
master88   Ready    control-plane,master   141d   v1.24.9
node002    Ready    worker                 140d   v1.24.9
node003    Ready    worker                 140d   v1.24.9
node004    Ready    worker                 140d   v1.24.9
GPU Operator Version:
[root@master88 ~]# helm list -n gpu-operator
NAME           NAMESPACE      REVISION   UPDATED                                   STATUS     CHART                  APP VERSION
gpu-operator   gpu-operator   1          2023-05-09 10:37:26.437913529 +0900 KST   deployed   gpu-operator-v23.3.1   v23.3.1
2. Issue or feature description
Installed NVIDIA GDS on one of the GPU compute nodes in a K8s cluster, then reinstalled the NVIDIA drivers, CUDA toolkit, etc. That node now crashes in the toolkit-validation init container with exit code 127, even though checking its logs shows all validations successful. The full trace is given below. The other GPU nodes are fine. I tried matching the CUDA driver and toolkit versions with the other compute nodes, but had no luck after restarting the nvidia-operator-validator daemonset. The K8s cluster is provisioned using NVIDIA's BCM (Bright Cluster Manager). It looks like a config issue, but there are so many moving parts; please let me know what I can look into and try.
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
Nvidia-GDS
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com
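(A sketch of collecting that bundle; the script location is assumed from the gpu-operator repository layout, so verify it before use:)
# Download and run the operator's must-gather script (URL assumed)
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh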