NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

toolkit-validation CrashLoopBackOff with error 127 in DaemonSet/nvidia-operator-validator #579

Open karanveersingh5623 opened 1 year ago

karanveersingh5623 commented 1 year ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior. I installed NVIDIA GDS on one of the GPU compute nodes in a k8s cluster and reinstalled the NVIDIA drivers, CUDA toolkit, etc. On that node the toolkit-validation init container is now crashing with exit code 127, even though the logs report that all validations are successful. The full trace is below. The other GPU nodes are fine. I tried matching the CUDA driver and toolkit versions with the other compute nodes and restarting the nvidia-operator-validator DaemonSet, but no luck. The K8s cluster is provisioned with NVIDIA's BCM (Bright Cluster Manager). It looks like a configuration issue, but there are so many moving parts; please let me know what I can look into and try.
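
For reference, the DaemonSet restart and per-container log check mentioned above look roughly like this (a sketch; the pod name is a placeholder and will differ per cluster):

kubectl -n gpu-operator rollout restart daemonset/nvidia-operator-validator
kubectl -n gpu-operator get pods -l app=nvidia-operator-validator -o wide
kubectl -n gpu-operator logs <validator-pod-name> -c toolkit-validation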

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

4. Information to attach (optional if deemed irrelevant)

[root@master88 ~]# kubectl get all -n gpu-operator -o wide
NAME                                                              READY   STATUS                  RESTARTS          AGE    IP               NODE      NOMINATED NODE   READINESS GATES
pod/gpu-feature-discovery-4qpw9                                   1/1     Running                 1 (123d ago)      126d   172.29.67.209    node003   <none>           <none>
pod/gpu-feature-discovery-kzlms                                   1/1     Running                 2 (40d ago)       41d    172.29.107.179   node004   <none>           <none>
pod/gpu-feature-discovery-qr2k4                                   0/1     Init:0/1                5                 126d   172.29.112.188   node002   <none>           <none>
pod/gpu-operator-689dbf694b-frr75                                 1/1     Running                 1 (20d ago)       21d    172.29.107.191   node004   <none>           <none>
pod/gpu-operator-node-feature-discovery-master-7db9bfdd5b-tbx9t   1/1     Running                 0                 57d    172.29.67.217    node003   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-5swvb              1/1     Running                 21 (40d ago)      126d   172.29.107.169   node004   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-gl9jt              1/1     Running                 250 (14h ago)     126d   172.29.112.182   node002   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-px9qs              1/1     Running                 2 (123d ago)      126d   172.29.67.205    node003   <none>           <none>
pod/nvidia-container-toolkit-daemonset-pvh7c                      1/1     Running                 10 (40d ago)      126d   172.29.107.170   node004   <none>           <none>
pod/nvidia-container-toolkit-daemonset-qv2qz                      1/1     Running                 7 (14h ago)       126d   172.29.112.184   node002   <none>           <none>
pod/nvidia-container-toolkit-daemonset-sg7d4                      1/1     Running                 1 (123d ago)      126d   172.29.67.206    node003   <none>           <none>
pod/nvidia-cuda-validator-5g97v                                   0/1     Completed               0                 123d   172.29.67.213    node003   <none>           <none>
pod/nvidia-cuda-validator-grmsq                                   0/1     Completed               0                 17h    172.29.107.162   node004   <none>           <none>
pod/nvidia-dcgm-exporter-6q85r                                    0/1     Init:0/1                3                 126d   172.29.112.185   node002   <none>           <none>
pod/nvidia-dcgm-exporter-79qbc                                    1/1     Running                 2 (40d ago)       41d    172.29.107.181   node004   <none>           <none>
pod/nvidia-dcgm-exporter-wwmg8                                    1/1     Running                 1 (123d ago)      126d   172.29.67.211    node003   <none>           <none>
pod/nvidia-device-plugin-daemonset-6b5sh                          0/1     Init:0/1                5                 126d   172.29.112.186   node002   <none>           <none>
pod/nvidia-device-plugin-daemonset-wllg9                          1/1     Running                 1 (123d ago)      126d   172.29.67.210    node003   <none>           <none>
pod/nvidia-device-plugin-daemonset-zzlrn                          1/1     Running                 8 (40d ago)       41d    172.29.107.176   node004   <none>           <none>
pod/nvidia-device-plugin-validator-dzbmk                          0/1     Completed               0                 17h    172.29.107.143   node004   <none>           <none>
pod/nvidia-device-plugin-validator-nk5hs                          0/1     Completed               0                 123d   172.29.67.214    node003   <none>           <none>
pod/nvidia-mig-manager-csp5q                                      1/1     Running                 10 (40d ago)      126d   172.29.107.178   node004   <none>           <none>
pod/nvidia-mig-manager-k9fg7                                      1/1     Running                 1 (123d ago)      126d   172.29.67.207    node003   <none>           <none>
pod/nvidia-mig-manager-sgjrv                                      0/1     Init:0/1                3                 126d   172.29.112.187   node002   <none>           <none>
pod/nvidia-operator-validator-df5wt                               1/1     Running                 1 (123d ago)      126d   172.29.67.208    node003   <none>           <none>
pod/nvidia-operator-validator-l2p97                               0/1     Init:CrashLoopBackOff   171 (2m33s ago)   17h    172.29.112.183   node002   <none>           <none>
pod/nvidia-operator-validator-pzftd                               1/1     Running                 0                 17h    172.29.107.157   node004   <none>           <none>

NAME                                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE    SELECTOR
service/gpu-operator                                 ClusterIP   10.150.193.177   <none>        8080/TCP   126d   app=gpu-operator
service/gpu-operator-node-feature-discovery-master   ClusterIP   10.150.104.162   <none>        8080/TCP   126d   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master
service/nvidia-dcgm-exporter                         ClusterIP   10.150.88.253    <none>        9400/TCP   126d   app=nvidia-dcgm-exporter

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE    CONTAINERS                     IMAGES                                                           SELECTOR
daemonset.apps/gpu-feature-discovery                        3         3         2       3            2           nvidia.com/gpu.deploy.gpu-feature-discovery=true   126d   gpu-feature-discovery          nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8                 app=gpu-feature-discovery,app.kubernetes.io/part-of=nvidia-gpu
daemonset.apps/gpu-operator-node-feature-discovery-worker   3         3         3       3            3           <none>                                             126d   worker                         registry.k8s.io/nfd/node-feature-discovery:v0.12.1               app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=worker
daemonset.apps/nvidia-container-toolkit-daemonset           3         3         3       3            3           nvidia.com/gpu.deploy.container-toolkit=true       126d   nvidia-container-toolkit-ctr   nvcr.io/nvidia/k8s/container-toolkit:v1.13.0-ubuntu20.04         app=nvidia-container-toolkit-daemonset
daemonset.apps/nvidia-dcgm-exporter                         3         3         2       3            2           nvidia.com/gpu.deploy.dcgm-exporter=true           126d   nvidia-dcgm-exporter           nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04         app=nvidia-dcgm-exporter
daemonset.apps/nvidia-device-plugin-daemonset               3         3         2       3            2           nvidia.com/gpu.deploy.device-plugin=true           126d   nvidia-device-plugin           nvcr.io/nvidia/k8s-device-plugin:v0.14.0-ubi8                    app=nvidia-device-plugin-daemonset
daemonset.apps/nvidia-mig-manager                           3         3         2       3            2           nvidia.com/gpu.deploy.mig-manager=true             126d   nvidia-mig-manager             nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.2-ubuntu20.04   app=nvidia-mig-manager
daemonset.apps/nvidia-operator-validator                    3         3         2       2            2           nvidia.com/gpu.deploy.operator-validator=true      126d   nvidia-operator-validator      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1       app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS     IMAGES                                               SELECTOR
deployment.apps/gpu-operator                                 1/1     1            1           126d   gpu-operator   nvcr.io/nvidia/gpu-operator:v23.3.1                  app=gpu-operator,app.kubernetes.io/component=gpu-operator
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           126d   master         registry.k8s.io/nfd/node-feature-discovery:v0.12.1   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master

NAME                                                                    DESIRED   CURRENT   READY   AGE    CONTAINERS     IMAGES                                               SELECTOR
replicaset.apps/gpu-operator-689dbf694b                                 1         1         1       126d   gpu-operator   nvcr.io/nvidia/gpu-operator:v23.3.1                  app=gpu-operator,app.kubernetes.io/component=gpu-operator,pod-template-hash=689dbf694b
replicaset.apps/gpu-operator-node-feature-discovery-master-7db9bfdd5b   1         1         1       126d   master         registry.k8s.io/nfd/node-feature-discovery:v0.12.1   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=7db9bfdd5b,role=master
[root@master88 ~]# kubectl describe pod nvidia-operator-validator-l2p97 -n gpu-operator
Name:                 nvidia-operator-validator-l2p97
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 node002/192.168.61.92
Start Time:           Tue, 12 Sep 2023 16:06:24 +0900
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=84b69d5dbb
                      helm.sh/chart=gpu-operator-v23.3.1
                      pod-template-generation=2
Annotations:          cni.projectcalico.org/containerID: a6baa38b48991dac3a5735799cafe4fa88698f537b4999bc2f79e24537f3523e
                      cni.projectcalico.org/podIP: 172.29.112.183/32
                      cni.projectcalico.org/podIPs: 172.29.112.183/32
                      kubectl.kubernetes.io/restartedAt: 2023-09-12T16:06:23+09:00
Status:               Pending
IP:                   172.29.112.183
IPs:
  IP:           172.29.112.183
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  containerd://7123c29f431c964ef92782ff40fe8589aba5474dfe831009d4f612856273ded8
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:72b7988ef65359feb5c7b77f2c1d0c4060e40686a244d6a622f9fd085cdb11ec
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Sep 2023 19:16:30 +0900
      Finished:     Tue, 12 Sep 2023 19:16:41 +0900
    Ready:          True
    Restart Count:  1
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2qmnj (ro)
  toolkit-validation:
    Container ID:  containerd://84eecbcafc650de9e6d7ce0aac27368dae0ec744a5c1425fb390826d706b10c6
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:72b7988ef65359feb5c7b77f2c1d0c4060e40686a244d6a622f9fd085cdb11ec
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    127
      Started:      Wed, 13 Sep 2023 09:44:17 +0900
      Finished:     Wed, 13 Sep 2023 09:44:17 +0900
    Ready:          False
    Restart Count:  170
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2qmnj (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2qmnj (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                true
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2qmnj (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2qmnj (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-2qmnj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                     From     Message
  ----     ------            ----                    ----     -------
  Normal   Started           34m (x163 over 14h)     kubelet  Started container toolkit-validation
  Warning  DNSConfigForming  4m39s (x4055 over 14h)  kubelet  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: gpu-operator.svc.cluster.local svc.cluster.local cluster.local cm.cluster brightcomputing.com idrac.cluster
[root@master88 ~]#
[root@master88 ~]# kubectl logs nvidia-operator-validator-l2p97 -n gpu-operator
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-l2p97" is waiting to start: PodInitializing
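
Note: without naming a container, kubectl defaults to the main nvidia-operator-validator container, which is still waiting to start; the failing init container's log has to be requested explicitly, for example:

kubectl -n gpu-operator logs nvidia-operator-validator-l2p97 -c toolkit-validation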
[root@node002 ~]# nvidia-smi
Wed Sep 13 10:09:58 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   39C    P0    64W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:65:00.0 Off |                    0 |
| N/A   39C    P0    65W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   37C    P0    66W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000000:E3:00.0 Off |                  Off |
| N/A   39C    P0    65W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

NVIDIA GDS

[root@node002 ~]# /usr/local/cuda-11.7/gds/tools/gdscheck.py -p
 GDS release version: 1.3.1.18
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Supported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.lustre.rdma_dev_addr_list : 192.168.61.92
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

karanveersingh5623 commented 1 year ago

@shivamerla below is the log from the init container:

[root@master88 ~]# kubectl -n gpu-operator logs nvidia-operator-validator-l2p97 toolkit-validation
Inconsistency detected by ld.so: dl-call-libc-early-init.c: 37: _dl_call_libc_early_init: Assertion `sym != NULL' failed!

karanveersingh5623 commented 1 year ago

@shivamerla for comparison, here is the same log from another GPU node that is working:

[root@master88 ~]# kubectl -n gpu-operator logs pod/nvidia-operator-validator-pzftd toolkit-validation
Tue Sep 12 07:06:57 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   36C    P0    46W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   40C    P0    45W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:E3:00.0 Off |                    0 |
| N/A   38C    P0    44W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
karanveersingh5623 commented 1 year ago

@shivamerla, I sent the must-gather logs by email, please check.

nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8

nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1

nvcr.io/nvidia/k8s/container-toolkit:v1.13.0-ubuntu20.04

registry.k8s.io/nfd/node-feature-discovery:v0.12.1

nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04

nvcr.io/nvidia/k8s-device-plugin:v0.14.0-ubi8

nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.2-ubuntu20.04

nvcr.io/nvidia/gpu-operator:v23.3.1

nvcr.io/nvidia/gpu-operator:devel-ubi8

nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.1

nvcr.io/nvidia/cloud-native/nvidia-fs:2.15.1

rayburgemeestre commented 1 year ago

Hi Karan,

You mentioned this is a BCM cluster; one thing that could explain the issue you are seeing is a corrupt /var/lib/containerd directory. Is it possible that a grabimage was performed on the software image of the malfunctioning node (likely from a node other than node002 in your case)?

In the software image, please make sure that /var/lib/containerd/ is empty (full path something like /cm/images/<image_name>/var/lib/containerd). If it is not, please empty it and reprovision the broken node(s).
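
On the BCM head node that could look roughly like this (a sketch; <image_name> stands for the software image used by node002):

ls -al /cm/images/<image_name>/var/lib/containerd     # should be empty in the software image
rm -rf /cm/images/<image_name>/var/lib/containerd/*   # if not, clear the leaked containerd state
# then reprovision node002 so the cleaned image is synced back to the node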

The reason I'm suggesting this is that the last time I saw that error message (Inconsistency detected by ld.so: dl-call-libc-early-init.c: 37: _dl_call_libc_early_init: Assertion `sym != NULL' failed!), this was the issue.

Ray

karanveersingh5623 commented 1 year ago

@rayburgemeestre

I tried using an image that was working before, but I don't know what is happening. Please check the trace below.

[root@master88 gpu-operator-23.3.1]# kubectl get all -n gpu-operator -o wide
NAME                                                              READY   STATUS              RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
pod/gpu-feature-discovery-2svjf                                   0/1     Init:0/1            0          1s    <none>           node003   <none>           <none>
pod/gpu-feature-discovery-hjwsf                                   0/1     Pending             0          1s    <none>           node002   <none>           <none>
pod/gpu-feature-discovery-j96kv                                   0/1     Pending             0          1s    <none>           node004   <none>           <none>
pod/gpu-operator-8c64dfbbc-hfd89                                  1/1     Running             0          20s   172.29.107.152   node004   <none>           <none>
pod/gpu-operator-node-feature-discovery-master-7db9bfdd5b-fztf4   1/1     Running             0          20s   172.29.107.142   node004   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-6wx75              1/1     Running             0          20s   172.29.67.250    node003   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-qqbfj              0/1     ContainerCreating   0          20s   <none>           node002   <none>           <none>
pod/gpu-operator-node-feature-discovery-worker-tp2p7              1/1     Running             0          20s   172.29.107.167   node004   <none>           <none>
pod/nvidia-container-toolkit-daemonset-rj5p8                      0/1     Init:0/1            0          2s    <none>           node002   <none>           <none>
pod/nvidia-container-toolkit-daemonset-z8lzx                      0/1     Init:0/1            0          2s    <none>           node003   <none>           <none>
pod/nvidia-container-toolkit-daemonset-zqb2q                      0/1     Init:0/1            0          2s    <none>           node004   <none>           <none>
pod/nvidia-dcgm-exporter-24wkh                                    0/1     Init:0/1            0          2s    <none>           node003   <none>           <none>
pod/nvidia-dcgm-exporter-7ppw6                                    0/1     Init:0/1            0          2s    <none>           node002   <none>           <none>
pod/nvidia-dcgm-exporter-gpz2s                                    0/1     Init:0/1            0          2s    <none>           node004   <none>           <none>
pod/nvidia-device-plugin-daemonset-2bqgf                          0/1     Init:0/1            0          2s    <none>           node003   <none>           <none>
pod/nvidia-device-plugin-daemonset-5npqz                          0/1     Init:0/1            0          2s    <none>           node002   <none>           <none>
pod/nvidia-device-plugin-daemonset-s2jml                          0/1     Init:0/1            0          2s    <none>           node004   <none>           <none>
pod/nvidia-driver-daemonset-6hrbg                                 0/1     Init:0/1            0          2s    <none>           node002   <none>           <none>
pod/nvidia-mig-manager-24ln2                                      0/1     Pending             0          1s    <none>           node004   <none>           <none>
pod/nvidia-mig-manager-j8wtr                                      0/1     Pending             0          1s    <none>           node002   <none>           <none>
pod/nvidia-mig-manager-x7nht                                      0/1     Pending             0          1s    <none>           node003   <none>           <none>
pod/nvidia-operator-validator-2qllx                               0/1     Init:0/4            0          2s    <none>           node002   <none>           <none>
pod/nvidia-operator-validator-n72vf                               0/1     Init:0/4            0          2s    <none>           node003   <none>           <none>
pod/nvidia-operator-validator-szmvg                               0/1     Init:0/4            0          2s    <none>           node004   <none>           <none>

NAME                                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE   SELECTOR
service/gpu-operator                                 ClusterIP   10.150.116.59   <none>        8080/TCP   2s    app=gpu-operator
service/gpu-operator-node-feature-discovery-master   ClusterIP   10.150.24.30    <none>        8080/TCP   20s   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master
service/nvidia-dcgm-exporter                         ClusterIP   10.150.22.18    <none>        9400/TCP   2s    app=nvidia-dcgm-exporter

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE   CONTAINERS                     IMAGES                                                           SELECTOR
daemonset.apps/gpu-feature-discovery                        3         3         0       3            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   2s    gpu-feature-discovery          nvcr.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8                 app=gpu-feature-discovery,app.kubernetes.io/part-of=nvidia-gpu
daemonset.apps/gpu-operator-node-feature-discovery-worker   3         3         2       3            2           <none>                                             20s   worker                         registry.k8s.io/nfd/node-feature-discovery:v0.12.1               app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=worker
daemonset.apps/nvidia-container-toolkit-daemonset           3         3         0       3            0           nvidia.com/gpu.deploy.container-toolkit=true       2s    nvidia-container-toolkit-ctr   nvcr.io/nvidia/k8s/container-toolkit:v1.13.0-ubuntu20.04         app=nvidia-container-toolkit-daemonset
daemonset.apps/nvidia-dcgm-exporter                         3         3         0       3            0           nvidia.com/gpu.deploy.dcgm-exporter=true           2s    nvidia-dcgm-exporter           nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04         app=nvidia-dcgm-exporter
daemonset.apps/nvidia-device-plugin-daemonset               3         3         0       3            0           nvidia.com/gpu.deploy.device-plugin=true           2s    nvidia-device-plugin           nvcr.io/nvidia/k8s-device-plugin:v0.14.0-ubi8                    app=nvidia-device-plugin-daemonset
daemonset.apps/nvidia-driver-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  2s    nvidia-driver-ctr              nvcr.io/nvidia/driver:525.105.17-rocky8.6                        app=nvidia-driver-daemonset
daemonset.apps/nvidia-mig-manager                           3         3         0       3            0           nvidia.com/gpu.deploy.mig-manager=true             2s    nvidia-mig-manager             nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.2-ubuntu20.04   app=nvidia-mig-manager
daemonset.apps/nvidia-operator-validator                    3         3         0       3            0           nvidia.com/gpu.deploy.operator-validator=true      2s    nvidia-operator-validator      nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1       app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS     IMAGES                                               SELECTOR
deployment.apps/gpu-operator                                 1/1     1            1           20s   gpu-operator   nvcr.io/nvidia/gpu-operator:v23.3.1                  app=gpu-operator,app.kubernetes.io/component=gpu-operator
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           20s   master         registry.k8s.io/nfd/node-feature-discovery:v0.12.1   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,role=master

NAME                                                                    DESIRED   CURRENT   READY   AGE   CONTAINERS     IMAGES                                               SELECTOR
replicaset.apps/gpu-operator-8c64dfbbc                                  1         1         1       20s   gpu-operator   nvcr.io/nvidia/gpu-operator:v23.3.1                  app=gpu-operator,app.kubernetes.io/component=gpu-operator,pod-template-hash=8c64dfbbc
replicaset.apps/gpu-operator-node-feature-discovery-master-7db9bfdd5b   1         1         1       20s   master         registry.k8s.io/nfd/node-feature-discovery:v0.12.1   app.kubernetes.io/instance=gpu-operator,app.kubernetes.io/name=node-feature-discovery,pod-template-hash=7db9bfdd5b,role=master

[root@master88 gpu-operator-23.3.1]# kubectl describe pod gpu-operator-node-feature-discovery-worker-qqbfj -n gpu-operator
Name:           gpu-operator-node-feature-discovery-worker-qqbfj
Namespace:      gpu-operator
Priority:       0
Node:           node002/192.168.61.92
Start Time:     Mon, 18 Sep 2023 11:46:31 +0900
Labels:         app.kubernetes.io/instance=gpu-operator
                app.kubernetes.io/name=node-feature-discovery
                controller-revision-hash=67b4854db8
                pod-template-generation=1
                role=worker
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  DaemonSet/gpu-operator-node-feature-discovery-worker
Containers:
  worker:
    Container ID:
    Image:         registry.k8s.io/nfd/node-feature-discovery:v0.12.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      nfd-worker
    Args:
      --server=gpu-operator-node-feature-discovery-master:8080
      -enable-nodefeature-api
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/kubernetes/node-feature-discovery from nfd-worker-conf (ro)
      /etc/kubernetes/node-feature-discovery/features.d/ from features-d (ro)
      /etc/kubernetes/node-feature-discovery/source.d/ from source-d (ro)
      /host-boot from host-boot (ro)
      /host-etc/os-release from host-os-release (ro)
      /host-sys from host-sys (ro)
      /host-usr/lib from host-usr-lib (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lj6vg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  host-boot:
    Type:          HostPath (bare host directory volume)
    Path:          /boot
    HostPathType:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:
  host-usr-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib
    HostPathType:
  source-d:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/node-feature-discovery/source.d/
    HostPathType:
  features-d:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/node-feature-discovery/features.d/
    HostPathType:
  nfd-worker-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      gpu-operator-node-feature-discovery-worker-conf
    Optional:  false
  kube-api-access-lj6vg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                  Age               From               Message
  ----     ------                  ----              ----               -------
  Normal   Scheduled               53s               default-scheduler  Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-qqbfj to node002
  Warning  FailedCreatePodSandBox  32s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "registry.k8s.io/pause:3.6": failed to pull image "registry.k8s.io/pause:3.6": failed to pull and unpack image "registry.k8s.io/pause:3.6": failed to resolve reference "registry.k8s.io/pause:3.6": failed to do request: Head "https://registry.k8s.io/v2/pause/manifests/3.6": dial tcp: lookup registry.k8s.io on 192.168.61.88:53: read udp 192.168.61.92:55766->192.168.61.88:53: i/o timeout
  Warning  FailedCreatePodSandBox  12s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "registry.k8s.io/pause:3.6": failed to pull image "registry.k8s.io/pause:3.6": failed to pull and unpack image "registry.k8s.io/pause:3.6": failed to resolve reference "registry.k8s.io/pause:3.6": failed to do request: Head "https://registry.k8s.io/v2/pause/manifests/3.6": dial tcp: lookup registry.k8s.io on 192.168.61.88:53: server misbehaving
  Warning  DNSConfigForming        1s (x3 over 52s)  kubelet            Search Line limits were exceeded, some search paths have been omitted, the applied search line is: gpu-operator.svc.cluster.local svc.cluster.local cluster.local cm.cluster brightcomputing.com idrac.cluster
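
The FailedCreatePodSandBox events above point to DNS resolution of registry.k8s.io failing on node002 (lookups against 192.168.61.88:53 time out), rather than to the GPU stack itself. A quick check from node002 could look like this (a sketch; assumes nslookup and crictl are available on the node):

nslookup registry.k8s.io 192.168.61.88    # should resolve; the events show it timing out
crictl pull registry.k8s.io/pause:3.6     # retries the sandbox image pull directly
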
karanveersingh5623 commented 1 year ago

I tried using other nodes' images, but it's still not working. How is that possible?

rayburgemeestre commented 1 year ago

Hi Karan, could you provide the output of ls -al /cm/images/<image_name>/var/lib/containerd (for the relevant software image), to rule out that the problem is the one I was suspecting?

You can look up the software image for the device you are focusing on (e.g., node002 in your case: cmsh -c 'device; use node002; get softwareimage').