NVIDIA / cloud-native-stack

Run cloud native workloads on NVIDIA GPUs
Apache License 2.0

Cannot validate install; GPU not available #13

Closed: joshwyatt closed this issue 2 years ago

joshwyatt commented 2 years ago

Good day fellow NVIDIANs (jwyatt here)

I am following the instructions to the letter in EGX Stack v3.1 for AWS - Install Guide for Ubuntu Server x86-64 and am running into an issue when trying to validate the installation with nvidia-smi.

I'm running on a g4dn.2xlarge.

In summary, the nvidia-smi pod stays in Pending indefinitely. The output of kubectl describe pod nvidia-smi reads:

Name:         nvidia-smi
Namespace:    default
Priority:     0
Node:         <none>
Labels:       run=nvidia-smi
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nvidia-smi:
    Image:      nvidia/cuda:11.1.1-base
    Port:       <none>
    Host Port:  <none>
    Args:
      nvidia-smi
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-v5g2r (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-v5g2r:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-v5g2r
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  35s (x103 over 151m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

I think the last line there is most relevant.

Describe Node

The output of kubectl describe node is:

Name:               ip-172-31-21-241
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.11.0-1020-aws
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=11
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-172-31-21-241
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    nvidia.com/gpu.present=true
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/master.version: v0.6.0
                    nfd.node.kubernetes.io/worker.version: v0.6.0
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 172.31.21.241/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.198.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 09 Dec 2021 19:00:28 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-172-31-21-241
  AcquireTime:     <unset>
  RenewTime:       Thu, 09 Dec 2021 22:04:22 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 09 Dec 2021 19:05:06 +0000   Thu, 09 Dec 2021 19:05:06 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:05:00 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  172.31.21.241
  Hostname:    ip-172-31-21-241
Capacity:
  cpu:                8
  ephemeral-storage:  64989720Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32407904Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  59894525853
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32305504Ki
  pods:               110
System Info:
  Machine ID:                 ec218707683e25881184af68445ebd87
  System UUID:                ec218707-683e-2588-1184-af68445ebd87
  Boot ID:                    d59c4692-4971-43f2-a237-2d2fe49434cc
  Kernel Version:             5.11.0-1020-aws
  OS Image:                   Ubuntu 20.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.13
  Kubelet Version:            v1.18.14
  Kube-Proxy Version:         v1.18.14
PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24
Non-terminated Pods:          (14 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  default                     gpu-operator-1639077567-node-feature-discovery-master-84485st4x    0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                     gpu-operator-1639077567-node-feature-discovery-worker-ffljl        0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                     gpu-operator-76fb8d5c55-rq7jj                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources      nvidia-container-toolkit-daemonset-g4rzv                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources      nvidia-driver-daemonset-gsl5l                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  kube-system                 calico-kube-controllers-7f94cf5997-zr46g                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         179m
  kube-system                 calico-node-8w9jt                                                  250m (3%)     0 (0%)      0 (0%)           0 (0%)         179m
  kube-system                 coredns-66bff467f8-hnbb6                                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system                 coredns-66bff467f8-msqdc                                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system                 etcd-ip-172-31-21-241                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-apiserver-ip-172-31-21-241                                    250m (3%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-controller-manager-ip-172-31-21-241                           200m (2%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-proxy-mssxq                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-scheduler-ip-172-31-21-241                                    100m (1%)     0 (0%)      0 (0%)           0 (0%)         3h3m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1 (12%)     0 (0%)
  memory             140Mi (0%)  340Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events: 

I notice that nvidia.com/gpu is not listed under Capacity, Allocatable, or Allocated resources.
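For anyone who wants to script this check, here is a minimal sketch in Python. It assumes the text comes from kubectl describe node and that the GPU resource is named nvidia.com/gpu, as in the pod spec above. The function names and the SAMPLE excerpt are illustrative only and are not part of the install guide.

```python
def allocatable_resources(describe_text):
    """Collect the entries listed under 'Allocatable:' in `kubectl describe node` text."""
    resources = {}
    in_section = False
    for line in describe_text.splitlines():
        if line.startswith("Allocatable:"):
            in_section = True
            continue
        if in_section:
            # Entries are indented; the first unindented line ends the section.
            if not line.startswith(" "):
                break
            key, _, value = line.strip().partition(":")
            resources[key] = value.strip()
    return resources


def has_gpu(describe_text, resource="nvidia.com/gpu"):
    """True if the node advertises at least one allocatable unit of `resource`."""
    value = allocatable_resources(describe_text).get(resource, "0")
    return value.isdigit() and int(value) > 0


# Excerpt of the node description pasted in this issue.
SAMPLE = """Allocatable:
  cpu:                8
  ephemeral-storage:  59894525853
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32305504Ki
  pods:               110
System Info:
"""

print(has_gpu(SAMPLE))  # False: the node advertises no nvidia.com/gpu
```

A False result here is consistent with the scheduler's "Insufficient nvidia.com/gpu" message: the pod requests a resource that the node never advertised.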

Thanks for your support.

angudadevops commented 2 years ago

@joshwyatt

Please provide the output of "kubectl get pods -A | grep gpu", as it looks like the GPU Operator didn't install properly.

Thanks, Anurag G
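As a rough sketch of how that output could be triaged, the Python below lists GPU-related pods whose STATUS is neither Running nor Completed. It assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...). The function name is hypothetical, and the sample rows are invented for illustration: the pod names come from the node description above, but the statuses are not from this cluster, whose actual pod states were never posted.

```python
def unhealthy_gpu_pods(get_pods_text, keyword="gpu"):
    """Return (namespace, name, status) for matching pods not Running or Completed."""
    healthy = {"Running", "Completed"}
    problems = []
    for line in get_pods_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, status = fields[:4]
        if (keyword in namespace or keyword in name) and status not in healthy:
            problems.append((namespace, name, status))
    return problems


# Hypothetical output: statuses are made up to show the parsing only.
SAMPLE_PODS = """NAMESPACE                NAME                                       READY   STATUS             RESTARTS   AGE
default                  gpu-operator-76fb8d5c55-rq7jj              1/1     Running            0          164m
gpu-operator-resources   nvidia-driver-daemonset-gsl5l              0/1     CrashLoopBackOff   12         164m
gpu-operator-resources   nvidia-container-toolkit-daemonset-g4rzv   0/1     Init:0/1           0          164m
kube-system              coredns-66bff467f8-hnbb6                   1/1     Running            0          3h3m
"""

for namespace, name, status in unhealthy_gpu_pods(SAMPLE_PODS):
    print(f"{namespace}/{name}: {status}")
```

This only looks at the STATUS column, so a pod that is Running but not ready (for example READY 0/1) would not be flagged. Checking the READY column as well would be a reasonable extension.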

joshwyatt commented 2 years ago

Thanks @angudadevops. I was able to get a successful installation using v4.2, and am happy to close this issue.