Closed joshwyatt closed 2 years ago
Good day fellow NVIDIANs (jwyatt here)
I am following the instructions to the letter in *EGX Stack v3.1 for AWS - Install Guide for Ubuntu Server x86-64* and am running into an issue when trying to validate the installation with `nvidia-smi`.
I'm running on a g4dn.2xlarge.
In summary, the `nvidia-smi` pod is hanging indefinitely. `kubectl describe pod nvidia-smi` reads:
```
Name:         nvidia-smi
Namespace:    default
Priority:     0
Node:         <none>
Labels:       run=nvidia-smi
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  nvidia-smi:
    Image:      nvidia/cuda:11.1.1-base
    Port:       <none>
    Host Port:  <none>
    Args:
      nvidia-smi
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-v5g2r (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-v5g2r:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-v5g2r
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  35s (x103 over 151m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
```
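For context, the pending pod above corresponds to a manifest along these lines (a sketch reconstructed from the describe output; this is not necessarily the exact spec the install guide uses):

```yaml
# Hypothetical reconstruction of the validation pod, based on the
# kubectl describe output above.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  labels:
    run: nvidia-smi
spec:
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:11.1.1-base
    args:
    - nvidia-smi
    resources:
      limits:
        nvidia.com/gpu: 1  # the extended resource the scheduler cannot satisfy
```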
I think the last line there is most relevant.
The output of `kubectl describe node` is:
```
Name:               ip-172-31-21-241
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.11.0-1020-aws
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=11
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-172-31-21-241
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    nvidia.com/gpu.present=true
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/master.version: v0.6.0
                    nfd.node.kubernetes.io/worker.version: v0.6.0
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 172.31.21.241/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.198.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 09 Dec 2021 19:00:28 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-172-31-21-241
  AcquireTime:     <unset>
  RenewTime:       Thu, 09 Dec 2021 22:04:22 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 09 Dec 2021 19:05:06 +0000   Thu, 09 Dec 2021 19:05:06 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:05:00 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  172.31.21.241
  Hostname:    ip-172-31-21-241
Capacity:
  cpu:                8
  ephemeral-storage:  64989720Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32407904Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  59894525853
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32305504Ki
  pods:               110
System Info:
  Machine ID:                 ec218707683e25881184af68445ebd87
  System UUID:                ec218707-683e-2588-1184-af68445ebd87
  Boot ID:                    d59c4692-4971-43f2-a237-2d2fe49434cc
  Kernel Version:             5.11.0-1020-aws
  OS Image:                   Ubuntu 20.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.13
  Kubelet Version:            v1.18.14
  Kube-Proxy Version:         v1.18.14
PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24
Non-terminated Pods:          (14 in total)
  Namespace               Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------               ----                                                               ------------  ----------  ---------------  -------------  ---
  default                 gpu-operator-1639077567-node-feature-discovery-master-84485st4x    0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                 gpu-operator-1639077567-node-feature-discovery-worker-ffljl        0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                 gpu-operator-76fb8d5c55-rq7jj                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources  nvidia-container-toolkit-daemonset-g4rzv                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources  nvidia-driver-daemonset-gsl5l                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  kube-system             calico-kube-controllers-7f94cf5997-zr46g                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         179m
  kube-system             calico-node-8w9jt                                                  250m (3%)     0 (0%)      0 (0%)           0 (0%)         179m
  kube-system             coredns-66bff467f8-hnbb6                                           100m (1%)    0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system             coredns-66bff467f8-msqdc                                           100m (1%)    0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system             etcd-ip-172-31-21-241                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system             kube-apiserver-ip-172-31-21-241                                    250m (3%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system             kube-controller-manager-ip-172-31-21-241                           200m (2%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system             kube-proxy-mssxq                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system             kube-scheduler-ip-172-31-21-241                                    100m (1%)    0 (0%)      0 (0%)           0 (0%)         3h3m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1 (12%)     0 (0%)
  memory             140Mi (0%)  340Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
```
I notice there is no `nvidia.com/gpu` listed under `Capacity`, `Allocatable`, or `Allocated resources`.
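One generic way to check this directly (not a step from the install guide, just a standard diagnostic) is to query each node's allocatable extended resources; on a healthy GPU node the device plugin registers `nvidia.com/gpu` with the kubelet:

```
# Show the nvidia.com/gpu count each node advertises; <none> means the
# device plugin never registered the GPU with the kubelet.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Check the health of the GPU Operator components, including the driver
# and container-toolkit daemonsets visible in the pod list above.
kubectl get pods -n gpu-operator-resources
```

If `nvidia.com/gpu` is absent, the scheduler will reject any pod requesting it with exactly the `Insufficient nvidia.com/gpu` event seen here.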
Thanks for your support.
@joshwyatt
Please provide the output of `kubectl get pods -A | grep gpu`, as it looks like the GPU Operator didn't install properly.
Thanks, Anurag G
Thanks @angudadevops. I was able to get a successful installation using v4.2, and am happy to close this issue.