NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

GPU is not available with a GPU EC2 instance in EKS cluster (1.23) #344

Open garyyang6 opened 2 years ago

garyyang6 commented 2 years ago

1. Issue or feature description

In an EKS cluster (1.23), I launched an EC2 instance (Ubuntu) with the instance type g5.2xlarge. However, the GPU is not available on the node:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu"

NAME                                         GPU
ip-10-2-1-197.us-west-2.compute.internal   <none>
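
For a cross-check, the node's advertised resources can be inspected directly (node name taken from the output above). On a healthy GPU node, nvidia.com/gpu appears under both Capacity and Allocatable:

kubectl describe node ip-10-2-1-197.us-west-2.compute.internal | grep -i -A 7 "Capacity"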

2. Steps to reproduce the issue

I enabled GPU support by deploying the nvidia-device-plugin daemonset:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
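
A quick way to confirm the plugin pod actually landed on the GPU node (the namespace and label below are the ones used by the v0.12.3 manifest; adjust if your deployment differs):

kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide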

Then deploy a pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    node.kubernetes.io/instance-type: g5.2xlarge
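
If the pod stays Pending, its events usually say why. A quick check, using the pod name gpu-pod from the manifest above:

kubectl describe pod gpu-pod
# A typical failure event here is "0/1 nodes are available: 1 Insufficient nvidia.com/gpu",
# meaning the scheduler sees the node but no advertised GPU resource.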

I logged in to the Ubuntu EC2 instance and ran the following command. It shows that the instance has one GPU:

sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

Tue Nov 15 01:00:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   12C    P8    14W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
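
Note that this test only proves Docker's runtime can reach the GPU. On EKS 1.23 the kubelet typically talks to containerd rather than Docker, so it is worth checking which runtime the cluster actually reports, e.g.:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
# "containerd://..." here means the Docker runtime configured below is not the one launching pods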

3. Information to attach (optional if deemed irrelevant)

Common error checking:

Timestamp                           : Tue Nov 15 01:06:47 2022
Driver Version                      : 510.85.02
CUDA Version                        : 11.6

Attached GPUs                       : 1
GPU 00000000:00:1E.0
    Product Name                    : NVIDIA A10G
    Product Brand                   : NVIDIA RTX
    Product Architecture            : Ampere
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    MIG Mode
        Current                     : N/A
        Pending                     : N/A
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 1321321008039
    GPU UUID                        : GPU-2600e701-8d2f-704c-06bd-ca16a9306dfe
    Minor Number                    : 0
    VBIOS Version                   : 94.02.75.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x1e
    GPU Part Number                 : 900-2G133-A840-000
    Module ID                       : 0
    Inforom Version
        Image Version               : G133.0210.00.04
        OEM Object                  : 2.0
        ECC Object                  : 6.16
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GSP Firmware Version            : N/A
    GPU Virtualization Mode
        Virtualization Mode         : Pass-Through
        Host VGPU Mode              : N/A
    vGPU Software Licensed Product
        Product Name                : NVIDIA RTX Virtual Workstation
        License Status              : Licensed (Expiry: N/A)
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x00
        Device                      : 0x1E
        Domain                      : 0x0000
        Device Id                   : 0x223710DE
        Bus Id                      : 00000000:00:1E.0
        Sub System Id               : 0x152F10DE
        GPU Link Info
            PCIe Generation
                Max                 : 4
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 8x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 0 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 23028 MiB
        Reserved                    : 296 MiB
        Used                        : 0 MiB
        Free                        : 22731 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 1 MiB
        Free                        : 32767 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            SRAM Correctable        : 0
            SRAM Uncorrectable      : 0
            DRAM Correctable        : 0
            DRAM Uncorrectable      : 0
        Aggregate
            SRAM Correctable        : 0
            SRAM Uncorrectable      : 0
            DRAM Correctable        : 0
            DRAM Uncorrectable      : 0
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Remapped Rows
        Correctable Error           : 0
        Uncorrectable Error         : 0
        Pending                     : No
        Remapping Failure Occurred  : No
        Bank Remap Availability Histogram
            Max                     : 192 bank(s)
            High                    : 0 bank(s)
            Partial                 : 0 bank(s)
            Low                     : 0 bank(s)
            None                    : 0 bank(s)
    Temperature
        GPU Current Temp            : 12 C
        GPU Shutdown Temp           : 98 C
        GPU Slowdown Temp           : 95 C
        GPU Max Operating Temp      : 88 C
        GPU Target Temperature      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 17.71 W
        Power Limit                 : 300.00 W
        Default Power Limit         : 300.00 W
        Enforced Power Limit        : 300.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 210 MHz
        SM                          : 210 MHz
        Memory                      : 405 MHz
        Video                       : 555 MHz
    Applications Clocks
        Graphics                    : 1710 MHz
        Memory                      : 6251 MHz
    Default Applications Clocks
        Graphics                    : 1710 MHz
        Memory                      : 6251 MHz
    Max Clocks
        Graphics                    : 1710 MHz
        SM                          : 1710 MHz
        Memory                      : 6251 MHz
        Video                       : 1500 MHz
    Max Customer Boost Clocks
        Graphics                    : 1710 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Voltage
        Graphics                    : 700.000 mV
    Processes                       : None

 - [ ] Your docker configuration file (e.g: `/etc/docker/daemon.json`)
$ sudo cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}


 - [ ] The k8s-device-plugin container logs
 How do I get the k8s-device-plugin container logs?
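 For reference, a sketch assuming the daemonset created by the v0.12.3 manifest above (namespace kube-system, pod label name=nvidia-device-plugin-ds; adjust if yours differs):

 kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
 kubectl logs -n kube-system -l name=nvidia-device-plugin-ds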
 - [ ] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)

$ sudo journalctl -r -u kubelet
-- Logs begin at Mon 2022-11-14 23:28:14 UTC, end at Tue 2022-11-15 01:12:27 UTC. --
-- No entries --
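
"-- No entries --" on a running node is itself suspicious. Two generic checks that may help (sketches, not specific to this AMI): confirm the kubelet unit name, and widen the time window:

systemctl list-units --type=service | grep -i kube   # confirm the kubelet service unit name
sudo journalctl -u kubelet --no-pager --since "2022-11-14"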


Additional information that might help better understand your environment and reproduce the bug:
 - [ ] Docker version from `docker version`

sudo docker version

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:02:21 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:00:04 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 nvidia:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [ ] Docker command, image and tag used
 - [ ] Kernel version from `uname -a`

uname -a
Linux ip-10-2-1-197 5.15.0-1022-aws #26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux


 - [ ] Any relevant kernel output lines from `dmesg`
 No clue what info I should provide. 
 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

$ dpkg -l 'nvidia'_or_rpm -qa 'nvidia'
sh: 1: or: not found
dpkg-query: no packages found matching nvidiarpm
dpkg-query: no packages found matching -qa
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
un  libgldispatch0-nvidia                                   (no description available)
ii  libnvidia-container-tools     1.11.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.11.0-1     amd64        NVIDIA container runtime library
un  nvidia-container-runtime                                (no description available)
un  nvidia-container-runtime-hook                           (no description available)
ii  nvidia-container-toolkit      1.11.0-1     amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.11.0-1     amd64        NVIDIA Container Toolkit Base
un  nvidia-docker                                           (no description available)
ii  nvidia-docker2                2.11.0-1     all          nvidia-docker CLI wrapper
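
For reference, the two commands in the checklist item are alternatives, and the globs need quoting; the shell errors above come from pasting both as one literal command. The intended forms:

dpkg -l '*nvidia*'    # Debian/Ubuntu
rpm -qa '*nvidia*'    # RPM-based distros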


 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections



 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
klueska commented 1 year ago

In Kubernetes 1.23, containerd is the default runtime. Have you configured containerd to use the nvidia-container-runtime as its default runtime?
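
For reference, a minimal sketch of that containerd configuration (section names and the runtime binary path are the usual nvidia-container-toolkit defaults, not values taken from this cluster; merge into /etc/containerd/config.toml):

version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

# then restart containerd so the kubelet picks up the new default runtime
sudo systemctl restart containerd

Newer releases of the nvidia-container-toolkit also ship a helper that can write this section for you: nvidia-ctk runtime configure --runtime=containerd.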

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.