Open garyyang6 opened 2 years ago
In Kubernetes 1.23 containerd is the default runtime in use. Have you configured containerd to use the nvidia-container-runtime as its default runtime?
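A minimal sketch of one way to check this on the node. It assumes containerd's configuration is at /etc/containerd/config.toml (a common default, not confirmed anywhere in this thread) and that the CRI plugin's `default_runtime_name` key selects the runtime. The function name is hypothetical, and a regex stands in for a real TOML parser to keep the sketch dependency-free.

```python
import re
from typing import Optional

# Matches an uncommented `default_runtime_name = "..."` assignment in a
# containerd config.toml. Lines starting with `#` do not match because only
# whitespace is allowed before the key.
_DEFAULT_RUNTIME = re.compile(
    r'^\s*default_runtime_name\s*=\s*"([^"]*)"', re.MULTILINE
)


def containerd_default_runtime(config_text: str) -> Optional[str]:
    """Return containerd's default CRI runtime name, or None if it is unset."""
    match = _DEFAULT_RUNTIME.search(config_text)
    return match.group(1) if match else None
```

Passing it the contents of the config file and getting back anything other than `nvidia` would be consistent with the device plugin not seeing the GPU, although other causes are possible.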
1. Issue or feature description
In EKS (1.23), I launched an Ubuntu EC2 instance of type g5.2xlarge. However, the GPU is not available in the cluster:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
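Because the resource name `nvidia.com/gpu` itself contains dots, a JSONPath column is easy to get wrong, and a wrong path shows `<none>` for every node regardless of the real state. A sketch that sidesteps this by reading the output of `kubectl get nodes -o json` instead. The function name is hypothetical.

```python
import json
from typing import Dict


def allocatable_gpus(nodes_json: str) -> Dict[str, int]:
    """Map each node name to its allocatable nvidia.com/gpu count (0 if absent)."""
    nodes = json.loads(nodes_json)
    result = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        allocatable = item.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result
```

A node that maps to 0 here has not had the GPU resource registered by the device plugin.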
2. Steps to reproduce the issue
I enabled GPU support by deploying the nvidia-device-plugin-daemonset:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
Deploy a pod.
Log in to the Ubuntu EC2 instance and run the following command. It shows that the instance has one GPU.
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
3. Information to attach (optional if deemed irrelevant)
Common error checking:
The output of nvidia-smi -a on your host:
$ sudo nvidia-smi -a
Timestamp : Tue Nov 15 01:06:47 2022
Driver Version : 510.85.02
CUDA Version : 11.6

Attached GPUs : 1
GPU 00000000:00:1E.0
    Product Name : NVIDIA A10G
    Product Brand : NVIDIA RTX
    Product Architecture : Ampere
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Enabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1321321008039
    GPU UUID : GPU-2600e701-8d2f-704c-06bd-ca16a9306dfe
    Minor Number : 0
    VBIOS Version : 94.02.75.00.01
    MultiGPU Board : No
    Board ID : 0x1e
    GPU Part Number : 900-2G133-A840-000
    Module ID : 0
    Inforom Version
        Image Version : G133.0210.00.04
        OEM Object : 2.0
        ECC Object : 6.16
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : Pass-Through
        Host VGPU Mode : N/A
    vGPU Software Licensed Product
        Product Name : NVIDIA RTX Virtual Workstation
        License Status : Licensed (Expiry: N/A)
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x00
        Device : 0x1E
        Domain : 0x0000
        Device Id : 0x223710DE
        Bus Id : 00000000:00:1E.0
        Sub System Id : 0x152F10DE
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 1
            Link Width
                Max : 16x
                Current : 8x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 0 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 23028 MiB
        Reserved : 296 MiB
        Used : 0 MiB
        Free : 22731 MiB
    BAR1 Memory Usage
        Total : 32768 MiB
        Used : 1 MiB
        Free : 32767 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 0
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 192 bank(s)
            High : 0 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 12 C
        GPU Shutdown Temp : 98 C
        GPU Slowdown Temp : 95 C
        GPU Max Operating Temp : 88 C
        GPU Target Temperature : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 17.71 W
        Power Limit : 300.00 W
        Default Power Limit : 300.00 W
        Enforced Power Limit : 300.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 210 MHz
        SM : 210 MHz
        Memory : 405 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : 1710 MHz
        Memory : 6251 MHz
    Default Applications Clocks
        Graphics : 1710 MHz
        Memory : 6251 MHz
    Max Clocks
        Graphics : 1710 MHz
        SM : 1710 MHz
        Memory : 6251 MHz
        Video : 1500 MHz
    Max Customer Boost Clocks
        Graphics : 1710 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 700.000 mV
    Processes : None
$ sudo cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
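The file above sets the NVIDIA runtime as Docker's default, which fits the working `docker run --gpus all` test. Note that daemon.json only configures Docker. If the kubelet talks to containerd directly, this file would not be consulted. A small sketch for sanity-checking such a file, with a hypothetical function name:

```python
import json
from typing import List


def docker_runtime_problems(daemon_json: str, runtime: str = "nvidia") -> List[str]:
    """Return a list of problems found in a Docker daemon.json; empty if none."""
    config = json.loads(daemon_json)
    problems = []
    default = config.get("default-runtime")
    if default != runtime:
        problems.append(f"default-runtime is {default!r}, expected {runtime!r}")
    runtimes = config.get("runtimes", {})
    if runtime not in runtimes:
        problems.append(f"runtimes has no entry named {runtime!r}")
    elif not runtimes[runtime].get("path"):
        problems.append(f"runtime {runtime!r} has no path")
    return problems
```

For the daemon.json shown in this issue the function returns an empty list, so the Docker side of the setup looks complete.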
$ sudo journalctl -r -u kubelet
-- Logs begin at Mon 2022-11-14 23:28:14 UTC, end at Tue 2022-11-15 01:12:27 UTC. --
-- No entries --
Client: Docker Engine - Community
 Version: 20.10.21
 API version: 1.41
 Go version: go1.18.7
 Git commit: baeda1f
 Built: Tue Oct 25 18:02:21 2022
 OS/Arch: linux/amd64
 Context: default
 Experimental: true

Server: Docker Engine - Community
 Engine:
  Version: 20.10.21
  API version: 1.41 (minimum version 1.12)
  Go version: go1.18.7
  Git commit: 3056208
  Built: Tue Oct 25 18:00:04 2022
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.6.9
  GitCommit: 1c90a442489720eec95342e1789ee8a5e1b9536f
 nvidia:
  Version: 1.1.4
  GitCommit: v1.1.4-0-g5fd4c4d
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0
$ uname -a
Linux ip-10-2-1-197 5.15.0-1022-aws #26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
sh: 1: or: not found
dpkg-query: no packages found matching nvidiarpm
dpkg-query: no packages found matching -qa
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
un  libgldispatch0-nvidia         <none>       <none>       (no description available)
ii  libnvidia-container-tools     1.11.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.11.0-1     amd64        NVIDIA container runtime library
un  nvidia-container-runtime      <none>       <none>       (no description available)
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.11.0-1     amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.11.0-1     amd64        NVIDIA Container Toolkit Base
un  nvidia-docker                 <none>       <none>       (no description available)
ii  nvidia-docker2                2.11.0-1     all          nvidia-docker CLI wrapper

$ nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections