4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the device's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Running nvidia-smi in a pod fails #25

chenyangxueHDU closed this issue 2 years ago

chenyangxueHDU commented 2 years ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

Create a pod with the following spec:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: xx:runtime-py3.6-cudnn7.3-cuda9.2-centos7
      command: 
        - /bin/bash
        - -c
        - sleep 1d
      env:
        - name: LIBCUDA_LOG_LEVEL
          value: "5"
      resources:
        limits:
          nvidia.com/gpu: 2

Running `nvidia-smi` inside the pod fails:

[root@gpu-pod /]# nvidia-smi
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceGetMemoryInfo_v2 in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceSetTemperatureThreshold in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlVgpuInstanceGetGpuInstanceId in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:339)]: NVML error at line 339: 1
Failed to initialize NVML: Unknown Error
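
The warnings above name the NVML entry points that the 4pdvGPU hook could not resolve in `libnvidia-ml.so.1`. As a minimal sketch for triaging similar output, the Python snippet below extracts those symbol names from the log text. The regular expression and the function name `missing_nvml_symbols` are assumptions inferred only from the lines quoted in this issue; they are not part of the project.

```python
import re

# Pattern inferred from the warning lines quoted in this issue (an assumption,
# not a documented log format of the vGPU hook).
MISSING_FN = re.compile(r"can't find function (\w+) in (\S+)")


def missing_nvml_symbols(log_text):
    """Return (symbol, library) pairs for every "can't find function" warning."""
    return [match.groups() for match in MISSING_FN.finditer(log_text)]


# Output copied from the pod in this issue.
LOG = """\
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceGetMemoryInfo_v2 in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlDeviceSetTemperatureThreshold in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:396)]: can't find function nvmlVgpuInstanceGetGpuInstanceId in libnvidia-ml.so.1
[4pdvGPU Warn(38:140500277225280:hook.c:339)]: NVML error at line 339: 1
Failed to initialize NVML: Unknown Error
"""

if __name__ == "__main__":
    for symbol, library in missing_nvml_symbols(LOG):
        print(f"{library}: missing {symbol}")
```

If the host driver is older than the NVML API the hook was built against, symbols like these may simply not be exported. Listing the dynamic symbols of the host's `libnvidia-ml.so.1` (for example with `nm -D`) is one way to check that.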

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

==============NVSMI LOG==============

Timestamp : Sat Aug 13 16:10:39 2022
Driver Version : 455.38
CUDA Version : 11.1

Attached GPUs : 4
GPU 00000000:02:00.0
    Product Name : TITAN V
    Product Brand : Titan
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0320618057831
    GPU UUID : GPU-f5a3f95f-2685-cf01-2063-7bc624963433
    Minor Number : 0
    VBIOS Version : 88.00.41.00.18
    MultiGPU Board : No
    Board ID : 0x200
    GPU Part Number : 900-1G500-2500-000
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x02
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1D8110DE
        Bus Id : 00000000:02:00.0
        Sub System Id : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 28 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 12066 MiB
        Used : 0 MiB
        Free : 12066 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 39 C
        GPU Shutdown Temp : 100 C
        GPU Slowdown Temp : 97 C
        GPU Max Operating Temp : 91 C
        Memory Current Temp : 36 C
        Memory Max Operating Temp : 95 C
    Power Readings
        Power Management : Supported
        Power Draw : 27.28 W
        Power Limit : 250.00 W
        Default Power Limit : 250.00 W
        Enforced Power Limit : 250.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 135 MHz
        SM : 135 MHz
        Memory : 850 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Default Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Max Clocks
        Graphics : 1912 MHz
        SM : 1912 MHz
        Memory : 850 MHz
        Video : 1717 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

GPU 00000000:03:00.0
    Product Name : TITAN V
    Product Brand : Titan
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0320618058217
    GPU UUID : GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
    Minor Number : 1
    VBIOS Version : 88.00.41.00.18
    MultiGPU Board : No
    Board ID : 0x300
    GPU Part Number : 900-1G500-2500-000
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x03
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1D8110DE
        Bus Id : 00000000:03:00.0
        Sub System Id : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 31 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 12066 MiB
        Used : 0 MiB
        Free : 12066 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 45 C
        GPU Shutdown Temp : 100 C
        GPU Slowdown Temp : 97 C
        GPU Max Operating Temp : 91 C
        Memory Current Temp : 43 C
        Memory Max Operating Temp : 95 C
    Power Readings
        Power Management : Supported
        Power Draw : 30.67 W
        Power Limit : 250.00 W
        Default Power Limit : 250.00 W
        Enforced Power Limit : 250.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 135 MHz
        SM : 135 MHz
        Memory : 850 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Default Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Max Clocks
        Graphics : 1912 MHz
        SM : 1912 MHz
        Memory : 850 MHz
        Video : 1717 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

GPU 00000000:82:00.0
    Product Name : TITAN V
    Product Brand : Titan
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0324917182924
    GPU UUID : GPU-e2336a65-b527-8ba6-c005-209ebc071c78
    Minor Number : 2
    VBIOS Version : 88.00.36.00.01
    MultiGPU Board : No
    Board ID : 0x8200
    GPU Part Number : 900-1G500-2500-000
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x82
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1D8110DE
        Bus Id : 00000000:82:00.0
        Sub System Id : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 28 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 12066 MiB
        Used : 0 MiB
        Free : 12066 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 42 C
        GPU Shutdown Temp : 100 C
        GPU Slowdown Temp : 97 C
        GPU Max Operating Temp : 91 C
        Memory Current Temp : 38 C
        Memory Max Operating Temp : 95 C
    Power Readings
        Power Management : Supported
        Power Draw : 27.65 W
        Power Limit : 250.00 W
        Default Power Limit : 250.00 W
        Enforced Power Limit : 250.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 135 MHz
        SM : 135 MHz
        Memory : 850 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Default Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Max Clocks
        Graphics : 1912 MHz
        SM : 1912 MHz
        Memory : 850 MHz
        Video : 1717 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

GPU 00000000:83:00.0
    Product Name : TITAN V
    Product Brand : Titan
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0320618057916
    GPU UUID : GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
    Minor Number : 3
    VBIOS Version : 88.00.41.00.18
    MultiGPU Board : No
    Board ID : 0x8300
    GPU Part Number : 900-1G500-2500-000
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x83
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1D8110DE
        Bus Id : 00000000:83:00.0
        Sub System Id : 0x121810DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 28 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 12066 MiB
        Used : 0 MiB
        Free : 12066 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 41 C
        GPU Shutdown Temp : 100 C
        GPU Slowdown Temp : 97 C
        GPU Max Operating Temp : 91 C
        Memory Current Temp : 40 C
        Memory Max Operating Temp : 95 C
    Power Readings
        Power Management : Supported
        Power Draw : 25.91 W
        Power Limit : 250.00 W
        Default Power Limit : 250.00 W
        Enforced Power Limit : 250.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 135 MHz
        SM : 135 MHz
        Memory : 850 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Default Applications Clocks
        Graphics : 1200 MHz
        Memory : 850 MHz
    Max Clocks
        Graphics : 1912 MHz
        SM : 1912 MHz
        Memory : 850 MHz
        Video : 1717 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

 - [ ] Your docker configuration file (e.g: `/etc/docker/daemon.json`)

{
  "init": true,
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
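
For readers comparing their own setup, here is a minimal sketch of a sanity check on this Docker configuration: it verifies that `nvidia` is the default runtime and that a runtime entry with a path is registered. The helper name `check_nvidia_runtime` and the wording of its messages are hypothetical; only the JSON keys come from the file quoted above.

```python
import json

# daemon.json content as posted in this issue.
DAEMON_JSON = """
{
  "init": true,
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
"""


def check_nvidia_runtime(config):
    """Return a list of problems found in a parsed daemon.json dict."""
    problems = []
    runtimes = config.get("runtimes", {})
    if config.get("default-runtime") != "nvidia":
        problems.append("default-runtime is not 'nvidia'")
    if "nvidia" not in runtimes:
        problems.append("no 'nvidia' entry under runtimes")
    elif not runtimes["nvidia"].get("path"):
        problems.append("nvidia runtime has no path")
    return problems


if __name__ == "__main__":
    issues = check_nvidia_runtime(json.loads(DAEMON_JSON))
    print(issues if issues else "daemon.json looks consistent")
```

The configuration from this issue passes all three checks, which suggests the Docker runtime setup is probably not the cause of the NVML failure.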

 - [ ] The k8s-device-plugin container [logs](url)

2022/08/13 07:41:28 Starting FS watcher.
2022/08/13 07:41:28 Starting OS watcher.
2022/08/13 07:41:28 Retreiving plugins.
2022/08/13 07:41:28 migstrategy= none
2022/08/13 07:41:28 uuid= GPU-f5a3f95f-2685-cf01-2063-7bc624963433
2022/08/13 07:41:28 uuid= GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
2022/08/13 07:41:28 uuid= GPU-e2336a65-b527-8ba6-c005-209ebc071c78
2022/08/13 07:41:28 uuid= GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
2022/08/13 07:41:28 Starting GRPC server for 'nvidia.com/gpu'
2022/08/13 07:41:28 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2022/08/13 07:41:28 Registered device plugin for 'nvidia.com/gpu' with Kubelet
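
One quick cross-check is whether the device plugin registered the same GPUs that `nvidia-smi` reports on the host. The sketch below pulls GPU UUIDs out of both pieces of text and compares the sets. The regular expression and the helper name `gpu_uuids` are assumptions based on the UUID shape seen in this issue, and the host excerpt keeps only the `GPU UUID` lines from the NVSMI log.

```python
import re

# UUID shape as it appears in this issue: "GPU-" followed by 8-4-4-4-12 hex digits.
UUID_RE = re.compile(r"GPU-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def gpu_uuids(text):
    """Return the set of GPU UUIDs mentioned anywhere in the text."""
    return set(UUID_RE.findall(text))


# UUID lines from the device plugin log in this issue.
PLUGIN_LOG = """\
2022/08/13 07:41:28 uuid= GPU-f5a3f95f-2685-cf01-2063-7bc624963433
2022/08/13 07:41:28 uuid= GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
2022/08/13 07:41:28 uuid= GPU-e2336a65-b527-8ba6-c005-209ebc071c78
2022/08/13 07:41:28 uuid= GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
"""

# "GPU UUID" fields from the host NVSMI log in this issue.
HOST_NVSMI = """\
GPU UUID : GPU-f5a3f95f-2685-cf01-2063-7bc624963433
GPU UUID : GPU-3d5859fc-73e1-23d1-2e59-78c5e7049d61
GPU UUID : GPU-e2336a65-b527-8ba6-c005-209ebc071c78
GPU UUID : GPU-5908bbc9-ddab-ebe7-5624-446b3fc15348
"""

if __name__ == "__main__":
    only_on_host = gpu_uuids(HOST_NVSMI) - gpu_uuids(PLUGIN_LOG)
    print("GPUs not registered by the plugin:", sorted(only_on_host))
```

Here the two sets match, so all four TITAN V cards were discovered by the plugin.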

 - [ ] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:
 - [ ] Docker version from `docker version`

Client: Docker Engine - Community
 Version: 19.03.3
 API version: 1.40
 Go version: go1.12.10
 Git commit: a872fc2f86
 Built: Tue Oct 8 00:58:10 2019
 OS/Arch: linux/amd64
 Experimental: false

Server: Docker Engine - Community
 Engine:
  Version: 19.03.13
  API version: 1.40 (minimum version 1.12)
  Go version: go1.13.15
  Git commit: 4484c46d9d
  Built: Wed Sep 16 17:02:21 2020
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.2.10
  GitCommit: b34a5c8af56e510852c35414db4c1f4fa6172339
 nvidia:
  Version: 1.0.0-rc8+dev
  GitCommit: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683

 - [ ] Docker command, image and tag used
 - [ ] Kernel version from `uname -a`
Linux xxx 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
 - [ ] Any relevant kernel output lines from `dmesg`
 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`
 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.0.0
build date: 2018-03-06T02:05+0000
build revision: be797da00b156493e80f1ae6f38d69f23c932554
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-16)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))