NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

nvidia-device-plugin locks GPU_0 to P0 at idle instead of P8 #110

Closed another-pjohnson closed 7 months ago

another-pjohnson commented 5 years ago

1. Issue or feature description

I noticed that the nvidia-device-plugin daemon keeps GPU 0 in the P0 performance state. On the TITAN RTX, this means the GPU runs at 1350 MHz even when the system is idle and nothing is running on the GPU.

Disabling the kubelet service on the node and then stopping the container that runs nvidia-device-plugin resolves the issue.

2. Steps to reproduce the issue

Install all prerequisites by following: https://docs.nvidia.com/datacenter/kubernetes/kubernetes-install-guide/index.html

After setting up the master and worker nodes, run `nvidia-smi` on any node and observe that GPU 0 is in performance state P0.
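The check in the last step can be scripted. The following is a minimal sketch, not part of the original report: it assumes `nvidia-smi` is on the `PATH` and relies on its `--query-gpu` CSV output. The helper names `parse_pstates`, `idle_gpus_in_p0`, and `query_gpus` are hypothetical.

```python
import subprocess

# Fields requested from nvidia-smi; order must match the parser below.
QUERY_FIELDS = "index,name,pstate,utilization.gpu"


def parse_pstates(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, pstate, util = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name,
                     "pstate": pstate, "util": int(util)})
    return gpus


def idle_gpus_in_p0(gpus):
    """Return indices of GPUs that report 0 % utilization but sit in P0."""
    return [g["index"] for g in gpus if g["util"] == 0 and g["pstate"] == "P0"]


def query_gpus():
    """Run nvidia-smi on the local node (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_pstates(out)
```

Calling `idle_gpus_in_p0(query_gpus())` on a node, once with the device plugin running and once after stopping it, would show whether the plugin is what holds the GPU in P0.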

3. Information to attach (optional if deemed irrelevant)

Common error checking:

Timestamp : Mon Apr 29 11:35:02 2019
Driver Version : 418.56
CUDA Version : 10.1

Attached GPUs : 2
GPU 00000000:1A:00.0
    Product Name : TITAN RTX
    Product Brand : Titan
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0325018152742
    GPU UUID : GPU-27e5d59b-0bef-70b5-2768-a1aee84e2485
    Minor Number : 0
    VBIOS Version : 90.02.23.00.01
    MultiGPU Board : No
    Board ID : 0x1a00
    GPU Part Number : 900-1G150-2500-000
    Inforom Version
        Image Version : G001.0000.02.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x1A
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1E0210DE
        Bus Id : 00000000:1A:00.0
        Sub System Id : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 41 %
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 24190 MiB
        Used : 10 MiB
        Free : 24180 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 3 MiB
        Free : 253 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending : N/A
    Temperature
        GPU Current Temp : 40 C
        GPU Shutdown Temp : 94 C
        GPU Slowdown Temp : 91 C
        GPU Max Operating Temp : 89 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 73.93 W
        Power Limit : 280.00 W
        Default Power Limit : 280.00 W
        Enforced Power Limit : 280.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 320.00 W
    Clocks
        Graphics : 1350 MHz
        SM : 1350 MHz
        Memory : 7000 MHz
        Video : 1245 MHz
    Applications Clocks
        Graphics : 1350 MHz
        Memory : 7001 MHz
    Default Applications Clocks
        Graphics : 1350 MHz
        Memory : 7001 MHz
    Max Clocks
        Graphics : 2100 MHz
        SM : 2100 MHz
        Memory : 7001 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

GPU 00000000:68:00.0
    Product Name : TITAN RTX
    Product Brand : Titan
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0325018152708
    GPU UUID : GPU-49e7d8ba-1f24-c3d0-fd77-da03c38d18e2
    Minor Number : 1
    VBIOS Version : 90.02.23.00.01
    MultiGPU Board : No
    Board ID : 0x6800
    GPU Part Number : 900-1G150-2500-000
    Inforom Version
        Image Version : G001.0000.02.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x68
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1E0210DE
        Bus Id : 00000000:68:00.0
        Sub System Id : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 40 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 24187 MiB
        Used : 10 MiB
        Free : 24177 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 3 MiB
        Free : 253 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending : N/A
    Temperature
        GPU Current Temp : 31 C
        GPU Shutdown Temp : 94 C
        GPU Slowdown Temp : 91 C
        GPU Max Operating Temp : 89 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 26.02 W
        Power Limit : 280.00 W
        Default Power Limit : 280.00 W
        Enforced Power Limit : 280.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 320.00 W
    Clocks
        Graphics : 300 MHz
        SM : 300 MHz
        Memory : 405 MHz
        Video : 540 MHz
    Applications Clocks
        Graphics : 1350 MHz
        Memory : 7001 MHz
    Default Applications Clocks
        Graphics : 1350 MHz
        Memory : 7001 MHz
    Max Clocks
        Graphics : 2100 MHz
        SM : 2100 MHz
        Memory : 7001 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

 - [ ] Your docker configuration file (e.g: `/etc/docker/daemon.json`)

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
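As a quick sanity check on a configuration like the one above, the sketch below verifies that the default runtime is `nvidia` and that a runtime of that name is registered. It is illustrative only and not part of the original report; the function name `check_default_runtime` is hypothetical.

```python
import json


def check_default_runtime(config_text, expected="nvidia"):
    """Return True if `expected` is the default runtime and is registered."""
    config = json.loads(config_text)
    runtimes = config.get("runtimes", {})
    return config.get("default-runtime") == expected and expected in runtimes
```

On a node, the function could be fed the contents of `/etc/docker/daemon.json`, for example `check_default_runtime(open("/etc/docker/daemon.json").read())`.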

Additional information that might help better understand your environment and reproduce the bug:
 - [ ] Docker version from `docker version`

Client:
 Version: 18.09.4
 API version: 1.39
 Go version: go1.10.8
 Git commit: d14af54
 Built: Wed Mar 27 18:34:51 2019
 OS/Arch: linux/amd64
 Experimental: false

Server: Docker Engine - Community
 Engine:
  Version: 18.09.4
  API version: 1.39 (minimum version 1.12)
  Go version: go1.10.8
  Git commit: d14af54
  Built: Wed Mar 27 18:01:48 2019
  OS/Arch: linux/amd64
  Experimental: false

 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-============================================-===========================-===========================-=============================================================================================
ii libnvidia-container-tools 1.0.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.2-1 amd64 NVIDIA container runtime library
ii nvidia-418 418.56-0ubuntu0~gpu16.04.1 amd64 NVIDIA binary driver - version 418.56
ii nvidia-418-dev 418.56-0ubuntu0~gpu16.04.1 amd64 NVIDIA binary Xorg driver development files
un nvidia-common (no description available)
ii nvidia-container-runtime 2.0.0+docker18.09.4-1 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
un nvidia-docker (no description available)
ii nvidia-docker2 2.0.3+docker18.09.4-1 all nvidia-docker CLI wrapper
un nvidia-driver-binary (no description available)
un nvidia-legacy-340xx-vdpau-driver (no description available)
un nvidia-libopencl1-418 (no description available)
un nvidia-libopencl1-dev (no description available)
ii nvidia-modprobe 418.39-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
un nvidia-opencl-icd (no description available)
ii nvidia-opencl-icd-418 418.56-0ubuntu0~gpu16.04.1 amd64 NVIDIA OpenCL ICD
un nvidia-persistenced (no description available)
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 418.56-0ubuntu0~gpu16.04.1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary (no description available)
un nvidia-smi (no description available)
un nvidia-vdpau-driver (no description available)


 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.0.2
build date: 2019-03-26T03:56+00:00
build revision: ff40da533db929bf515aca59ba4c701a65a35e6b
build compiler: gcc-5 5.4.0 20160609
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

imadcat commented 4 years ago

I have the same problem with the GPU locked at P0, but I didn't have Docker running.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 7 months ago

This issue was automatically closed due to inactivity.