Closed another-pjohnson closed 7 months ago
I have the same problem with the GPU locked at P0, but I didn't have Docker running.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.
1. Issue or feature description
I noticed that the nvidia-device-plugin daemon keeps my gpu_0 in the P0 state. On the TITAN RTX, this means it runs at 1350 MHz even when the system is idle and nothing is running on the GPU.
Disabling the kubelet service on the node and then stopping the container running nvidia-device-plugin resolves the issue.
2. Steps to reproduce the issue
Install all prerequisites by following: https://docs.nvidia.com/datacenter/kubernetes/kubernetes-install-guide/index.html
After setting up the master and worker nodes, run nvidia-smi on any node and see that its power state is P0.
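For anyone who wants to check several nodes for the same symptom, here is a minimal sketch that flags GPUs reporting P0 at 0 % utilization. It assumes the input is the CSV printed by `nvidia-smi --query-gpu=index,pstate,clocks.sm,utilization.gpu --format=csv,noheader,nounits`. The names `GpuState`, `parse_gpu_states`, and `idle_gpus_in_p0` are made up for this example, and the sample rows only mirror the values reported in this issue.

```python
import csv
import io
from typing import List, NamedTuple


class GpuState(NamedTuple):
    """One row of the assumed nvidia-smi CSV query output (hypothetical helper type)."""
    index: int
    pstate: str
    sm_clock_mhz: int
    utilization_pct: int


def parse_gpu_states(csv_text: str) -> List[GpuState]:
    """Parse 'index, pstate, clocks.sm, utilization.gpu' rows (no header, no units)."""
    states = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, pstate, sm_clock, util = (field.strip() for field in row)
        states.append(GpuState(int(index), pstate, int(sm_clock), int(util)))
    return states


def idle_gpus_in_p0(states: List[GpuState]) -> List[int]:
    """Return the indices of GPUs that report P0 while showing 0 % utilization."""
    return [s.index for s in states if s.pstate == "P0" and s.utilization_pct == 0]


# Sample rows built from the values in this report (GPU 0 in P0, GPU 1 in P8).
sample = "0, P0, 1350, 0\n1, P8, 300, 0\n"
print(idle_gpus_in_p0(parse_gpu_states(sample)))  # prints [0]
```

On a real node, the output of the nvidia-smi query above could be passed to `parse_gpu_states` in place of `sample`. A GPU that is genuinely busy will also sit in P0, so the utilization check is what separates the stuck case from normal load.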
3. Information to attach (optional if deemed irrelevant)
Common error checking:
nvidia-smi -a on your host:

Timestamp                           : Mon Apr 29 11:35:02 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1
Attached GPUs                       : 2
GPU 00000000:1A:00.0
    Product Name                    : TITAN RTX
    Product Brand                   : Titan
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0325018152742
    GPU UUID                        : GPU-27e5d59b-0bef-70b5-2768-a1aee84e2485
    Minor Number                    : 0
    VBIOS Version                   : 90.02.23.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x1a00
    GPU Part Number                 : 900-1G150-2500-000
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x1A
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1E0210DE
        Bus Id                      : 00000000:1A:00.0
        Sub System Id               : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 41 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 24190 MiB
        Used                        : 10 MiB
        Free                        : 24180 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 3 MiB
        Free                        : 253 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 40 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 73.93 W
        Power Limit                 : 280.00 W
        Default Power Limit         : 280.00 W
        Enforced Power Limit        : 280.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 320.00 W
    Clocks
        Graphics                    : 1350 MHz
        SM                          : 1350 MHz
        Memory                      : 7000 MHz
        Video                       : 1245 MHz
    Applications Clocks
        Graphics                    : 1350 MHz
        Memory                      : 7001 MHz
    Default Applications Clocks
        Graphics                    : 1350 MHz
        Memory                      : 7001 MHz
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7001 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None
GPU 00000000:68:00.0
    Product Name                    : TITAN RTX
    Product Brand                   : Titan
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0325018152708
    GPU UUID                        : GPU-49e7d8ba-1f24-c3d0-fd77-da03c38d18e2
    Minor Number                    : 1
    VBIOS Version                   : 90.02.23.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x6800
    GPU Part Number                 : 900-1G150-2500-000
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x68
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1E0210DE
        Bus Id                      : 00000000:68:00.0
        Sub System Id               : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 40 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 24187 MiB
        Used                        : 10 MiB
        Free                        : 24177 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 3 MiB
        Free                        : 253 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 26.02 W
        Power Limit                 : 280.00 W
        Default Power Limit         : 280.00 W
        Enforced Power Limit        : 280.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 320.00 W
    Clocks
        Graphics                    : 300 MHz
        SM                          : 300 MHz
        Memory                      : 405 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : 1350 MHz
        Memory                      : 7001 MHz
    Default Applications Clocks
        Graphics                    : 1350 MHz
        Memory                      : 7001 MHz
    Max Clocks
        Graphics                    : 2100 MHz
        SM                          : 2100 MHz
        Memory                      : 7001 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Client:
 Version:           18.09.4
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        d14af54
 Built:             Wed Mar 27 18:34:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.4
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       d14af54
  Built:            Wed Mar 27 18:01:48 2019
  OS/Arch:          linux/amd64
  Experimental:     false
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-============================================-===========================-===========================-=============================================================================================
ii libnvidia-container-tools 1.0.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.2-1 amd64 NVIDIA container runtime library
ii nvidia-418 418.56-0ubuntu0~gpu16.04.1 amd64 NVIDIA binary driver - version 418.56
ii nvidia-418-dev 418.56-0ubuntu0~gpu16.04.1 amd64 NVIDIA binary Xorg driver development files
un nvidia-common (no description available)
ii nvidia-container-runtime 2.0.0+docker18.09.4-1 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
un nvidia-docker (no description available)
ii nvidia-docker2 2.0.3+docker18.09.4-1 all nvidia-docker CLI wrapper
un nvidia-driver-binary (no description available)
un nvidia-legacy-340xx-vdpau-driver (no description available)
un nvidia-libopencl1-418 (no description available)
un nvidia-libopencl1-dev (no description available)
ii nvidia-modprobe 418.39-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
un nvidia-opencl-icd (no description available)
ii nvidia-opencl-icd-418 418.56-0ubuntu0~gpu16.04.1 amd64 NVIDIA OpenCL ICD
un nvidia-persistenced (no description available)
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 418.56-0ubuntu0~gpu16.04.1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary (no description available)
un nvidia-smi (no description available)
un nvidia-vdpau-driver (no description available)
version: 1.0.2
build date: 2019-03-26T03:56+00:00
build revision: ff40da533db929bf515aca59ba4c701a65a35e6b
build compiler: gcc-5 5.4.0 20160609
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections