XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG] unable to display power limit #86

Closed · liblaf closed this issue 1 year ago

liblaf commented 1 year ago

Required prerequisites

What version of nvitop are you using?

1.2.0

Operating system and version

Arch Linux x86_64

NVIDIA driver version

535.86.05

NVIDIA-SMI

Fri Aug  4 16:37:22 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   34C    P8               2W / 115W |     25MiB /  8188MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2072      G   /usr/lib/Xorg                                20MiB |
+---------------------------------------------------------------------------------------+

Python environment

Installation
pipx install nvitop
Python version & relevant libraries
3.11.3 (main, Jun  5 2023, 09:32:32) [GCC 13.1.1 20230429] linux
nvidia-ml-py==11.525.131
nvitop==1.2.0

Problem description

nvitop cannot display the power limit correctly.

$ nvitop --once
Fri Aug 04 16:42:38 2023
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 1.2.0       Driver Version: 535.86.05      CUDA Driver Version: 12.2 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╡
│   0  .. 4060 Laptop GPU  Off  │ 00000000:01:00.0 Off │                  N/A │
│ N/A   38C    P8      2W / N/A │      26MiB / 8188MiB │      0%      Default │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╛

Steps to Reproduce

Command lines:

nvitop --once

Traceback

No response

Logs

[DEBUG] 2023-08-04 16:43:45,692 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2023-08-04 16:43:45,694 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is available.

Expected behavior

nvitop should display the power limit, as nvidia-smi does.

Additional context

No response

XuehaiPan commented 1 year ago

Hi @liblaf, thanks for raising this issue.

Could you try the following code in your Python REPL?

$ python3
>>> from nvitop import *
>>> d = Device(0)
>>> libnvml.nvmlDeviceGetPowerUsage(d.handle)  # power usage in milliwatts (mW)
59085
>>> d.power_usage()  # power usage in milliwatts (mW)
59147
>>> libnvml.nvmlDeviceGetPowerManagementLimit(d.handle)  # power limit in milliwatts (mW)
400000
>>> d.power_limit()  # power limit in milliwatts (mW)
400000

Then we can dive deeper into this error.

liblaf commented 1 year ago

$ ${HOME}/.local/pipx/venvs/nvitop/bin/python
Python 3.11.3 (main, Jun  5 2023, 09:32:32) [GCC 13.1.1 20230429] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nvitop import *
>>> d = Device(0)
>>> libnvml.nvmlDeviceGetPowerUsage(d.handle)  # power usage in milliwatts (mW)
2348
>>> d.power_usage()  # power usage in milliwatts (mW)
2347
>>> libnvml.nvmlDeviceGetPowerManagementLimit(d.handle)  # power limit in milliwatts (mW)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/liblaf/.local/pipx/venvs/nvitop/lib/python3.11/site-packages/pynvml.py", line 2394, in nvmlDeviceGetPowerManagementLimit
    _nvmlCheckReturn(ret)
  File "/home/liblaf/.local/pipx/venvs/nvitop/lib/python3.11/site-packages/pynvml.py", line 855, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
>>> d.power_limit()  # power limit in milliwatts (mW)
'N/A'

Is this issue caused by a mismatch between nvidia-ml-py==11.525.131 and the nvidia==535.86.05 driver?
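
If it helps, here is a minimal sketch (plain pynvml plus importlib.metadata, not part of nvitop) to print the driver, NVML, and nvidia-ml-py versions side by side:

import importlib.metadata

import pynvml

pynvml.nvmlInit()
try:
    # Driver and NVML library versions reported by the running driver stack.
    print('driver version :', pynvml.nvmlSystemGetDriverVersion())
    print('NVML version   :', pynvml.nvmlSystemGetNVMLVersion())
    # Version of the installed nvidia-ml-py Python bindings.
    print('nvidia-ml-py   :', importlib.metadata.version('nvidia-ml-py'))
finally:
    pynvml.nvmlShutdown()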

liblaf commented 1 year ago

I tried installing the latest nvidia-ml-py==12.535.77, but it produces the same error.

XuehaiPan commented 1 year ago

@liblaf Thanks for the feedback.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/liblaf/.local/pipx/venvs/nvitop/lib/python3.11/site-packages/pynvml.py", line 2394, in nvmlDeviceGetPowerManagementLimit
    _nvmlCheckReturn(ret)
  File "/home/liblaf/.local/pipx/venvs/nvitop/lib/python3.11/site-packages/pynvml.py", line 855, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported

Based on your traceback, your NVIDIA product may not support the power management feature.

According to the NVML documentation for nvmlDeviceGetPowerManagementLimit:

Returns:
- NVML_SUCCESS if limit has been set
- NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
- NVML_ERROR_INVALID_ARGUMENT if device is invalid or limit is NULL
- NVML_ERROR_NOT_SUPPORTED if the device does not support this feature
- NVML_ERROR_GPU_IS_LOST if the target GPU has fallen off the bus or is otherwise inaccessible
- NVML_ERROR_UNKNOWN on any unexpected error

Description: Retrieves the power management limit associated with this device.

For Fermi or newer fully supported devices.

The power limit defines the upper boundary for the card's power draw. If the card's total power draw reaches this limit the power management algorithm kicks in.

This reading is only available if power management mode is supported. See nvmlDeviceGetPowerManagementMode.

The power limit API may return NVML_ERROR_NOT_SUPPORTED if the power management feature is not available.
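
To check the same thing from Python, here is a minimal sketch using plain pynvml (mirroring the nvmlDeviceGetPowerManagementMode check mentioned above, not nvitop's own code path):

import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        # NVML reports power management mode as enabled (1) or disabled (0).
        mode = pynvml.nvmlDeviceGetPowerManagementMode(handle)
        print('power management:', 'Enabled' if mode else 'Disabled')
        # Power values are reported in milliwatts (mW).
        print('power limit (W):', pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0)
    except pynvml.NVMLError_NotSupported:
        # Same condition as the traceback above: the device does not expose
        # the power management feature, so nvitop falls back to 'N/A'.
        print('power management is not supported on this device')
finally:
    pynvml.nvmlShutdown()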

You can try the following command in your console:

$ nvidia-smi --query-gpu=index,power.management,power.draw,power.limit --format=csv
index, power.management, power.draw [W], power.limit [W]
0, Enabled, 59.40 W, 400.00 W
1, Enabled, 61.41 W, 400.00 W
2, Enabled, 63.99 W, 400.00 W
3, Enabled, 60.90 W, 400.00 W
4, Enabled, 63.85 W, 400.00 W
5, Enabled, 61.56 W, 400.00 W
6, Enabled, 62.22 W, 400.00 W
7, Enabled, 59.93 W, 400.00 W

liblaf commented 1 year ago


Indeed:

$ nvidia-smi --query-gpu=index,power.management,power.draw,power.limit --format=csv
index, power.management, power.draw [W], power.limit [W]
0, [N/A], 2.55 W, [N/A]