aristocratos / btop

A monitor of resources
Apache License 2.0

[BUG] GPU usage value is always 100% #794

Closed davc0n closed 7 months ago

davc0n commented 8 months ago

Hello,

I'm using Arch Linux and installed btop and rocm-smi-lib (6.0.0) from the official repositories. Unfortunately, GPU monitoring does not work correctly: the reported usage value is always 100%, which I believe is wrong.

2024/03/11 (08:14:08) | ===> btop++ v.1.3.2
2024/03/11 (08:14:08) | DEBUG: Running in DEBUG mode!
2024/03/11 (08:14:08) | INFO: Logger set to DEBUG
2024/03/11 (08:14:08) | DEBUG: Using locale en_US.UTF-8
2024/03/11 (08:14:08) | INFO: Running on /dev/pts/0
2024/03/11 (08:14:08) | INFO: Failed to load libnvidia-ml.so, NVIDIA GPUs will not be detected: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2024/03/11 (08:14:08) | WARNING: ROCm SMI: Failed to get device name
2024/03/11 (08:14:08) | WARNING: ROCm SMI: Failed to get maximum GPU power draw, defaulting to 225W
2024/03/11 (08:14:08) | WARNING: ROCm SMI: Failed to get maximum GPU temperature, defaulting to 110°C
2024/03/11 (08:14:08) | WARNING: ROCm SMI: Failed to get VRAM utilization
2024/03/11 (08:14:08) | WARNING: ROCm SMI: Failed to get GPU power usage
2024/03/11 (08:14:08) | WARNING: ROCm SMI: Failed to get PCIe throughput
2024/03/11 (08:14:08) | DEBUG: Shared::init() : Initialized.
2024/03/11 (08:14:14) | INFO: Quitting! Runtime: 00:00:06
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series] (rev d2) (prog-if 00 [VGA controller])
    Subsystem: Lenovo Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series]
    Flags: bus master, fast devsel, latency 0, IRQ 38, IOMMU group 9
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Memory at d0000000 (64-bit, prefetchable) [size=2M]
    I/O ports at 1000 [size=256]
    Memory at d0500000 (32-bit, non-prefetchable) [size=512K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [64] Express Legacy Endpoint, IntMsgNum 0
    Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
    Capabilities: [c0] MSI-X: Enable+ Count=3 Masked-
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [200] Physical Resizable BAR
    Capabilities: [270] Secondary PCI Express
    Capabilities: [2a0] Access Control Services
    Capabilities: [2b0] Address Translation Service (ATS)
    Capabilities: [2c0] Page Request Interface (PRI)
    Capabilities: [2d0] Process Address Space ID (PASID)
    Capabilities: [320] Latency Tolerance Reporting
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

Any help?

Please ask if you need more information.

davc0n commented 8 months ago

After a quick look at the code, I found the reason for the first warning (ROCm SMI: Failed to get device name):

char name[NVML_DEVICE_NAME_BUFFER_SIZE]; // ROCm SMI does not provide a constant for this as far as I can tell, this should be good enough
result = rsmi_dev_name_get(i, name, NVML_DEVICE_NAME_BUFFER_SIZE);
if (result != RSMI_STATUS_SUCCESS)
        Logger::warning("ROCm SMI: Failed to get device name");

The buffer size currently used is 64, which is apparently not enough: the result is "RSMI_STATUS_INSUFFICIENT_SIZE". The device name is retrieved correctly with a size of 128.
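To illustrate one way a fixed buffer size could be avoided, here is a minimal sketch of a retry loop that doubles the buffer whenever the call reports that it is too small. It does not call ROCm SMI: `Status`, `NameGetter`, `get_name_with_retry` and `make_fake_getter` are hypothetical names made up for this example, and the fake getter only mimics the "insufficient size" behaviour described above. This is not btop's actual fix.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical status codes standing in for "success" and
// "insufficient size"; this is NOT the real ROCm SMI enum.
enum class Status { Success, InsufficientSize, Error };

// Shape of a C-style "fill this buffer" call (assumption for illustration).
using NameGetter = std::function<Status(char* buf, std::size_t len)>;

// Call `get` with a buffer that doubles in size until the call stops
// reporting InsufficientSize or the buffer would exceed `max_size`.
std::optional<std::string> get_name_with_retry(const NameGetter& get,
                                               std::size_t initial_size = 64,
                                               std::size_t max_size = 1024) {
    if (initial_size == 0) initial_size = 1;  // avoid an endless 0 * 2 loop
    for (std::size_t size = initial_size; size <= max_size; size *= 2) {
        std::vector<char> buf(size, '\0');
        const Status st = get(buf.data(), buf.size());
        if (st == Status::Success) return std::string(buf.data());
        if (st != Status::InsufficientSize) return std::nullopt;  // other error
    }
    return std::nullopt;  // name does not fit within max_size
}

// Hypothetical stand-in for a driver call: copies `name` into the buffer and
// reports InsufficientSize when the buffer cannot hold it plus the terminator.
NameGetter make_fake_getter(std::string name) {
    return [name](char* buf, std::size_t len) {
        if (len < name.size() + 1) return Status::InsufficientSize;
        std::memcpy(buf, name.c_str(), name.size() + 1);
        return Status::Success;
    };
}
```

With a 100-character name, the first attempt at 64 bytes fails and the second attempt at 128 bytes succeeds, which matches the behaviour observed above. The upper bound keeps a misbehaving call from growing the buffer indefinitely.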

For the other warnings, the result is RSMI_STATUS_NOT_SUPPORTED instead (I guess there's nothing we can do here).

EDIT: The issue title has been changed. I was convinced that the problem was caused by the device not being recognized (because of the first warning), but I was wrong.

EDIT#2:

if (gpus_slice[i].supported_functions.gpu_utilization) {
    uint32_t utilization;
    result = rsmi_dev_busy_percent_get(i, &utilization);

The value of utilization is always 100, so I guess the issue is related to rsmi. Is there anything we can do?

Umio-Yasuno commented 8 months ago

Some Raven/Picasso/Raven2 APUs always report gpu_busy_percent as 100.

https://gitlab.freedesktop.org/drm/amd/-/issues/1932
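
For anyone who wants to check this outside btop, here is a small sketch that reads the value the amdgpu driver exposes through sysfs. The path used in the comment (`/sys/class/drm/card0/device/gpu_busy_percent`) assumes the GPU is `card0`, which may differ per system, and the helper names `read_busy_percent` and `all_samples_at_100` are made up for this example. A run of 100 readings does not prove the counter is broken, since a genuinely saturated GPU looks the same; it is only a hint when the machine is known to be idle.

```cpp
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Read an integer percentage from a sysfs-style text file.
// Returns std::nullopt if the file is missing, unreadable, or out of range.
// Example path (assumption: the GPU is card0):
//   /sys/class/drm/card0/device/gpu_busy_percent
std::optional<int> read_busy_percent(const std::string& path) {
    std::ifstream file(path);
    int value = 0;
    if (!(file >> value)) return std::nullopt;
    if (value < 0 || value > 100) return std::nullopt;
    return value;
}

// True when there is at least one sample and every sample is exactly 100.
// Only meaningful if the samples were taken while the GPU should be idle.
bool all_samples_at_100(const std::vector<int>& samples) {
    if (samples.empty()) return false;
    for (int s : samples) {
        if (s != 100) return false;
    }
    return true;
}
```

If the file itself reports 100 on an idle desktop, the value is already wrong at the driver level and btop is only passing it through.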