ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

[Feature]: Better headers in --showpids #166

Open al42and opened 1 month ago

al42and commented 1 month ago

Suggestion Description

rocm-smi --showpids reports, in its GPU(s) column, the number of GPUs used by each process. However, the presentation makes it easy to assume that the column shows which GPUs are used.

Users of our application are getting confused by this and conclude that all the processes run on the same GPU:

$ rocm-smi --showpids

========================= ROCm System Management Interface =========================
================================== KFD Processes ===================================
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
KFD process information
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
55573   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55571   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55574   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55572   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
====================================================================================
=============================== End of ROCm SMI Log ================================

Compare this with how nvidia-smi reports similar information:

$ nvidia-smi 
[......]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    211667      C   gmx                               320MiB |
|    1   N/A  N/A    211667      C   gmx                               148MiB |
+-----------------------------------------------------------------------------+

It would be better if the rocm-smi --showpids output made it clearer that it reports the number of GPUs used, not their indices.

The help output is also unclear about the difference between the two options:

  --showpids                                                       Show current running KFD PIDs
  --showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]]                    Show GPUs used by specified KFD PIDs (all if no arg
                                                                   given)
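
For illustration, help text along these lines would make the distinction explicit. This is only a sketch using Python's argparse, not the actual rocm-smi source: the program name `rocm-smi-sketch`, the `PID` metavar, and the help wording are my own suggestions, and it assumes that --showpidgpus does list which GPUs each PID uses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two existing options and reworded help text.

    Hypothetical sketch: only the flag names come from rocm-smi;
    the help strings and metavar are suggestions.
    """
    parser = argparse.ArgumentParser(prog="rocm-smi-sketch")
    parser.add_argument(
        "--showpids",
        action="store_true",
        help="List running KFD PIDs with the NUMBER of GPUs each one uses",
    )
    parser.add_argument(
        "--showpidgpus",
        nargs="*",  # zero or more PIDs, matching the existing usage string
        metavar="PID",
        help="List the INDICES of the GPUs used by the given KFD PIDs "
             "(all PIDs if none given)",
    )
    return parser


if __name__ == "__main__":
    build_parser().print_help()
```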

With an old kernel, when rocm-smi cannot get the information, it is even more confusing: instead of N/A, it reports 0, which can be interpreted either as GPU #0 or as no GPUs being used. Neither interpretation is correct!

$ rocm-smi  --showpids

======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
Not supported on the given system
Not supported on the given system
Not supported on the given system
Not supported on the given system
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
129835  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129836  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129834  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129837  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
================================================================================
============================= End of ROCm SMI Log ==============================
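
To make the two requests concrete (a header that names the column as a count, and N/A rather than 0 when the value is unavailable), here is a minimal sketch of how such a table could be rendered. It is not taken from rocm-smi: the function `format_process_table`, the `GPU COUNT` header, and the use of `None` to mean "not available" are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical header: "GPU COUNT" states that the column is a count, not an index.
HEADER = ("PID", "PROCESS NAME", "GPU COUNT", "VRAM USED")

# One row: (pid, process name, number of GPUs or None, VRAM bytes or None).
Row = Tuple[int, str, Optional[int], Optional[int]]


def format_process_table(rows: Sequence[Row]) -> str:
    """Render the rows as left-aligned text columns; None is shown as N/A."""

    def cell(value: object) -> str:
        # Unknown values are printed as N/A so they cannot be mistaken for 0.
        return "N/A" if value is None else str(value)

    table = [HEADER] + [tuple(cell(v) for v in row) for row in rows]
    # Each column is as wide as its longest entry, header included.
    widths = [max(len(r[i]) for r in table) for i in range(len(HEADER))]
    lines = [
        "  ".join(c.ljust(w) for c, w in zip(r, widths)).rstrip()
        for r in table
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # First row: count known; second row: old kernel, count unavailable.
    print(format_process_table([
        (55573, "gmx_mpi", 1, None),
        (129835, "gmx_mpi", None, None),
    ]))
```

With output shaped like this, a 1 under GPU COUNT is harder to misread as "GPU #1", and a missing value no longer looks like either "GPU #0" or "no GPUs".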

Operating System

SLES 15

GPU

MI250X

ROCm Component

rocm_smi_lib