rocm-smi --showpids reports the number of GPUs used by each process.
However, the presentation makes it easy to assume that the GPU(s) column shows which GPUs are used.
This confuses the users of our application, who conclude that all the processes are running on the same GPU:
$ rocm-smi --showpids
========================= ROCm System Management Interface =========================
================================== KFD Processes ===================================
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
KFD process information
PID      PROCESS NAME   GPU(s)   VRAM USED   SDMA USED   CU OCCUPANCY
55573    gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
55571    gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
55574    gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
55572    gmx_mpi        1        UNKNOWN     UNKNOWN     UNKNOWN
====================================================================================
=============================== End of ROCm SMI Log ================================
Compare this with how nvidia-smi reports the same kind of information, listing each GPU index a process uses on its own row:
$ nvidia-smi
[......]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    211667      C   gmx                               320MiB |
|    1   N/A  N/A    211667      C   gmx                               148MiB |
+-----------------------------------------------------------------------------+
It would be better if the rocm-smi --showpids output made it clear that it reports the number of GPUs used, not their indices.
The help output is also unclear about the difference between the two options:
--showpids                                      Show current running KFD PIDs
--showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]]   Show GPUs used by specified KFD PIDs (all if no arg given)
With an old kernel, when rocm-smi cannot get the information, the output is even more confusing: instead of N/A, it reports 0, which can be read either as GPU #0 or as no GPUs being used. Neither reading is correct!
$ rocm-smi --showpids
======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
Not supported on the given system
Not supported on the given system
Not supported on the given system
Not supported on the given system
KFD process information:
PID      PROCESS NAME   GPU(s)   VRAM USED   SDMA USED   CU OCCUPANCY
129835   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
129836   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
129834   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
129837   gmx_mpi        0        UNKNOWN     UNKNOWN     UNKNOWN
================================================================================
============================= End of ROCm SMI Log ==============================
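To illustrate the kind of presentation that would avoid the confusion, here is a minimal Python sketch that post-processes the text output of rocm-smi --showpids. It is not part of rocm_smi_lib: parse_showpids and describe are hypothetical helper names, the parser only understands the table layout pasted above, it assumes process names contain no spaces, and it assumes that a count of 0 printed alongside "Not supported on the given system" means "unknown" rather than "no GPUs".

```python
import re

# One table row: PID, process name, GPU count, then three more columns.
# Assumption: the process name contains no whitespace.
ROW = re.compile(r"^(\d+)\s+(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)$")

UNSUPPORTED_MARKER = "Not supported on the given system"


def parse_showpids(text):
    """Parse `rocm-smi --showpids` text output.

    Returns (rows, unsupported), where rows is a list of dicts and
    unsupported is True if any "Not supported" message was printed.
    """
    rows = []
    unsupported = False
    for line in text.splitlines():
        line = line.strip()
        if UNSUPPORTED_MARKER in line:
            unsupported = True
            continue
        match = ROW.match(line)
        if match:
            pid, name, gpu_count = match.group(1, 2, 3)
            rows.append(
                {"pid": int(pid), "name": name, "gpu_count": int(gpu_count)}
            )
    return rows, unsupported


def describe(rows, unsupported):
    """Render each row so the value cannot be mistaken for a GPU index."""
    lines = []
    for row in rows:
        # Assumption: 0 plus a "Not supported" message means the count
        # could not be queried, so show N/A instead of a misleading 0.
        if unsupported and row["gpu_count"] == 0:
            count = "N/A"
        else:
            count = str(row["gpu_count"])
        lines.append(
            f"PID {row['pid']} ({row['name']}): number of GPUs used = {count}"
        )
    return lines
```

Fed the old-kernel output above, each process would be shown with "number of GPUs used = N/A" instead of a bare 0. Fixing the column header and the fallback value in rocm-smi itself would still be the preferable solution.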
Operating System
SLES 15
GPU
MI250X
ROCm Component
rocm_smi_lib