ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md

Unable to get various stats in rocm 2.9 #73

Closed: misos1 closed this issue 5 years ago

misos1 commented 5 years ago
================================================================================
ERROR: GPU[0]       : Unable to get maximum Graphics Package Power
ERROR: GPU[1]       : Unable to get maximum Graphics Package Power
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get Power Profile
ERROR: GPU[1]       : Unable to get Power Profile
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get Average Graphics Package Power Consumption
ERROR: GPU[1]       : Unable to get Average Graphics Package Power Consumption
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU use.
ERROR: GPU[1]       : Unable to get GPU use.
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU memory use.
ERROR: GPU[1]       : Unable to get GPU memory use.
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get PCIe replay count
ERROR: GPU[1]       : Unable to get PCIe replay count
================================================================================
================================================================================
GPU[0]      : Unique ID: N/A
GPU[1]      : Unique ID: N/A
================================================================================
================================================================================
GPU[0]      : Serial Number: N/A
GPU[1]      : Serial Number: N/A
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to display PowerPlay table
ERROR: GPU[1]       : Unable to display PowerPlay table
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to display voltage
ERROR: GPU[1]       : Unable to display voltage
================================================================================
================================================================================
================================================================================
GPU[0]      : Unable to get voltage curve
GPU[1]      : Unable to get voltage curve
==============================End of ROCm SMI Log ==============================

Also, there is now only one temperature reading again, instead of the three shown before (junction, ...):

GPU[0]      : Temperature (Sensor #1) (C): 34.0
GPU[1]      : Temperature (Sensor #1) (C): 27.0

kentrussell commented 5 years ago

Looks like your GPU doesn't support that functionality. What GPU do you have?

misos1 commented 5 years ago

With previous versions of ROCm, such as 2.8, almost all of these entries were available:

================================================================================
GPU[0]      : Max Graphics Package Power (W): 264.0
GPU[1]      : Max Graphics Package Power (W): 220.0
================================================================================
================================================================================
GPU[0]      : 
GPU[0]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[0]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[0]      :   2   POWER_SAVING :             90  60          0              0
GPU[0]      :   3          VIDEO :             70  60          0              0
GPU[0]      :   4             VR :             70  90          0              0
GPU[0]      :   5        COMPUTE :             30  60          0              6
GPU[0]      :   6         CUSTOM :              0   0          0              0
GPU[1]      : 
GPU[1]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[1]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[1]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[1]      :   2   POWER_SAVING :             90  60          0              0
GPU[1]      :   3          VIDEO :             70  60          0              0
GPU[1]      :   4             VR :             70  90          0              0
GPU[1]      :   5        COMPUTE :             30  60          0              6
GPU[1]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0]      : Average Graphics Package Power (W): 3.0
GPU[1]      : Average Graphics Package Power (W): 3.0
================================================================================
================================================================================
GPU[0]      : GPU use (%): 0
GPU[1]      : GPU use (%): 0
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU memory use.
ERROR: GPU[1]       : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0]      : PCIe Replay Count: 0
GPU[1]      : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0]      : Unique ID: 0215054ab5c808c4
GPU[1]      : Unique ID: 0213fbda0ae038a4
================================================================================
================================================================================
GPU[0]      : Serial Number: N/A
GPU[1]      : Serial Number: N/A
================================================================================
PIDs for KFD processes:

================================================================================
ERROR: GPU[0]       : Unable to display PowerPlay table
ERROR: GPU[1]       : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[0]      : Voltage (mV): 750
GPU[1]      : Voltage (mV): 750
================================================================================
================================================================================
================================================================================
==============================End of ROCm SMI Log ==============================
================================================================================
================================================================================
GPU[0]      : Temperature (Sensor edge) (C): 28.0
GPU[0]      : Temperature (Sensor junction) (C): 28.0
GPU[0]      : Temperature (Sensor mem) (C): 27.0
GPU[1]      : Temperature (Sensor edge) (C): 26.0
GPU[1]      : Temperature (Sensor junction) (C): 27.0
GPU[1]      : Temperature (Sensor mem) (C): 25.0
================================================================================
================================================================================
GPU[0]      : dcefclk clock level: 0 (600Mhz)
GPU[0]      : mclk clock level: 0 (167Mhz)
GPU[0]      : pcie clock level: 0 (8.0GT/s, x16)
GPU[0]      : sclk clock level: 0 (852Mhz)
GPU[0]      : socclk clock level: 0 (600Mhz)
================================================================================
GPU[1]      : dcefclk clock level: 0 (600Mhz)
GPU[1]      : mclk clock level: 0 (167Mhz)
GPU[1]      : pcie clock level: 0 (8.0GT/s, x16)
GPU[1]      : sclk clock level: 0 (852Mhz)
GPU[1]      : socclk clock level: 0 (600Mhz)
================================================================================
================================================================================

kentrussell commented 5 years ago

That's definitely concerning, then. What GPU have you got? We may have hit a regression in the firmware or in the kernel code; nothing runtime-related could have caused this, since it's all sysfs and kernel. The output of rocm-smi -i should be enough for me to start looking at which firmware could be involved.
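
Since these stats come from sysfs, one way to narrow things down is to check which attribute files the kernel exposes for each card. The following is a rough Python sketch, not what rocm-smi itself does: the attribute names in ATTRIBUTES are assumed from the amdgpu kernel documentation rather than taken from this thread, not every GPU, firmware or kernel version provides all of them, and probe and main are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumed amdgpu sysfs attribute names (not confirmed in this thread).
ATTRIBUTES = (
    "gpu_busy_percent",
    "mem_busy_percent",
    "mem_info_vram_total",
    "mem_info_vram_used",
    "pcie_replay_count",
    "pp_od_clk_voltage",
    "unique_id",
)


def probe(device_dir, names=ATTRIBUTES):
    """Return {name: text or None}; None means the file is missing or unreadable."""
    result = {}
    for name in names:
        try:
            result[name] = (Path(device_dir) / name).read_text().strip()
        except (OSError, UnicodeDecodeError):
            result[name] = None
    return result


def main(root="/sys/class/drm"):
    root = Path(root)
    if not root.is_dir():
        print(f"{root} not found")
        return
    for card in sorted(root.iterdir()):
        # Keep card0, card1, ... and skip connector entries such as card0-DP-1.
        if not re.fullmatch(r"card\d+", card.name):
            continue
        print(card.name)
        for name, value in probe(card / "device").items():
            shown = "missing or unreadable" if value is None else value
            print(f"  {name}: {shown}")


if __name__ == "__main__":
    main()
```

A file that is missing both before and after a reboot points towards the GPU or firmware not supporting the stat; a file that only appears after a reboot points towards the driver or firmware not having been loaded properly.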

misos1 commented 5 years ago

Oh, sorry: either my installation was somehow corrupted or the machine needed a reboot. I have now reinstalled ROCm, and before rebooting the output looked like what I posted at the beginning, but after the reboot the problem disappeared. Probably the firmware or kernel module needed to be loaded. The GPUs are Vega 10 XT and Vega 10 XTX.

GPU[0]      : GPU ID: 0x687f
GPU[1]      : GPU ID: 0x6863

The only things rocm-smi still does not show are these:

================================================================================
ERROR: GPU[0]       : Unable to get GPU memory use.
ERROR: GPU[1]       : Unable to get GPU memory use.
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to display PowerPlay table
ERROR: GPU[1]       : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[0]      : Unable to get voltage curve
GPU[1]      : Unable to get voltage curve
==============================End of ROCm SMI Log ==============================

But it was like this before as well, so my GPUs probably do not support them.

kentrussell commented 5 years ago

Glad to see that things are working properly. Voltage Curve/PP Table is supported only on non-server Vega20 parts. I think GPU Memory Use is Vega20-and-later only as well, since it's not in Vega10's SMU firmware. So that seems to be "functioning as expected".

misos1 commented 4 years ago

Is GPU memory use not based on the values shown by --showmeminfo? Because this is a little strange:

$ rocm-smi --showmeminfo all
GPU[0]      : vram Total Memory (B): 8573157376
GPU[0]      : vram Total Used Memory (B): 140845056
GPU[0]      : vis_vram Total Memory (B): 268435456
GPU[0]      : vis_vram Total Used Memory (B): 15654912
GPU[0]      : gtt Total Memory (B): 67363909632
GPU[0]      : gtt Total Used Memory (B): 147021824
GPU[1]      : vram Total Memory (B): 17163091968
GPU[1]      : vram Total Used Memory (B): 199143424
GPU[1]      : vis_vram Total Memory (B): 268435456
GPU[1]      : vis_vram Total Used Memory (B): 22470656
GPU[1]      : gtt Total Memory (B): 67363909632
GPU[1]      : gtt Total Used Memory (B): 26001408

But

$ rocm-smi --showmemuse
ERROR: GPU[0]       : Unable to get GPU memory use.
ERROR: GPU[1]       : Unable to get GPU memory use.

And the concise output somehow knows VRAM%:

$ rocm-smi
GPU  Temp   AvgPwr  SCLK     MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
0    71.0c  96.0W   1302Mhz  945Mhz  23.92%  auto  264.0W    2%   91%   
1    64.0c  10.0W   1269Mhz  945Mhz  16.86%  auto  220.0W    0%   0%    
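
For reference, a percentage of VRAM in use can be worked out from the --showmeminfo figures directly. This is a minimal Python sketch, assuming the line layout matches the output pasted above (other rocm-smi versions may print it differently); parse_meminfo and percent_used are hypothetical helper names, and whether the concise VRAM% column is computed this way is not confirmed in this thread.

```python
import re
from collections import defaultdict

# Matches lines such as "GPU[0]      : vram Total Used Memory (B): 140845056".
LINE_RE = re.compile(
    r"^GPU\[(\d+)\]\s*:\s*(\w+) Total (Used )?Memory \(B\):\s*(\d+)$"
)

SAMPLE = """\
GPU[0]      : vram Total Memory (B): 8573157376
GPU[0]      : vram Total Used Memory (B): 140845056
GPU[0]      : vis_vram Total Memory (B): 268435456
GPU[0]      : vis_vram Total Used Memory (B): 15654912
"""


def parse_meminfo(text):
    """Map (gpu_index, pool_name) to {"total": bytes, "used": bytes}."""
    info = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore separators, errors and unrelated lines
        gpu, pool, used, value = match.groups()
        info[(int(gpu), pool)]["used" if used else "total"] = int(value)
    return dict(info)


def percent_used(info):
    """Percentage of each memory pool that is currently allocated."""
    return {
        key: 100.0 * sizes["used"] / sizes["total"]
        for key, sizes in info.items()
        if sizes.get("total") and "used" in sizes
    }


if __name__ == "__main__":
    for (gpu, pool), pct in sorted(percent_used(parse_meminfo(SAMPLE)).items()):
        print(f"GPU[{gpu}] {pool}: {pct:.2f}% used")
```

Note that the --showmeminfo figures and the concise table above were captured at different moments, so the two cannot be compared number for number.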

kentrussell commented 4 years ago

GPU memory use is more accurately described as "GPU memory busy rate". Basically, it polls the GPU some number of times to see whether the memory block is in use. If so, that poll counts as a 1; if not, as a 0. Say you had 10 polls with five 1s and five 0s: that would be a busy rate (memory utilization rate) of 50%.

If you had a single memory allocation of all of VRAM, then your utilization would be 1%, but the VRAM used would be 100%. I know, the wording is confusing, but both metrics have useful applications; it's just not explained too clearly outside of the kernel documentation.
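
The difference between the two metrics can be illustrated with a small Python sketch. It only mirrors the arithmetic described above, not the actual firmware or kernel sampling; busy_rate and vram_used_percent are hypothetical names, and the poll sequences are made up for illustration.

```python
def busy_rate(polls):
    """Percentage of polls in which the memory block was busy (1) rather than idle (0)."""
    if not polls:
        raise ValueError("need at least one poll")
    return 100.0 * sum(polls) / len(polls)


def vram_used_percent(used_bytes, total_bytes):
    """Percentage of VRAM that is currently allocated."""
    return 100.0 * used_bytes / total_bytes


# Ten polls, five busy and five idle: a busy rate of 50%.
print(busy_rate([1, 0, 1, 0, 1, 0, 1, 0, 1, 0]))

# All of VRAM allocated once and then left alone: fully used, rarely busy.
total = 8573157376  # GPU[0] vram total from the --showmeminfo output in this thread
print(vram_used_percent(total, total))
print(busy_rate([1] + [0] * 99))
```

The first metric says how often the memory is being accessed, the second how much of it is occupied, so neither can be derived from the other.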