Closed misos1 closed 5 years ago
Looks like your GPU doesn't support that functionality. What GPU do you have?
With previous versions of ROCm, such as 2.8, almost all of these entries were available:
================================================================================
GPU[0] : Max Graphics Package Power (W): 264.0
GPU[1] : Max Graphics Package Power (W): 220.0
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
GPU[1] :
GPU[1] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[1] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[1] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[1] : 2 POWER_SAVING : 90 60 0 0
GPU[1] : 3 VIDEO : 70 60 0 0
GPU[1] : 4 VR : 70 90 0 0
GPU[1] : 5 COMPUTE : 30 60 0 6
GPU[1] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power (W): 3.0
GPU[1] : Average Graphics Package Power (W): 3.0
================================================================================
================================================================================
GPU[0] : GPU use (%): 0
GPU[1] : GPU use (%): 0
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
ERROR: GPU[1] : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0] : PCIe Replay Count: 0
GPU[1] : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0] : Unique ID: 0215054ab5c808c4
GPU[1] : Unique ID: 0213fbda0ae038a4
================================================================================
================================================================================
GPU[0] : Serial Number: N/A
GPU[1] : Serial Number: N/A
================================================================================
PIDs for KFD processes:
================================================================================
ERROR: GPU[0] : Unable to display PowerPlay table
ERROR: GPU[1] : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[0] : Voltage (mV): 750
GPU[1] : Voltage (mV): 750
================================================================================
================================================================================
================================================================================
==============================End of ROCm SMI Log ==============================
================================================================================
================================================================================
GPU[0] : Temperature (Sensor edge) (C): 28.0
GPU[0] : Temperature (Sensor junction) (C): 28.0
GPU[0] : Temperature (Sensor mem) (C): 27.0
GPU[1] : Temperature (Sensor edge) (C): 26.0
GPU[1] : Temperature (Sensor junction) (C): 27.0
GPU[1] : Temperature (Sensor mem) (C): 25.0
================================================================================
================================================================================
GPU[0] : dcefclk clock level: 0 (600Mhz)
GPU[0] : mclk clock level: 0 (167Mhz)
GPU[0] : pcie clock level: 0 (8.0GT/s, x16)
GPU[0] : sclk clock level: 0 (852Mhz)
GPU[0] : socclk clock level: 0 (600Mhz)
================================================================================
GPU[1] : dcefclk clock level: 0 (600Mhz)
GPU[1] : mclk clock level: 0 (167Mhz)
GPU[1] : pcie clock level: 0 (8.0GT/s, x16)
GPU[1] : sclk clock level: 0 (852Mhz)
GPU[1] : socclk clock level: 0 (600Mhz)
================================================================================
================================================================================
That's definitely concerning then. What GPU have you got? Maybe we hit a regression in the firmware or in the kernel code. Nothing runtime-related could have caused this, since it's all sysfs and kernel. The output of rocm-smi -i should be enough to get me looking into which firmware it could be.
Oh, sorry. Either my installation was somehow corrupted or the machine needed a reboot. I have now reinstalled ROCm, and before rebooting the output looked like what I posted at the beginning. After the reboot the problem disappeared, so probably the firmware or kernel module needed to be loaded. The GPUs are Vega 10 XT and Vega 10 XTX.
GPU[0] : GPU ID: 0x687f
GPU[1] : GPU ID: 0x6863
The only things rocm-smi does not show are these:
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
ERROR: GPU[1] : Unable to get GPU memory use.
================================================================================
================================================================================
ERROR: GPU[0] : Unable to display PowerPlay table
ERROR: GPU[1] : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[0] : Unable to get voltage curve
GPU[1] : Unable to get voltage curve
==============================End of ROCm SMI Log ==============================
But it was also like this before, so my GPUs probably do not support them.
Glad to see that things are working properly. Voltage Curve/PP Table is Vega20 non-server only. GPU Memory Use, I think, is only VG20-and-later as well, since it's not in Vega10's SMU firmware. So that seems to be "functioning as expected".
Is GPU memory use not based on the values shown with --showmeminfo? Because this is a little strange:
$ rocm-smi --showmeminfo all
GPU[0] : vram Total Memory (B): 8573157376
GPU[0] : vram Total Used Memory (B): 140845056
GPU[0] : vis_vram Total Memory (B): 268435456
GPU[0] : vis_vram Total Used Memory (B): 15654912
GPU[0] : gtt Total Memory (B): 67363909632
GPU[0] : gtt Total Used Memory (B): 147021824
GPU[1] : vram Total Memory (B): 17163091968
GPU[1] : vram Total Used Memory (B): 199143424
GPU[1] : vis_vram Total Memory (B): 268435456
GPU[1] : vis_vram Total Used Memory (B): 22470656
GPU[1] : gtt Total Memory (B): 67363909632
GPU[1] : gtt Total Used Memory (B): 26001408
But
$ rocm-smi --showmemuse
ERROR: GPU[0] : Unable to get GPU memory use.
ERROR: GPU[1] : Unable to get GPU memory use.
And the concise output somehow knows VRAM%:
$ rocm-smi
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 71.0c 96.0W 1302Mhz 945Mhz 23.92% auto 264.0W 2% 91%
1 64.0c 10.0W 1269Mhz 945Mhz 16.86% auto 220.0W 0% 0%
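To make the comparison concrete, here is a minimal sketch that derives a VRAM-used percentage from the --showmeminfo lines pasted above. It only assumes the line format shown in this thread; the function name vram_used_percent is made up for illustration, and nothing here claims to be how rocm-smi itself computes the VRAM% column.

```python
import re

# Lines copied from the `rocm-smi --showmeminfo all` output in this thread.
SAMPLE = """\
GPU[0] : vram Total Memory (B): 8573157376
GPU[0] : vram Total Used Memory (B): 140845056
GPU[0] : vis_vram Total Memory (B): 268435456
GPU[1] : vram Total Memory (B): 17163091968
GPU[1] : vram Total Used Memory (B): 199143424
"""

# Matches only the plain "vram" rows, not "vis_vram" or "gtt".
LINE_RE = re.compile(
    r"GPU\[(\d+)\]\s*:\s*vram Total (Used )?Memory \(B\):\s*(\d+)"
)


def vram_used_percent(text):
    """Return {gpu_index: used/total * 100} for every GPU with both rows."""
    totals, used = {}, {}
    for match in LINE_RE.finditer(text):
        gpu = int(match.group(1))
        target = used if match.group(2) else totals
        target[gpu] = int(match.group(3))
    return {g: 100.0 * used[g] / totals[g] for g in totals if g in used}


if __name__ == "__main__":
    for gpu, pct in sorted(vram_used_percent(SAMPLE).items()):
        print(f"GPU[{gpu}] VRAM used: {pct:.2f}%")
```

The meminfo values and the concise table were captured at different moments (GPU 0 was under load in the table), so the percentages are not expected to line up exactly.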
GPU memory use is more accurately described as "GPU memory busy rate". Basically, it polls the GPU X number of times to see if the memory block is in use. If so, it's a 1. If not, it's a 0. Let's say you had 10 polls with five 1s and five 0s; that would be a busy rate (memory utilization rate) of 50%.
If you had a single memory allocation of all of VRAM, then your utilization would be 1%, but the VRAM used would be 100%. I know the wording is confusing, but both metrics have useful applications; it's just not explained too clearly outside of the kernel documentation.
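The difference between the two metrics can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the driver's or firmware's actual sampling code; the function names and the sample values are hypothetical.

```python
def busy_rate_percent(samples):
    """Share of polls in which the memory block was in use.

    `samples` is a sequence of 0/1 values, one per poll
    (1 = memory was busy at that instant, 0 = idle).
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one poll")
    return 100.0 * sum(samples) / len(samples)


def occupancy_percent(used_bytes, total_bytes):
    """Share of VRAM that is currently allocated."""
    return 100.0 * used_bytes / total_bytes


if __name__ == "__main__":
    # Ten polls, five busy and five idle: the 50% example from above.
    polls = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
    print(f"busy rate: {busy_rate_percent(polls):.0f}%")

    # All of VRAM allocated but almost never touched:
    # occupancy is 100% while the busy rate stays low.
    total = 8573157376  # GPU[0] vram total from this thread
    idle_polls = [1] + [0] * 99
    print(f"occupancy: {occupancy_percent(total, total):.0f}%")
    print(f"busy rate: {busy_rate_percent(idle_polls):.0f}%")
```

The first function depends only on how often the memory was being accessed when polled, the second only on how much is allocated, which is why one can be reported while the other is unavailable.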
Also, there is now only one temperature again instead of three as before (junction, ...):