ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md

Error to read memory #98

Closed: mauvray closed this issue 3 years ago

mauvray commented 3 years ago

When I run rocm-smi without arguments, I get a value for memory usage, but when I run it with --showmemuse I get an error. I am using ROCm 4.0.0 on Linux with the latest kernel and the ROCK module loaded. EDIT: This happens on both a Ryzen 5 3500U and a Vega 56.

kentrussell commented 3 years ago

What error are you seeing when you use that flag?

mauvray commented 3 years ago

Both systems with an older version:

$ rocm-smi --showmemuse
>ERROR: GPU[0]          : Unable to get GPU memory use.

Both systems with the newer version give me this:

$ rocm_smi.py --showmemuse

>ERROR: 2 GPU[0]: % memory use: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.

RX Vega with the new one:

$ rocm_smi.py
>GPU  Temp   AvgPwr  SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%
>0    17.0c  3.0W    852Mhz  167Mhz  0.0%  auto  165.0W    0%   0%

Ryzen 5 3500U with the new one:

$ rocm_smi.py
>ERROR: 2 GPU[0]: power: RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
>ERROR: 2 GPU[0]:RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
>ERROR: 2 GPU[0]:RSMI_STATUS_NOT_SUPPORTED: This function is not supported in the current environment.
>Traceback (most recent call last):
>  File "/opt/rocm/bin/rocm_smi.py", line 2545, in <module>
>    showAllConcise(deviceList)
>  File "/opt/rocm/bin/rocm_smi.py", line 1064, in showAllConcise
>    (fanLevel, fanSpeed) = getFanSpeed(device)
>  File "/opt/rocm/bin/rocm_smi.py", line 182, in getFanSpeed
>    return (fl, round((float(fl) / float(fm)) * 100, 2))
>ZeroDivisionError: float division by zero

kentrussell commented 3 years ago

Memory utilization isn't available for APUs, since system memory and video memory are the same pool. Memory utilization is only available for systems with dedicated VRAM. It is also not supported on Vega10 due to limitations of the hardware. See: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L2221
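
To illustrate the point above, here is a minimal sketch of how a script could tell "not supported" apart from a real reading. It assumes the driver exposes memory utilization as a `mem_busy_percent` file in the GPU's sysfs device directory and simply omits that file where the feature is unsupported; the helper name `read_mem_busy_percent` and the example path are hypothetical, and this is not the code rocm-smi itself uses.

```python
from pathlib import Path
from typing import Optional


def read_mem_busy_percent(device_dir: str) -> Optional[int]:
    """Return the memory-busy percentage, or None if it is not exposed.

    Assumption: the value lives in a file named 'mem_busy_percent'
    inside the given sysfs device directory.
    """
    attr = Path(device_dir) / "mem_busy_percent"
    try:
        return int(attr.read_text().strip())
    except (OSError, ValueError):
        # Missing or unreadable file, or unexpected contents:
        # treat all of these as "not supported" instead of raising.
        return None


if __name__ == "__main__":
    # Example path only; the card index differs between systems.
    value = read_mem_busy_percent("/sys/class/drm/card0/device")
    print("memory use not supported" if value is None else f"{value}%")
```

Returning `None` instead of raising lets a caller print a placeholder for that column rather than abort the whole report.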

As for the division-by-zero error, that will be fixed in 4.1, which you can see at https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/release/rocm-rel-4.1 . Note that this repository has been deprecated as of 4.1 since the SMI is now using the rocm_smi_lib backend.
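
For readers stuck on 4.0, the traceback shows `getFanSpeed` dividing the current fan level by a maximum of zero, which is what an APU without a controllable fan would produce. The sketch below shows one way to guard that division. It is an illustration only, not the actual 4.1 change; the function name `fan_speed` and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def fan_speed(level: float, max_level: float) -> Tuple[float, Optional[float]]:
    """Return (raw level, percent of maximum).

    The percentage is None when the maximum is zero or negative,
    for example on a device that reports no fan.
    """
    if max_level <= 0:
        # No usable maximum: skip the division instead of crashing.
        return (level, None)
    return (level, round(float(level) / float(max_level) * 100, 2))


if __name__ == "__main__":
    print(fan_speed(128, 255))  # normal case: a percentage is computed
    print(fan_speed(0, 0))      # no fan: percentage is None
```

A caller can then print something like "n/a" in the Fan column when the second element is `None`.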