ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md
Cannot modify the GPU operating voltage using rocm-smi #52

Closed sabbaghm closed 5 years ago

sabbaghm commented 5 years ago

I successfully installed the latest version of ROCm from the AMD repos on Ubuntu 16.04 (kernel 4.15) and wanted to use rocm-smi to change the clock and voltage of my RX 580 (Polaris) GPU, but it cannot modify the voltage. The error says:

Unable to write to sysfs file /sys/class/drm/card0/device/pp_od_clk_voltage
GPU[0] : unable to set gpu clock to Level s 5 1400 980

I am using the following command even with root privileges: rocm-smi --setslevel 5 1400 980

I also tried adding amdgpu.ppfeaturemask=0xffffffff to the GRUB config, but the same error pops up.

I could successfully change the clock (and observe the change) by overclocking with "rocm-smi --setoverdrive 20", but not the voltage (or the clock-voltage combination).

My target hardware:

Motherboard: Dell Precision
CPU: Intel Xeon E5-1603 v4
GPU: AMD RX 580 (Polaris), OC enabled

Any ideas on what could be the issue and how to resolve it?

Thank you.

jlgreathouse commented 5 years ago

Could you please show the OD table output of your card? e.g. by running rocm-smi --showclkvolt

I suspect that you are trying to set SCLK level 5 to an invalid value. For instance, please see this post for some of the rules that you need to follow when setting new DPM configurations. You may be trying to set DPM SCLK level 5 to a higher frequency or voltage than the existing SCLK level 6, or perhaps setting the voltage below what you have in SCLK level 4.
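The ordering constraint described above can be sketched as a small check. This is only a sketch under stated assumptions: the rule encoded (a level's frequency and voltage must lie between those of its neighbouring levels) is a reading of this comment, not the full rule set from the linked post, and the input is a hypothetical plain-text table with one "level freq_mhz volt_mv" row per line, not the raw pp_od_clk_voltage format.

```shell
#!/usr/bin/env bash
# check_dpm_level TABLE LEVEL FREQ_MHZ VOLT_MV
# TABLE: hypothetical file with rows "level freq_mhz volt_mv" (an assumption,
# not the sysfs format). Prints "ok" or a reason; returns 0 if ok, 1 otherwise.
check_dpm_level() {
    local table=$1 level=$2 freq=$3 volt=$4
    awk -v lvl="$level" -v f="$freq" -v v="$volt" '
        NF >= 3 { freqs[$1] = $2 + 0; volts[$1] = $3 + 0 }
        END {
            lo = lvl - 1; hi = lvl + 1
            # Must not drop below the next-lower level, if one exists.
            if ((lo in freqs) && (f + 0 < freqs[lo] || v + 0 < volts[lo])) {
                print "below level " lo; exit 1
            }
            # Must not exceed the next-higher level, if one exists.
            if ((hi in freqs) && (f + 0 > freqs[hi] || v + 0 > volts[hi])) {
                print "above level " hi; exit 1
            }
            print "ok"
        }' "$table"
}
```

For example, with neighbouring rows "4 1200 950" and "6 1400 1050" (made-up values), a request for level 5 at 1450 MHz would be reported as above level 6.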

To verify, could you also run dmesg | grep ppfeaturemask? I'll note that you must have the OD bit enabled in your ppfeaturemask, and this must be set at boot time (e.g. you must run update-grub if you are setting this in /etc/default/grub).
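A minimal sketch of testing the OD bit in a mask value follows. It assumes the overdrive bit is 0x4000 (PP_OVERDRIVE_MASK in the kernel's amd_shared.h) and that the active mask can be read from /sys/module/amdgpu/parameters/ppfeaturemask when the amdgpu module is loaded; check both against your kernel version.

```shell
#!/usr/bin/env bash
# Overdrive bit in amdgpu's ppfeaturemask (assumed: PP_OVERDRIVE_MASK = 0x4000).
OD_BIT=0x4000

# od_bit_enabled MASK
# Returns 0 if the overdrive bit is set in MASK (hex like 0xffffffff, or decimal).
od_bit_enabled() {
    local mask=$1
    (( (mask & OD_BIT) != 0 ))
}

# Example use on a live system (sysfs path is an assumption, see above):
#   mask=$(cat /sys/module/amdgpu/parameters/ppfeaturemask)
#   if od_bit_enabled "$mask"; then echo "OD bit set"; else echo "OD bit clear"; fi
```

The function only inspects the value it is given, so it can also be used on a mask before it is written into the GRUB config.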

sabbaghm commented 5 years ago

Thank you very much @jlgreathouse

I had only missed running update-grub after adding amdgpu.ppfeaturemask=0xffffffff to /etc/default/grub. That was the issue. Now I can change the clock as well as the voltage using rocm-smi --setslevel.
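The failure mode here (parameter edited into the GRUB config but never applied) can be detected with a sketch like the one below. The function name and its three output words are made up for illustration; it only checks whether the parameter is present in each file and does not compare the mask values.

```shell
#!/usr/bin/env bash
# param_applied GRUB_FILE CMDLINE_FILE   (hypothetical helper)
# Prints "applied" if the booted command line carries amdgpu.ppfeaturemask,
# "pending" if it is only in the GRUB config, "absent" if it is in neither.
param_applied() {
    local grub_file=$1 cmdline_file=$2
    local pattern='amdgpu\.ppfeaturemask=[0-9a-fA-Fx]+'
    if grep -Eq "$pattern" "$cmdline_file"; then
        echo applied
    elif grep -Eq "^[[:space:]]*GRUB_CMDLINE_LINUX[A-Z_]*=.*$pattern" "$grub_file"; then
        echo pending
    else
        echo absent
    fi
}

# Live use: param_applied /etc/default/grub /proc/cmdline
# "pending" means: run `sudo update-grub`, then reboot.
```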

dmesg | grep ppfeaturemask shows the mask is activated.

Note that I could change the frequency and voltage even without following the DPM setting rules. I verified this using rocm-smi --showclkvolt, but I am not sure whether the change is actually reflected in the hardware, or whether it is stable.

jlgreathouse commented 5 years ago

Based on your description, it looks like we found the initial problem. I'll go ahead and close the issue. Please reopen if you believe the problem is not solved. :)

I'll note that if you try to set an invalid DPM table entry, the driver will accept the request, but the underlying hardware will not. I've verified this with benchmarking while writing the post I linked to. Please make sure you follow those rules. :)

sabbaghm commented 4 years ago

@jlgreathouse Hi, I am back, reporting a similar issue but on a much newer machine with an RX 5500 XT (Navi 14) card.

System spec:

OS: Ubuntu 20.04.1 LTS with the 5.4.0-52-generic kernel (also tried 18.04 with the generic kernel)
ROCm: 3.8 (installed via the official apt repo)

I cannot use --setsclk or --setvc to change the clock/voltage of the GPU. It outputs:

Unable to write to sysfs file /sys/class/drm/card0/device/pp_od_clk_voltage
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set voltage point 0 to 500(MHz) 700(mV)
WARNING: One or more commands failed
==============================End of ROCm SMI Log ==============================

I also tried writing directly to /sys/class/drm/card0/device/pp_od_clk_voltage (as root). It says: bash: echo: write error: Invalid argument

Note, I see amdgpu.ppfeaturemask=0xffffffff in dmesg.

It is probably not needed for Navi 14 (gfx1012), but my system does support PCIe atomics too.

Here is the output of cat on pp_od_clk_voltage:

cat  /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 1850Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 500MHz 705mV
1: 1175MHz 789mV
2: 1850MHz 1105mV
OD_RANGE:
SCLK:     800Mhz       2200Mhz
MCLK:     625Mhz        930Mhz
VDDC_CURVE_SCLK[0]:     800Mhz       2200Mhz
VDDC_CURVE_VOLT[0]:     700mV        1150mV
VDDC_CURVE_SCLK[1]:     800Mhz       2200Mhz
VDDC_CURVE_VOLT[1]:     700mV        1150mV
VDDC_CURVE_SCLK[2]:     800Mhz       2200Mhz
VDDC_CURVE_VOLT[2]:     700mV        1150mV

Could you help me identify the problem and resolve this issue with the new model? I need to have (ROCm) scripts on Ubuntu to overdrive the GPU as before. Thanks
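One thing the dump above allows checking is whether a requested voltage-curve point falls inside the OD_RANGE limits. The sketch below parses the VDDC_CURVE_SCLK[n] and VDDC_CURVE_VOLT[n] rows from a saved copy of that output; the function name is hypothetical and the parsing assumes exactly the layout shown in this thread. Against this dump, point 0 at 500 MHz falls below the listed 800Mhz minimum, although the thread does not confirm that this is what triggers the Invalid argument error.

```shell
#!/usr/bin/env bash
# check_vc_point TABLE_FILE POINT FREQ_MHZ VOLT_MV   (hypothetical helper)
# TABLE_FILE: saved output of `cat pp_od_clk_voltage` in the layout shown above.
# Prints "ok" or one line per violated limit; returns 0 ok, 1 out of range,
# 2 if no range rows exist for POINT.
check_vc_point() {
    local file=$1 point=$2 freq=$3 volt=$4
    awk -v p="$point" -v f="$freq" -v v="$volt" '
        # Range rows look like: VDDC_CURVE_SCLK[0]:   800Mhz   2200Mhz
        $1 == ("VDDC_CURVE_SCLK[" p "]:") { fmin = $2 + 0; fmax = $3 + 0; have_f = 1 }
        $1 == ("VDDC_CURVE_VOLT[" p "]:") { vmin = $2 + 0; vmax = $3 + 0; have_v = 1 }
        END {
            if (!have_f || !have_v) { print "no range for point " p; exit 2 }
            bad = 0
            if (f + 0 < fmin || f + 0 > fmax) {
                print "clock " f " MHz outside " fmin "-" fmax " MHz"; bad = 1
            }
            if (v + 0 < vmin || v + 0 > vmax) {
                print "voltage " v " mV outside " vmin "-" vmax " mV"; bad = 1
            }
            if (!bad) print "ok"
            exit bad
        }' "$file"
}

# Example: cat /sys/class/drm/card0/device/pp_od_clk_voltage > od.txt
#          check_vc_point od.txt 0 500 700
```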