ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md

vega 64 voltage values not respected #62

Closed Bednar87 closed 5 years ago

Bednar87 commented 5 years ago

I have a Gigabyte Radeon RX Vega 64 GAMING OC 8GB and I have been playing with undervolting/overclocking the card for a better performance/power draw ratio. I have set the kernel boot parameter correctly and modified the voltages/frequencies as follows:

bednar@bednar-Ubuntu:~$ /opt/rocm/bin/rocm-smi --showclkvolt

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        979Mhz        825mV
GPU[0]      : 2:       1106Mhz        850mV
GPU[0]      : 3:       1233Mhz        875mV
GPU[0]      : 4:       1360Mhz        900mV
GPU[0]      : 5:       1485Mhz        925mV
GPU[0]      : 6:       1500Mhz        950mV
GPU[0]      : 7:       1536Mhz       1000mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        825mV
GPU[0]      : 2:        800Mhz        865mV
GPU[0]      : 3:       1025Mhz       1000mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
================================================================================
==============================End of ROCm SMI Log ==============================

I have also set the power limit to +50%.

However, the card completely ignores those values and the voltage stays at 1200mV with power draw of 330 watts.

Any idea what could be the problem?
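One way to rule out a malformed table is to parse the `--showclkvolt` output and check every level against OD_RANGE. The following is a minimal sketch, not part of rocm-smi; the function names `parse_od_table` and `out_of_range` are hypothetical, and the sample text is copied from the log above.

```python
import re

# Table copied from the --showclkvolt output earlier in this comment.
SAMPLE = """\
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        979Mhz        825mV
GPU[0]      : 2:       1106Mhz        850mV
GPU[0]      : 3:       1233Mhz        875mV
GPU[0]      : 4:       1360Mhz        900mV
GPU[0]      : 5:       1485Mhz        925mV
GPU[0]      : 6:       1500Mhz        950mV
GPU[0]      : 7:       1536Mhz       1000mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        825mV
GPU[0]      : 2:        800Mhz        865mV
GPU[0]      : 3:       1025Mhz       1000mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
"""

PREFIX = re.compile(r"^GPU\[\d+\]\s*:\s*")
ROW = re.compile(r"(\w+):\s*(\d+)\s*(?:M[Hh]z|mV)\s+(\d+)\s*(?:M[Hh]z|mV)")


def parse_od_table(text):
    """Return {section: {key: (first_value, second_value)}} for each OD_* section."""
    tables, section = {}, None
    for raw in text.splitlines():
        line = PREFIX.sub("", raw).strip()  # drop the "GPU[0] : " prefix if present
        header = re.fullmatch(r"(OD_\w+):", line)
        if header:
            section = header.group(1)
            tables[section] = {}
            continue
        row = ROW.fullmatch(line)
        if section and row:
            tables[section][row.group(1)] = (int(row.group(2)), int(row.group(3)))
    return tables


def out_of_range(tables):
    """List every (section, level, MHz, mV) entry that falls outside OD_RANGE."""
    lo_mv, hi_mv = tables["OD_RANGE"]["VDDC"]
    problems = []
    for section, clock_key in (("OD_SCLK", "SCLK"), ("OD_MCLK", "MCLK")):
        lo_mhz, hi_mhz = tables["OD_RANGE"][clock_key]
        for level, (mhz, mv) in tables[section].items():
            if not (lo_mhz <= mhz <= hi_mhz and lo_mv <= mv <= hi_mv):
                problems.append((section, level, mhz, mv))
    return problems
```

For the table above, `out_of_range(parse_od_table(SAMPLE))` returns an empty list, so the requested values themselves sit inside the advertised limits.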

Bednar87 commented 5 years ago

I have done some more tests and the results are very buggy. I am not sure where the problem lies and would appreciate any pointers.

I am running Ubuntu 19.04 with the 5.0.0-13-generic kernel. I have specified amdgpu.ppfeaturemask=0xfffd7fff as a boot parameter and installed ROCm from the official Debian repository as instructed here on GitHub. The two commands below execute correctly and list the GPU:

/opt/rocm/bin/rocminfo 
/opt/rocm/opencl/bin/x86_64/clinfo 

Here is the output: https://gist.github.com/Bednar87/c7d6827a18895a78734309afdf3a4758
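To double-check the ppfeaturemask boot parameter mentioned above, the mask can be decoded bit by bit. This is a sketch under one stated assumption: that the overdrive feature is bit 14 (0x4000, `PP_OVERDRIVE_MASK` in the kernel's amd_shared.h as I recall it). Please verify that constant against your kernel source before relying on it.

```python
# Assumption: overdrive is bit 14 of the powerplay feature mask
# (PP_OVERDRIVE_MASK in amd_shared.h); check your kernel tree.
PP_OVERDRIVE_MASK = 0x4000


def parse_mask(mask_text):
    """Accept hex ("0xfffd7fff") or decimal text, as sysfs may print either."""
    return int(mask_text.strip(), 0)


def overdrive_enabled(mask_text):
    """True if the assumed overdrive bit is set in the mask."""
    return bool(parse_mask(mask_text) & PP_OVERDRIVE_MASK)


def cleared_bits(mask_text, width=32):
    """Bit positions that the mask switches off."""
    mask = parse_mask(mask_text)
    return [bit for bit in range(width) if not mask & (1 << bit)]
```

With the value used here, `overdrive_enabled("0xfffd7fff")` is true and `cleared_bits("0xfffd7fff")` gives `[15, 17]`. On a running system the effective value can usually be read from /sys/module/amdgpu/parameters/ppfeaturemask and passed to the same functions.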

lspci:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.5 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c4)
00:1c.6 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 7 (rev c4)
00:1c.7 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 8 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Z77 Express Chipset LPC Controller (rev 04)
00:1f.2 IDE interface: Intel Corporation 7 Series/C210 Series Chipset Family 4-port SATA Controller [IDE mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
00:1f.5 IDE interface: Intel Corporation 7 Series/C210 Series Chipset Family 2-port SATA Controller [IDE mode] (rev 04)
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1470 (rev c1)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1471
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf8
05:00.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 41)
07:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 10)
08:00.0 USB controller: Etron Technology, Inc. EJ168 USB 3.0 Host Controller (rev 01)

Output of /opt/rocm/bin/rocm-smi -a before I make any changes whatsoever:

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0]      : Temperature: 53.0c
================================================================================
================================================================================
GPU[0]      : dcefclk Clock Level: 0 (600Mhz)
GPU[0]      : mclk Clock Level: 2 (800Mhz)
GPU[0]      : pcie Clock Level: 1 (8.0GT/s, x16)
GPU[0]      : sclk Clock Level: 3 (1138Mhz)
GPU[0]      : socclk Clock Level: 3 (847Mhz)
================================================================================
================================================================================
GPU[0]      : Fan Level: 61 (23%)
================================================================================
================================================================================
GPU[0]      : Current Performance Level: auto
================================================================================
================================================================================
GPU[0]      : Current GPU OverDrive value: 0%
================================================================================
================================================================================
GPU[0]      : Current GPU Memory OverDrive value: 0%
================================================================================
================================================================================
GPU[0]      : Max Graphics Package Power: 247.0W
================================================================================
================================================================================
GPU[0]      : 
GPU[0]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[0]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[0]      :   2   POWER_SAVING :             90  60          0              0
GPU[0]      :   3          VIDEO :             70  60          0              0
GPU[0]      :   4             VR :             70  90          0              0
GPU[0]      :   5        COMPUTE :             30  60          0              6
GPU[0]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0]      : Average Graphics Package Power: 12.0W
================================================================================
================================================================================
GPU[0]      : Supported dcefclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 
GPU[0]      : Supported mclk frequencies on GPU0
GPU[0]      : 0: 167Mhz 
GPU[0]      : 1: 500Mhz 
GPU[0]      : 2: 800Mhz *
GPU[0]      : 3: 945Mhz 
GPU[0]      : 
GPU[0]      : Supported pcie frequencies on GPU0
GPU[0]      : 0: 8.0GT/s, x16 
GPU[0]      : 1: 8.0GT/s, x16 *
GPU[0]      : 
GPU[0]      : Supported sclk frequencies on GPU0
GPU[0]      : 0: 852Mhz 
GPU[0]      : 1: 991Mhz 
GPU[0]      : 2: 1084Mhz 
GPU[0]      : 3: 1138Mhz *
GPU[0]      : 4: 1200Mhz 
GPU[0]      : 5: 1401Mhz 
GPU[0]      : 6: 1536Mhz 
GPU[0]      : 7: 1630Mhz 
GPU[0]      : 
GPU[0]      : Supported socclk frequencies on GPU0
GPU[0]      : 0: 600Mhz 
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz *
GPU[0]      : 4: 900Mhz 
GPU[0]      : 5: 960Mhz 
GPU[0]      : 6: 1028Mhz 
GPU[0]      : 7: 1107Mhz 
GPU[0]      : 
================================================================================
================================================================================
GPU[0]      : Current GPU use: 0%
================================================================================
================================================================================
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        991Mhz        900mV
GPU[0]      : 2:       1084Mhz        950mV
GPU[0]      : 3:       1138Mhz       1000mV
GPU[0]      : 4:       1200Mhz       1050mV
GPU[0]      : 5:       1401Mhz       1100mV
GPU[0]      : 6:       1536Mhz       1150mV
GPU[0]      : 7:       1630Mhz       1200mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        800mV
GPU[0]      : 2:        800Mhz        950mV
GPU[0]      : 3:        945Mhz       1100mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
================================================================================
==============================End of ROCm SMI Log ==============================

I then set the performance level flag to manual as follows:

/opt/rocm/bin/rocm-smi --setperflevel auto

The command is successful:

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : Successfully set current Performance Level to auto
================================================================================
==============================End of ROCm SMI Log ==============================

/opt/rocm/bin/rocm-smi --showperflevel

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : Current Performance Level: auto
================================================================================
==============================End of ROCm SMI Log ==============================

Setting the power overdrive value through rocm-smi fails:

/opt/rocm/bin/rocm-smi --setpoweroverdrive 330

Traceback (most recent call last):
  File "/opt/rocm/bin/rocm-smi", line 1910, in <module>
    setPowerOverDrive(deviceList, args.setpoweroverdrive, args.autorespond)
  File "/opt/rocm/bin/rocm-smi", line 1396, in setPowerOverDrive
    power_cap_path = getFilePath(device, 'power1_cap')
  File "/opt/rocm/bin/rocm-smi", line 130, in getFilePath
    pathDict = valuePaths[key]
KeyError: 'power1_cap'

However, I later found a way to write to the file directly.
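For readers hitting the same KeyError, a direct write might look like the sketch below. It assumes the amdgpu hwmon file power1_cap takes a value in microwatts and that a sibling power1_cap_max file holds the upper limit; the exact path (typically something like /sys/class/drm/card0/device/hwmon/hwmon*/power1_cap) varies per system, and the helper names are hypothetical.

```python
from pathlib import Path


def watts_to_microwatts(watts):
    """hwmon power caps are expressed in microwatts (assumption noted above)."""
    return int(round(watts * 1_000_000))


def set_power_cap(cap_file, watts, cap_max_file=None):
    """Write a cap given in watts; optionally refuse values above the maximum.

    On real sysfs this needs root. Returns the value written, in microwatts.
    """
    value = watts_to_microwatts(watts)
    if cap_max_file is not None:
        limit = int(Path(cap_max_file).read_text().strip())
        if value > limit:
            raise ValueError(f"{value} uW exceeds the maximum of {limit} uW")
    Path(cap_file).write_text(f"{value}\n")
    return value
```

Checking against the maximum first gives an explicit error instead of a rejected write that is easy to miss.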

Then I proceed to set the frequencies and voltages:

/opt/rocm/bin/rocm-smi  --setmlevel 3 1025 975
/opt/rocm/bin/rocm-smi  --setslevel 7 1550 975

I set all of them one by one with the final result as follows:

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0]      : Temperature: 54.0c
================================================================================
================================================================================
GPU[0]      : dcefclk Clock Level: 0 (600Mhz)
GPU[0]      : mclk Clock Level: 2 (800Mhz)
GPU[0]      : pcie Clock Level: 1 (8.0GT/s, x16)
GPU[0]      : sclk Clock Level: 3 (1138Mhz)
GPU[0]      : socclk Clock Level: 3 (847Mhz)
================================================================================
================================================================================
GPU[0]      : Fan Level: 58 (22%)
================================================================================
================================================================================
GPU[0]      : Current Performance Level: manual
================================================================================
================================================================================
GPU[0]      : Unable to get GPU OverDrive value
================================================================================
================================================================================
GPU[0]      : Current GPU Memory OverDrive value: 9%
================================================================================
================================================================================
GPU[0]      : Max Graphics Package Power: 330.0W
================================================================================
================================================================================
GPU[0]      : 
GPU[0]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[0]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[0]      :   2   POWER_SAVING :             90  60          0              0
GPU[0]      :   3          VIDEO :             70  60          0              0
GPU[0]      :   4             VR :             70  90          0              0
GPU[0]      :   5        COMPUTE :             30  60          0              6
GPU[0]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0]      : Average Graphics Package Power: 15.0W
================================================================================
================================================================================
GPU[0]      : Supported dcefclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 
GPU[0]      : Supported mclk frequencies on GPU0
GPU[0]      : 0: 167Mhz 
GPU[0]      : 1: 500Mhz 
GPU[0]      : 2: 800Mhz *
GPU[0]      : 3: 1025Mhz 
GPU[0]      : 
GPU[0]      : Supported pcie frequencies on GPU0
GPU[0]      : 0: 8.0GT/s, x16 
GPU[0]      : 1: 8.0GT/s, x16 *
GPU[0]      : 
GPU[0]      : Supported sclk frequencies on GPU0
GPU[0]      : 0: 852Mhz 
GPU[0]      : 1: 991Mhz 
GPU[0]      : 2: 1084Mhz 
GPU[0]      : 3: 1138Mhz *
GPU[0]      : 4: 1250Mhz 
GPU[0]      : 5: 1370Mhz 
GPU[0]      : 6: 1475Mhz 
GPU[0]      : 7: 1550Mhz 
GPU[0]      : 
GPU[0]      : Supported socclk frequencies on GPU0
GPU[0]      : 0: 600Mhz 
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz *
GPU[0]      : 4: 900Mhz 
GPU[0]      : 5: 960Mhz 
GPU[0]      : 6: 1028Mhz 
GPU[0]      : 7: 1107Mhz 
GPU[0]      : 
================================================================================
================================================================================
GPU[0]      : Current GPU use: 0%
================================================================================
================================================================================
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        991Mhz        825mV
GPU[0]      : 2:       1084Mhz        850mV
GPU[0]      : 3:       1138Mhz        875mV
GPU[0]      : 4:       1250Mhz        900mV
GPU[0]      : 5:       1370Mhz        925mV
GPU[0]      : 6:       1475Mhz        950mV
GPU[0]      : 7:       1550Mhz        975mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        825mV
GPU[0]      : 2:        800Mhz        865mV
GPU[0]      : 3:       1025Mhz        975mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
================================================================================
==============================End of ROCm SMI Log ==============================

Then I run the Heaven benchmark while running sudo watch -n 0.5 cat /sys/kernel/debug/dri/0/amdgpu_pm_info in a terminal window.

GFX Clocks and Power:
        800 MHz (MCLK)
        1447 MHz (SCLK)
        1138 MHz (PSTATE_SCLK)
        800 MHz (PSTATE_MCLK)
        1200 mV (VDDGFX)
        246.0 W (average GPU)

GPU Temperature: 62 C
GPU Load: 99 %

SMC Feature Mask: 0x000000001ba1fb4f
UVD: Disabled

VCE: Disabled

The voltage is stuck at 1200mV and the memory frequency at 800 MHz (sometimes at 167MHz). Depending on the frequencies and voltages I set, the card sometimes reaches P7 and M3, but it still completely ignores the voltage setting.
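To quantify the gap instead of eyeballing the watch output, the amdgpu_pm_info text can be parsed and compared with the highest voltage requested in the table. A rough sketch, with the sample copied from the output above and hypothetical helper names:

```python
import re

# Excerpt of /sys/kernel/debug/dri/0/amdgpu_pm_info as pasted above.
PM_INFO_SAMPLE = """\
GFX Clocks and Power:
        800 MHz (MCLK)
        1447 MHz (SCLK)
        1138 MHz (PSTATE_SCLK)
        800 MHz (PSTATE_MCLK)
        1200 mV (VDDGFX)
        246.0 W (average GPU)
"""

READING = re.compile(
    r"^[ \t]*([\d.]+)[ \t]*(MHz|mV|W)[ \t]*\(([^)]+)\)[ \t]*$", re.MULTILINE
)


def parse_pm_info(text):
    """Map each label, e.g. "VDDGFX", to a (value, unit) tuple."""
    return {m.group(3): (float(m.group(1)), m.group(2)) for m in READING.finditer(text)}


def voltage_overshoot(readings, requested_max_mv):
    """How many mV the live VDDGFX reading sits above the requested maximum."""
    vddgfx, _unit = readings["VDDGFX"]
    return vddgfx - requested_max_mv
```

With the 975mV top level from the table I set, `voltage_overshoot(parse_pm_info(PM_INFO_SAMPLE), 975)` comes out at 225.0 mV.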

Example of values where the voltage setting is completely ignored:

OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        900mV
2:       1084Mhz        910mV
3:       1138Mhz        930mV
4:       1250Mhz        945mV
5:       1375Mhz        960mV
6:       1475Mhz        995mV
7:       1550Mhz       1050mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        850mV
2:        800Mhz        910mV
3:       1000Mhz       1050mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV

Thanks,

kentrussell commented 5 years ago

Can you try the 2.5 release? The upstream team made a couple fixes for voltage on Vega10 (which is the base for your Vega64). Thanks!

ArSd-g commented 5 years ago

Same here, and it is driving me mad.

This causes mining instability and low hashrates, but high power draw, if --setpoweroverdrive is not set. Other suspects are the kernel and/or the OpenCL drivers.

kentrussell commented 5 years ago

@arS-en Did you try the 2.5 build? The upstream kernel made some voltage fixes for Vega10 that may help

hbfs commented 5 years ago

Thanks for the hard work ROC team! Compute with Radeon is a huge improvement from a year ago.

I had similar issues with a water-cooled Vega 64 reference card on Ubuntu 18.04.2 Server with ROCm 2.5-27.

After some trial and error, I was able to beat the performance and efficiency I got on Windows 10 x64 while running Ubuntu 18.04.2 Server.

Key takeaway/what worked for me:

Hope this helps.

I'd like to see rocm-smi become more consistent and accurate about the modifications it applies, and report when a change was not applied instead of failing silently. Maybe functionality like this is best done via sysfs directly.

final script:

rocm-smi --autorespond y --setpoweroverdrive 130

rocm-smi --autorespond y --setmlevel 2 800 800

rocm-smi --autorespond y --setslevel 0 930 805
rocm-smi --autorespond y --setmlevel 3 1100 805

rocm-smi --autorespond y --setsclk 0 --setmclk 3
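
As a rough illustration of the sysfs route, the level edits in the script above could be expressed as command strings for the amdgpu pp_od_clk_voltage file. This sketch assumes the Vega10-style syntax of "s <level> <MHz> <mV>" and "m <level> <MHz> <mV>" followed by "c" to commit; the path and the helper names are assumptions, so check the amdgpu documentation for your kernel.

```python
def od_commands(sclk_levels, mclk_levels):
    """Build the strings to write, one per write, ending with a commit.

    Both arguments map a level index to a (MHz, mV) tuple.
    """
    lines = [f"s {lvl} {mhz} {mv}" for lvl, (mhz, mv) in sorted(sclk_levels.items())]
    lines += [f"m {lvl} {mhz} {mv}" for lvl, (mhz, mv) in sorted(mclk_levels.items())]
    lines.append("c")  # commit the edited table
    return lines


def apply_commands(lines, path="/sys/class/drm/card0/device/pp_od_clk_voltage"):
    """Issue each command as its own write (needs root; path is an assumption)."""
    for line in lines:
        with open(path, "w") as handle:
            handle.write(line + "\n")
```

For the values in the script, `od_commands({0: (930, 805)}, {2: (800, 800), 3: (1100, 805)})` yields `["s 0 930 805", "m 2 800 800", "m 3 1100 805", "c"]`. Reading the file back afterwards shows whether the driver actually accepted the table.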

kentrussell commented 5 years ago

Couldn't reproduce it on VG10 on 2.7. Closing for now unless the issue persists on your system.

Bednar87 commented 5 years ago

Hi @kentrussell

I am still facing this issue on 2.7. I previously described two problems:

The first issue is still there. Whenever the card reaches P7, the voltage cap goes out the window and the voltage shoots up to 1200mV. The memory frequency bug seems to be fixed in the sense that the clock no longer gets stuck at 167 or 800 MHz; it now sticks at the M3 value, in my case 1050MHz.

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0]      : Temperature (Sensor edge) (c): 54.0
GPU[0]      : Temperature (Sensor junction) (c): 56.0
GPU[0]      : Temperature (Sensor mem) (c): 52.0
================================================================================
================================================================================
GPU[0]      : dcefclk clock level: 0 (600Mhz)
GPU[0]      : mclk clock level: 3 (1050Mhz)
GPU[0]      : pcie clock level: 1 (8.0GT/s, x16)
GPU[0]      : sclk clock level: 0 (852Mhz)
GPU[0]      : socclk clock level: 4 (900Mhz)
================================================================================
================================================================================
GPU[0]      : Fan Level: 61 (23%)
================================================================================
================================================================================
GPU[0]      : Performance Level: manual
================================================================================
================================================================================
================================================================================
================================================================================
GPU[0]      : GPU Memory OverDrive value (%): 12
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0]      : Max Graphics Package Power (W): 320.0
================================================================================
================================================================================
GPU[0]      : 
GPU[0]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[0]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[0]      :   2   POWER_SAVING :             90  60          0              0
GPU[0]      :   3          VIDEO :             70  60          0              0
GPU[0]      :   4             VR :             70  90          0              0
GPU[0]      :   5        COMPUTE :             30  60          0              6
GPU[0]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0]      : Average Graphics Package Power (W): 27.0
================================================================================
================================================================================
GPU[0]      : Supported dcefclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 
GPU[0]      : Supported mclk frequencies on GPU0
GPU[0]      : 0: 167Mhz 
GPU[0]      : 1: 500Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 1050Mhz *
GPU[0]      : 
GPU[0]      : Supported pcie frequencies on GPU0
GPU[0]      : 0: 8.0GT/s, x16 
GPU[0]      : 1: 8.0GT/s, x16 *
GPU[0]      : 
GPU[0]      : Supported sclk frequencies on GPU0
GPU[0]      : 0: 852Mhz *
GPU[0]      : 1: 991Mhz 
GPU[0]      : 2: 1050Mhz 
GPU[0]      : 3: 1125Mhz 
GPU[0]      : 4: 1200Mhz 
GPU[0]      : 5: 1275Mhz 
GPU[0]      : 6: 1375Mhz 
GPU[0]      : 7: 1450Mhz 
GPU[0]      : 
GPU[0]      : Supported socclk frequencies on GPU0
GPU[0]      : 0: 600Mhz 
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz *
GPU[0]      : 5: 960Mhz 
GPU[0]      : 6: 1028Mhz 
GPU[0]      : 7: 1107Mhz 
GPU[0]      : 
================================================================================
================================================================================
GPU[0]      : GPU use (%): 1
================================================================================
================================================================================
================================================================================
================================================================================
GPU[0]      : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0]      : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0]      : Serial Number: N/A
================================================================================
PIDs for KFD processes:
================================================================================
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        991Mhz        900mV
GPU[0]      : 2:       1050Mhz        910mV
GPU[0]      : 3:       1125Mhz        920mV
GPU[0]      : 4:       1200Mhz        930mV
GPU[0]      : 5:       1275Mhz        940mV
GPU[0]      : 6:       1375Mhz        945mV
GPU[0]      : 7:       1450Mhz        950mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        850mV
GPU[0]      : 2:        800Mhz        910mV
GPU[0]      : 3:       1050Mhz        950mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
================================================================================
================================================================================
GPU[0]      : Voltage (mV): 1200
================================================================================
==============================End of ROCm SMI Log ==============================
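
The mismatch in this log can be checked mechanically: take the current sclk level, look up the voltage requested for it in OD_SCLK, and compare with the reported voltage. The sketch below does only that per-level comparison; it deliberately ignores any influence the OD_MCLK voltages may have on the core voltage. The log excerpt is copied from the output above and the function names are hypothetical.

```python
import re

# Lines excerpted from the rocm-smi -a output above.
LOG = """\
GPU[0]      : sclk clock level: 0 (852Mhz)
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 7:       1450Mhz        950mV
GPU[0]      : OD_MCLK:
GPU[0]      : 3:       1050Mhz        950mV
GPU[0]      : Voltage (mV): 1200
"""

PREFIX = re.compile(r"^GPU\[\d+\]\s*:\s*")


def current_sclk_level(log):
    """Index of the active sclk level, or None if the line is missing."""
    m = re.search(r"\bsclk clock level:\s*(\d+)", log, re.IGNORECASE)
    return int(m.group(1)) if m else None


def reported_voltage(log):
    """Live voltage in mV as printed by rocm-smi, or None."""
    m = re.search(r"Voltage \(mV\):\s*(\d+)", log)
    return int(m.group(1)) if m else None


def od_sclk_voltages(log):
    """Map level index to requested mV, reading only the OD_SCLK section."""
    voltages, in_sclk = {}, False
    for raw in log.splitlines():
        line = PREFIX.sub("", raw).strip()
        if line.startswith("OD_"):
            in_sclk = line == "OD_SCLK:"
            continue
        row = re.fullmatch(r"(\d+):\s*\d+M[Hh]z\s+(\d+)mV", line)
        if in_sclk and row:
            voltages[int(row.group(1))] = int(row.group(2))
    return voltages


def voltage_gap(log):
    """Reported voltage minus the voltage requested for the active sclk level."""
    return reported_voltage(log) - od_sclk_voltages(log)[current_sclk_level(log)]
```

For the excerpt, `voltage_gap(LOG)` is 400: level 0 asks for 800mV while the card reports 1200mV.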

kentrussell commented 5 years ago

For mclk getting stuck at level 3, does it happen if you do "--setmclk 0 1 2 3"? --setmclk 3 just tells it to stick at level 3, while --setmclk 0 1 2 3 will let it alternate between levels based on workload. As for the voltage sticking at 1200mV, that's definitely more concerning and I'll see what I can reproduce here.
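The difference between pinning one level and allowing several can be sketched as the string of level indices that ends up being requested. The example assumes the amdgpu pp_dpm_mclk sysfs file accepts a space-separated list of allowed levels while the performance level is manual, and that its read-back format marks the active level with "*", as in the logs in this thread. The helper names are hypothetical.

```python
def dpm_level_list(levels, num_levels):
    """Validate level indices and join them as a space-separated list."""
    unique = sorted(set(levels))
    if not unique:
        raise ValueError("at least one level is required")
    bad = [lvl for lvl in unique if not 0 <= lvl < num_levels]
    if bad:
        raise ValueError(f"levels out of range: {bad}")
    return " ".join(str(lvl) for lvl in unique)


def active_level(pp_dpm_text):
    """Return the index of the line ending in '*', or None if none is marked."""
    for line in pp_dpm_text.splitlines():
        if line.rstrip().endswith("*"):
            return int(line.split(":", 1)[0])
    return None
```

Here `dpm_level_list([3], 4)` gives `"3"` (stick to level 3), while `dpm_level_list([0, 1, 2, 3], 4)` gives `"0 1 2 3"` (let the driver move between all four).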

Bednar87 commented 5 years ago

Hi Kent,

Thank you for your continued assistance here. To be sure, I rebooted the machine and decided to start step by step from scratch, documenting every move I make (so to speak).

Output of /opt/rocm/bin/rocm-smi -a before I do anything (GPU idle):

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0]      : Temperature (Sensor edge) (c): 50.0
GPU[0]      : Temperature (Sensor junction) (c): 52.0
GPU[0]      : Temperature (Sensor mem) (c): 50.0
================================================================================
================================================================================
GPU[0]      : dcefclk clock level: 0 (600Mhz)
GPU[0]      : mclk clock level: 0 (167Mhz)
GPU[0]      : pcie clock level: 0 (8.0GT/s, x16)
GPU[0]      : sclk clock level: 0 (852Mhz)
GPU[0]      : socclk clock level: 0 (600Mhz)
================================================================================
================================================================================
GPU[0]      : Fan Level: 79 (30%)
================================================================================
================================================================================
GPU[0]      : Performance Level: auto
================================================================================
================================================================================
GPU[0]      : GPU OverDrive value (%): 0
================================================================================
================================================================================
GPU[0]      : GPU Memory OverDrive value (%): 0
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0]      : Max Graphics Package Power (W): 247.0
================================================================================
================================================================================
GPU[0]      : 
GPU[0]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[0]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[0]      :   2   POWER_SAVING :             90  60          0              0
GPU[0]      :   3          VIDEO :             70  60          0              0
GPU[0]      :   4             VR :             70  90          0              0
GPU[0]      :   5        COMPUTE :             30  60          0              6
GPU[0]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0]      : Average Graphics Package Power (W): 6.0
================================================================================
================================================================================
GPU[0]      : Supported dcefclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 
GPU[0]      : Supported mclk frequencies on GPU0
GPU[0]      : 0: 167Mhz *
GPU[0]      : 1: 500Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 945Mhz 
GPU[0]      : 
GPU[0]      : Supported pcie frequencies on GPU0
GPU[0]      : 0: 8.0GT/s, x16 *
GPU[0]      : 1: 8.0GT/s, x16 
GPU[0]      : 
GPU[0]      : Supported sclk frequencies on GPU0
GPU[0]      : 0: 852Mhz *
GPU[0]      : 1: 991Mhz 
GPU[0]      : 2: 1084Mhz 
GPU[0]      : 3: 1138Mhz 
GPU[0]      : 4: 1200Mhz 
GPU[0]      : 5: 1401Mhz 
GPU[0]      : 6: 1536Mhz 
GPU[0]      : 7: 1630Mhz 
GPU[0]      : 
GPU[0]      : Supported socclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 5: 960Mhz 
GPU[0]      : 6: 1028Mhz 
GPU[0]      : 7: 1107Mhz 
GPU[0]      : 
================================================================================
================================================================================
GPU[0]      : GPU use (%): 3
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0]      : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0]      : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0]      : Serial Number: N/A
================================================================================
PIDs for KFD processes:

================================================================================
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        991Mhz        900mV
GPU[0]      : 2:       1084Mhz        950mV
GPU[0]      : 3:       1138Mhz       1000mV
GPU[0]      : 4:       1200Mhz       1050mV
GPU[0]      : 5:       1401Mhz       1100mV
GPU[0]      : 6:       1536Mhz       1150mV
GPU[0]      : 7:       1630Mhz       1200mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        800mV
GPU[0]      : 2:        800Mhz        950mV
GPU[0]      : 3:        945Mhz       1100mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
================================================================================
================================================================================
GPU[0]      : Voltage (mV): 787
================================================================================
==============================End of ROCm SMI Log ==============================

Then I set performance level to manual as follows:

/opt/rocm/bin/rocm-smi --setperflevel manual

Output:

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : Successfully set current Performance Level to manual
================================================================================
==============================End of ROCm SMI Log ==============================

Then I execute the following commands:

/opt/rocm/bin/rocm-smi --setslevel 0 852 800 ; /opt/rocm/bin/rocm-smi --setslevel 1 991 900 ; /opt/rocm/bin/rocm-smi --setslevel 2 1050 910 ; /opt/rocm/bin/rocm-smi --setslevel 3 1125 920 ; /opt/rocm/bin/rocm-smi --setslevel 4 1200 930 ; /opt/rocm/bin/rocm-smi --setslevel 5 1275 940 ; /opt/rocm/bin/rocm-smi --setslevel 6 1375 955; /opt/rocm/bin/rocm-smi --setslevel 7 1450 975

and:

/opt/rocm/bin/rocm-smi --setmlevel 0 167 800 ; /opt/rocm/bin/rocm-smi --setmlevel 1 500 850 ; /opt/rocm/bin/rocm-smi --setmlevel 2 800 910 ; /opt/rocm/bin/rocm-smi --setmlevel 3 1025 975
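For anyone repeating this, the two command chains above can be generated from a table instead of being typed by hand. This is only a rough sketch in Python, assuming the --setslevel / --setmlevel syntax shown above (level, MHz, mV). The helper name build_commands is made up for illustration, and it only builds the command strings; it does not run them.

```python
# Hypothetical helper: turns (level, MHz, mV) rows into rocm-smi command lines.
ROCM_SMI = "/opt/rocm/bin/rocm-smi"

# Values copied from the commands in this post.
SCLK_TABLE = [
    (0, 852, 800), (1, 991, 900), (2, 1050, 910), (3, 1125, 920),
    (4, 1200, 930), (5, 1275, 940), (6, 1375, 955), (7, 1450, 975),
]
MCLK_TABLE = [(0, 167, 800), (1, 500, 850), (2, 800, 910), (3, 1025, 975)]


def build_commands(flag, table):
    """Return one command string per row, e.g. '... --setslevel 0 852 800'."""
    return [f"{ROCM_SMI} {flag} {level} {mhz} {mv}" for level, mhz, mv in table]


# Print the commands for review; nothing is executed here.
for cmd in build_commands("--setslevel", SCLK_TABLE) + build_commands("--setmlevel", MCLK_TABLE):
    print(cmd)
```

Keeping the values in one table makes it easier to compare them later against what --showclkvolt reports.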

Then:

/opt/rocm/bin/rocm-smi --setmclk 0 1 2 3

They all execute successfully.

Output of /opt/rocm/bin/rocm-smi -a:

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0]      : Temperature (Sensor edge) (c): 47.0
GPU[0]      : Temperature (Sensor junction) (c): 48.0
GPU[0]      : Temperature (Sensor mem) (c): 47.0
================================================================================
================================================================================
GPU[0]      : dcefclk clock level: 0 (600Mhz)
GPU[0]      : mclk clock level: 0 (167Mhz)
GPU[0]      : pcie clock level: 1 (8.0GT/s, x16)
GPU[0]      : sclk clock level: 0 (852Mhz)
GPU[0]      : socclk clock level: 0 (600Mhz)
================================================================================
================================================================================
GPU[0]      : Fan Level: 79 (30%)
================================================================================
================================================================================
GPU[0]      : Performance Level: manual
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU OverDrive value
================================================================================
================================================================================
GPU[0]      : GPU Memory OverDrive value (%): 9
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0]      : Max Graphics Package Power (W): 247.0
================================================================================
================================================================================
GPU[0]      : 
GPU[0]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[0]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[0]      :   2   POWER_SAVING :             90  60          0              0
GPU[0]      :   3          VIDEO :             70  60          0              0
GPU[0]      :   4             VR :             70  90          0              0
GPU[0]      :   5        COMPUTE :             30  60          0              6
GPU[0]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0]      : Average Graphics Package Power (W): 6.0
================================================================================
================================================================================
GPU[0]      : Supported dcefclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 
GPU[0]      : Supported mclk frequencies on GPU0
GPU[0]      : 0: 167Mhz *
GPU[0]      : 1: 500Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 1025Mhz 
GPU[0]      : 
GPU[0]      : Supported pcie frequencies on GPU0
GPU[0]      : 0: 8.0GT/s, x16 *
GPU[0]      : 1: 8.0GT/s, x16 
GPU[0]      : 
GPU[0]      : Supported sclk frequencies on GPU0
GPU[0]      : 0: 852Mhz *
GPU[0]      : 1: 991Mhz 
GPU[0]      : 2: 1050Mhz 
GPU[0]      : 3: 1125Mhz 
GPU[0]      : 4: 1200Mhz 
GPU[0]      : 5: 1275Mhz 
GPU[0]      : 6: 1375Mhz 
GPU[0]      : 7: 1450Mhz 
GPU[0]      : 
GPU[0]      : Supported socclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 5: 960Mhz 
GPU[0]      : 6: 1028Mhz 
GPU[0]      : 7: 1107Mhz 
GPU[0]      : 
================================================================================
================================================================================
GPU[0]      : GPU use (%): 14
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0]      : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0]      : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0]      : Serial Number: N/A
================================================================================
PIDs for KFD processes:

================================================================================
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        991Mhz        900mV
GPU[0]      : 2:       1050Mhz        910mV
GPU[0]      : 3:       1125Mhz        920mV
GPU[0]      : 4:       1200Mhz        930mV
GPU[0]      : 5:       1275Mhz        940mV
GPU[0]      : 6:       1375Mhz        955mV
GPU[0]      : 7:       1450Mhz        975mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        850mV
GPU[0]      : 2:        800Mhz        910mV
GPU[0]      : 3:       1025Mhz        975mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
================================================================================
================================================================================
GPU[0]      : Voltage (mV): 850
================================================================================
==============================End of ROCm SMI Log ==============================
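To check that the table in the log matches what was requested, the OD_SCLK / OD_MCLK rows can be pulled out of the text. A minimal sketch in Python, assuming the row layout printed above; the names parse_od_table and within_range are hypothetical, and the sample text is a shortened copy of the log.

```python
import re

# Matches rows such as "GPU[0]      : 2:       1050Mhz        910mV"
ROW = re.compile(r":\s*(\d+):\s+(\d+)Mhz\s+(\d+)mV")


def parse_od_table(text, section):
    """Return {level: (mhz, mv)} for the rows under 'OD_SCLK:' or 'OD_MCLK:'."""
    table, active = {}, False
    for line in text.splitlines():
        if line.rstrip().endswith("OD_" + section + ":"):
            active = True
            continue
        if active:
            match = ROW.search(line)
            if not match:
                break  # the next header (OD_MCLK:, OD_RANGE:) ends the section
            level, mhz, mv = map(int, match.groups())
            table[level] = (mhz, mv)
    return table


def within_range(table, lo_mv, hi_mv):
    """True if every voltage in the table lies inside the reported VDDC range."""
    return all(lo_mv <= mv <= hi_mv for _, mv in table.values())


SAMPLE = """\
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 7:       1450Mhz        975mV
GPU[0]      : OD_MCLK:
GPU[0]      : 3:       1025Mhz        975mV
GPU[0]      : OD_RANGE:
"""

sclk = parse_od_table(SAMPLE, "SCLK")
print(sclk, within_range(sclk, 800, 1200))
```

The 800 and 1200 limits are the VDDC values from the OD_RANGE section of the log.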

Then I run sudo watch -n 0.5 cat /sys/kernel/debug/dri/0/amdgpu_pm_info in one terminal and the Heaven benchmark in another.

Load:

[screenshot: load]

Idle:

[screenshot: idle]

Output of /opt/rocm/bin/rocm-smi -a again:

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0]      : Temperature (Sensor edge) (c): 54.0
GPU[0]      : Temperature (Sensor junction) (c): 56.0
GPU[0]      : Temperature (Sensor mem) (c): 53.0
================================================================================
================================================================================
GPU[0]      : dcefclk clock level: 0 (600Mhz)
GPU[0]      : mclk clock level: 3 (1025Mhz)
GPU[0]      : pcie clock level: 1 (8.0GT/s, x16)
GPU[0]      : sclk clock level: 0 (852Mhz)
GPU[0]      : socclk clock level: 6 (1028Mhz)
================================================================================
================================================================================
GPU[0]      : Fan Level: 61 (23%)
================================================================================
================================================================================
GPU[0]      : Performance Level: manual
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU OverDrive value
================================================================================
================================================================================
GPU[0]      : GPU Memory OverDrive value (%): 9
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0]      : Max Graphics Package Power (W): 247.0
================================================================================
================================================================================
GPU[0]      : 
GPU[0]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[0]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[0]      :   2   POWER_SAVING :             90  60          0              0
GPU[0]      :   3          VIDEO :             70  60          0              0
GPU[0]      :   4             VR :             70  90          0              0
GPU[0]      :   5        COMPUTE :             30  60          0              6
GPU[0]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[0]      : Average Graphics Package Power (W): 15.0
================================================================================
================================================================================
GPU[0]      : Supported dcefclk frequencies on GPU0
GPU[0]      : 0: 600Mhz *
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 
GPU[0]      : Supported mclk frequencies on GPU0
GPU[0]      : 0: 167Mhz 
GPU[0]      : 1: 500Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 1025Mhz *
GPU[0]      : 
GPU[0]      : Supported pcie frequencies on GPU0
GPU[0]      : 0: 8.0GT/s, x16 
GPU[0]      : 1: 8.0GT/s, x16 *
GPU[0]      : 
GPU[0]      : Supported sclk frequencies on GPU0
GPU[0]      : 0: 852Mhz *
GPU[0]      : 1: 991Mhz 
GPU[0]      : 2: 1050Mhz 
GPU[0]      : 3: 1125Mhz 
GPU[0]      : 4: 1200Mhz 
GPU[0]      : 5: 1275Mhz 
GPU[0]      : 6: 1375Mhz 
GPU[0]      : 7: 1450Mhz 
GPU[0]      : 
GPU[0]      : Supported socclk frequencies on GPU0
GPU[0]      : 0: 600Mhz 
GPU[0]      : 1: 720Mhz 
GPU[0]      : 2: 800Mhz 
GPU[0]      : 3: 847Mhz 
GPU[0]      : 4: 900Mhz 
GPU[0]      : 5: 960Mhz 
GPU[0]      : 6: 1028Mhz *
GPU[0]      : 7: 1107Mhz 
GPU[0]      : 
================================================================================
================================================================================
GPU[0]      : GPU use (%): 7
================================================================================
================================================================================
ERROR: GPU[0]       : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0]      : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0]      : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0]      : Serial Number: N/A
================================================================================
PIDs for KFD processes:

================================================================================
GPU[0]      : OD_SCLK:
GPU[0]      : 0:        852Mhz        800mV
GPU[0]      : 1:        991Mhz        900mV
GPU[0]      : 2:       1050Mhz        910mV
GPU[0]      : 3:       1125Mhz        920mV
GPU[0]      : 4:       1200Mhz        930mV
GPU[0]      : 5:       1275Mhz        940mV
GPU[0]      : 6:       1375Mhz        955mV
GPU[0]      : 7:       1450Mhz        975mV
GPU[0]      : OD_MCLK:
GPU[0]      : 0:        167Mhz        800mV
GPU[0]      : 1:        500Mhz        850mV
GPU[0]      : 2:        800Mhz        910mV
GPU[0]      : 3:       1025Mhz        975mV
GPU[0]      : OD_RANGE:
GPU[0]      : SCLK:     852MHz       2400MHz
GPU[0]      : MCLK:     167MHz       1500MHz
GPU[0]      : VDDC:     800mV        1200mV
================================================================================
================================================================================
GPU[0]      : Voltage (mV): 1000
================================================================================
==============================End of ROCm SMI Log ==============================

In this case the voltage is stuck at 1000mV, most likely because I didn't raise the power limit to 330 watts.
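The mismatch can be made explicit by comparing the reported voltage with the table entries for the active clock levels. A rough sketch in Python, under one loud assumption that is not confirmed anywhere in this thread: that the reported voltage should not sit above the higher of the active sclk and mclk table voltages. The function names are hypothetical.

```python
import re

# Table voltages (mV) copied from the OD_SCLK / OD_MCLK sections of the log.
SCLK_MV = {0: 800, 1: 900, 2: 910, 3: 920, 4: 930, 5: 940, 6: 955, 7: 975}
MCLK_MV = {0: 800, 1: 850, 2: 910, 3: 975}


def current_level(text, clock):
    """Return the active level from a line like 'sclk clock level: 0 (852Mhz)'."""
    match = re.search(rf"\b{clock} clock level:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def reported_voltage(text):
    """Return the value from a line like 'Voltage (mV): 1000'."""
    match = re.search(r"Voltage \(mV\):\s*(\d+)", text)
    return int(match.group(1)) if match else None


def voltage_excess(text):
    """Reported mV minus the higher table voltage of the active levels.

    Assumption (not confirmed in this thread): the voltage should not exceed
    the larger of the two table entries. Returns None if a field is missing.
    """
    sclk = current_level(text, "sclk")
    mclk = current_level(text, "mclk")
    mv = reported_voltage(text)
    if sclk is None or mclk is None or mv is None:
        return None
    return mv - max(SCLK_MV[sclk], MCLK_MV[mclk])


# Three lines taken from the log above.
LOG = """\
GPU[0]      : mclk clock level: 3 (1025Mhz)
GPU[0]      : sclk clock level: 0 (852Mhz)
GPU[0]      : Voltage (mV): 1000
"""
print(voltage_excess(LOG))
```

For the lines from the log above this gives 25: the card reports 1000mV while the highest table entry for the active levels (mclk level 3) is 975mV.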

kentrussell commented 5 years ago

Out of curiosity, does dmesg end up throwing any information at all? I am hesitant to enable dynamic debugging because there will be an absolute boatload of extra information in there, but the plain dmesg might throw something from powerplay that might help a bit too.

Bednar87 commented 5 years ago

The only thing I see related to powerplay is this:

[ 3.019956] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega10_smu

Full dmesg attached.

dmesg.txt
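For a long dmesg, the powerplay lines can be filtered out rather than read by eye. A minimal sketch in Python; the function name powerplay_lines is made up, and the second sample line is a placeholder rather than real kernel output.

```python
def powerplay_lines(dmesg_text):
    """Return the dmesg lines that mention powerplay, ignoring case."""
    return [line for line in dmesg_text.splitlines() if "powerplay" in line.lower()]


SAMPLE_DMESG = """\
[ 3.019956] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega10_smu
[ 3.100000] placeholder line unrelated to power management
"""

for line in powerplay_lines(SAMPLE_DMESG):
    print(line)
```

The same function can be pointed at the contents of a saved dmesg file.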

kentrussell commented 5 years ago

Darn, it was worth a shot. I'll get on this once I finish my current task, and see what I can find out. Thanks for your continued help and information!

Bednar87 commented 5 years ago

Hi mate,

This patch fixes the issue for me. It would be great to see it applied upstream.

https://bugzilla.kernel.org/show_bug.cgi?id=205277

Bednar87 commented 5 years ago

The patch has been applied to staging.

https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=af15efcd9fd3e75fdab3618ace926543a1f9ebea

Great stuff, guys.

I consider this resolved.