Closed Bednar87 closed 5 years ago
I have done some more tests and the results are very buggy. I am not sure where the problem lies and would appreciate any pointers.
I am running Ubuntu 19.04 with the 5.0.0-13-generic kernel. I have specified amdgpu.ppfeaturemask=0xfffd7fff as a boot parameter and installed ROCm from the official Debian repository as instructed here on GitHub. The two commands below execute correctly and list the GPU:
/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/x86_64/clinfo
Here is the output: https://gist.github.com/Bednar87/c7d6827a18895a78734309afdf3a4758
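As a quick sanity check on the boot parameter above, the mask can be decoded bit by bit. This is only a minimal sketch: the helper names are made up for illustration, and the idea that bit 14 (0x4000) is the OverDrive enable bit is an assumption taken from the kernel's amdgpu headers, not something confirmed in this thread, so check it against your kernel source.

```python
from typing import List

# ASSUMPTION: bit 14 (0x4000) is the OverDrive enable bit in ppfeaturemask.
# Verify against amd_shared.h in your kernel tree; this thread does not confirm it.
ASSUMED_OVERDRIVE_BIT = 14


def cleared_bits(mask: int, width: int = 32) -> List[int]:
    """Return the bit positions that are 0 in a `width`-bit mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


def bit_is_set(mask: int, bit: int) -> bool:
    """Return True when the given bit position is 1 in the mask."""
    return bool((mask >> bit) & 1)


mask = 0xFFFD7FFF  # the value passed as amdgpu.ppfeaturemask in this report
print("cleared bits:", cleared_bits(mask))
print("assumed OverDrive bit set:", bit_is_set(mask, ASSUMED_OVERDRIVE_BIT))
```

If the assumed bit were cleared, the OD_SCLK/OD_MCLK tables would most likely not be writable at all, which is worth ruling out before looking elsewhere.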
lspci:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.5 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c4)
00:1c.6 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 7 (rev c4)
00:1c.7 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 8 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Z77 Express Chipset LPC Controller (rev 04)
00:1f.2 IDE interface: Intel Corporation 7 Series/C210 Series Chipset Family 4-port SATA Controller [IDE mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
00:1f.5 IDE interface: Intel Corporation 7 Series/C210 Series Chipset Family 2-port SATA Controller [IDE mode] (rev 04)
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1470 (rev c1)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1471
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf8
05:00.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 41)
07:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 10)
08:00.0 USB controller: Etron Technology, Inc. EJ168 USB 3.0 Host Controller (rev 01)
Output of /opt/rocm/bin/rocm-smi -a before I make any changes whatsoever:
========================ROCm System Management Interface========================
================================================================================
GPU[0] : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0] : Temperature: 53.0c
================================================================================
================================================================================
GPU[0] : dcefclk Clock Level: 0 (600Mhz)
GPU[0] : mclk Clock Level: 2 (800Mhz)
GPU[0] : pcie Clock Level: 1 (8.0GT/s, x16)
GPU[0] : sclk Clock Level: 3 (1138Mhz)
GPU[0] : socclk Clock Level: 3 (847Mhz)
================================================================================
================================================================================
GPU[0] : Fan Level: 61 (23%)
================================================================================
================================================================================
GPU[0] : Current Performance Level: auto
================================================================================
================================================================================
GPU[0] : Current GPU OverDrive value: 0%
================================================================================
================================================================================
GPU[0] : Current GPU Memory OverDrive value: 0%
================================================================================
================================================================================
GPU[0] : Max Graphics Package Power: 247.0W
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power: 12.0W
================================================================================
================================================================================
GPU[0] : Supported dcefclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 167Mhz
GPU[0] : 1: 500Mhz
GPU[0] : 2: 800Mhz *
GPU[0] : 3: 945Mhz
GPU[0] :
GPU[0] : Supported pcie frequencies on GPU0
GPU[0] : 0: 8.0GT/s, x16
GPU[0] : 1: 8.0GT/s, x16 *
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 852Mhz
GPU[0] : 1: 991Mhz
GPU[0] : 2: 1084Mhz
GPU[0] : 3: 1138Mhz *
GPU[0] : 4: 1200Mhz
GPU[0] : 5: 1401Mhz
GPU[0] : 6: 1536Mhz
GPU[0] : 7: 1630Mhz
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 600Mhz
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz *
GPU[0] : 4: 900Mhz
GPU[0] : 5: 960Mhz
GPU[0] : 6: 1028Mhz
GPU[0] : 7: 1107Mhz
GPU[0] :
================================================================================
================================================================================
GPU[0] : Current GPU use: 0%
================================================================================
================================================================================
GPU[0] : OD_SCLK:
GPU[0] : 0: 852Mhz 800mV
GPU[0] : 1: 991Mhz 900mV
GPU[0] : 2: 1084Mhz 950mV
GPU[0] : 3: 1138Mhz 1000mV
GPU[0] : 4: 1200Mhz 1050mV
GPU[0] : 5: 1401Mhz 1100mV
GPU[0] : 6: 1536Mhz 1150mV
GPU[0] : 7: 1630Mhz 1200mV
GPU[0] : OD_MCLK:
GPU[0] : 0: 167Mhz 800mV
GPU[0] : 1: 500Mhz 800mV
GPU[0] : 2: 800Mhz 950mV
GPU[0] : 3: 945Mhz 1100mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz
GPU[0] : MCLK: 167MHz 1500MHz
GPU[0] : VDDC: 800mV 1200mV
================================================================================
==============================End of ROCm SMI Log ==============================
I then set the performance level flag to manual as follows:
/opt/rocm/bin/rocm-smi --setperflevel auto
The command is successful:
========================ROCm System Management Interface========================
================================================================================
GPU[0] : Successfully set current Performance Level to auto
================================================================================
==============================End of ROCm SMI Log ==============================
/opt/rocm/bin/rocm-smi --showperflevel
========================ROCm System Management Interface========================
================================================================================
GPU[0] : Current Performance Level: auto
================================================================================
==============================End of ROCm SMI Log ==============================
Setting the power overdrive value through rocm-smi fails:
/opt/rocm/bin/rocm-smi --setpoweroverdrive 330
Traceback (most recent call last):
File "/opt/rocm/bin/rocm-smi", line 1910, in <module>
setPowerOverDrive(deviceList, args.setpoweroverdrive, args.autorespond)
File "/opt/rocm/bin/rocm-smi", line 1396, in setPowerOverDrive
power_cap_path = getFilePath(device, 'power1_cap')
File "/opt/rocm/bin/rocm-smi", line 130, in getFilePath
pathDict = valuePaths[key]
KeyError: 'power1_cap'
But I later find a way to write directly to the file.
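The post does not say which file was written. A plausible candidate, assumed here, is the hwmon attribute power1_cap named in the traceback. The sketch below rests on several assumptions: the GPU is card0, the attribute lives under the device's hwmon directory, and the value is in microwatts as the hwmon sysfs convention specifies. The function names are hypothetical.

```python
import glob

# ASSUMPTIONS: the GPU is card0 and exposes power1_cap under its hwmon directory;
# the value is in microwatts (hwmon sysfs convention). Writing requires root.
HWMON_GLOB = "/sys/class/drm/card0/device/hwmon/hwmon*/power1_cap"


def watts_to_microwatts(watts: float) -> int:
    """Convert a power cap in watts to the microwatt integer sysfs expects."""
    return int(round(watts * 1_000_000))


def set_power_cap(watts: float, pattern: str = HWMON_GLOB) -> bool:
    """Write the cap to the first matching file; return False if none is found."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        return False
    with open(paths[0], "w") as f:
        f.write(str(watts_to_microwatts(watts)))
    return True


# Example (run as root on the affected machine): set_power_cap(330)
print(watts_to_microwatts(330))
```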
Then I proceed to set the frequencies and voltages:
/opt/rocm/bin/rocm-smi --setmlevel 3 1025 975
/opt/rocm/bin/rocm-smi --setslevel 7 1550 975
I set all of them one by one with the final result as follows:
========================ROCm System Management Interface========================
================================================================================
GPU[0] : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0] : Temperature: 54.0c
================================================================================
================================================================================
GPU[0] : dcefclk Clock Level: 0 (600Mhz)
GPU[0] : mclk Clock Level: 2 (800Mhz)
GPU[0] : pcie Clock Level: 1 (8.0GT/s, x16)
GPU[0] : sclk Clock Level: 3 (1138Mhz)
GPU[0] : socclk Clock Level: 3 (847Mhz)
================================================================================
================================================================================
GPU[0] : Fan Level: 58 (22%)
================================================================================
================================================================================
GPU[0] : Current Performance Level: manual
================================================================================
================================================================================
GPU[0] : Unable to get GPU OverDrive value
================================================================================
================================================================================
GPU[0] : Current GPU Memory OverDrive value: 9%
================================================================================
================================================================================
GPU[0] : Max Graphics Package Power: 330.0W
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power: 15.0W
================================================================================
================================================================================
GPU[0] : Supported dcefclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 167Mhz
GPU[0] : 1: 500Mhz
GPU[0] : 2: 800Mhz *
GPU[0] : 3: 1025Mhz
GPU[0] :
GPU[0] : Supported pcie frequencies on GPU0
GPU[0] : 0: 8.0GT/s, x16
GPU[0] : 1: 8.0GT/s, x16 *
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 852Mhz
GPU[0] : 1: 991Mhz
GPU[0] : 2: 1084Mhz
GPU[0] : 3: 1138Mhz *
GPU[0] : 4: 1250Mhz
GPU[0] : 5: 1370Mhz
GPU[0] : 6: 1475Mhz
GPU[0] : 7: 1550Mhz
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 600Mhz
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz *
GPU[0] : 4: 900Mhz
GPU[0] : 5: 960Mhz
GPU[0] : 6: 1028Mhz
GPU[0] : 7: 1107Mhz
GPU[0] :
================================================================================
================================================================================
GPU[0] : Current GPU use: 0%
================================================================================
================================================================================
GPU[0] : OD_SCLK:
GPU[0] : 0: 852Mhz 800mV
GPU[0] : 1: 991Mhz 825mV
GPU[0] : 2: 1084Mhz 850mV
GPU[0] : 3: 1138Mhz 875mV
GPU[0] : 4: 1250Mhz 900mV
GPU[0] : 5: 1370Mhz 925mV
GPU[0] : 6: 1475Mhz 950mV
GPU[0] : 7: 1550Mhz 975mV
GPU[0] : OD_MCLK:
GPU[0] : 0: 167Mhz 800mV
GPU[0] : 1: 500Mhz 825mV
GPU[0] : 2: 800Mhz 865mV
GPU[0] : 3: 1025Mhz 975mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz
GPU[0] : MCLK: 167MHz 1500MHz
GPU[0] : VDDC: 800mV 1200mV
================================================================================
==============================End of ROCm SMI Log ==============================
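To confirm that the tables in a log like the one above really hold the requested values, the OD_SCLK and OD_MCLK rows can be parsed mechanically. This is a rough sketch that assumes the log format shown in this thread; parse_od_table is a hypothetical helper, not part of rocm-smi.

```python
import re

# Matches level rows such as "GPU[0] : 7: 1550Mhz 975mV" in the format shown above.
ROW = re.compile(r"(\d+):\s*(\d+)Mhz\s+(\d+)mV")
HEADER = re.compile(r"(OD_SCLK|OD_MCLK|OD_RANGE):")


def parse_od_table(text: str) -> dict:
    """Return {"OD_SCLK": {level: (mhz, mv)}, "OD_MCLK": {...}} from rocm-smi -a text."""
    tables = {}
    current = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            name = header.group(1)
            # OD_RANGE ends the per-level tables, so stop collecting rows there.
            current = None if name == "OD_RANGE" else tables.setdefault(name, {})
            continue
        row = ROW.search(line)
        if row and current is not None:
            level, mhz, mv = map(int, row.groups())
            current[level] = (mhz, mv)
    return tables


sample = """GPU[0] : OD_SCLK:
GPU[0] : 6: 1475Mhz 950mV
GPU[0] : 7: 1550Mhz 975mV
GPU[0] : OD_MCLK:
GPU[0] : 3: 1025Mhz 975mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz"""
tables = parse_od_table(sample)
print(tables["OD_SCLK"][7])
```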
Then I run the Heaven benchmark while monitoring the output of sudo watch -n 0.5 cat /sys/kernel/debug/dri/0/amdgpu_pm_info in a terminal window:
GFX Clocks and Power:
800 MHz (MCLK)
1447 MHz (SCLK)
1138 MHz (PSTATE_SCLK)
800 MHz (PSTATE_MCLK)
1200 mV (VDDGFX)
246.0 W (average GPU)
GPU Temperature: 62 C
GPU Load: 99 %
SMC Feature Mask: 0x000000001ba1fb4f
UVD: Disabled
VCE: Disabled
The voltage is stuck at 1200mV and the memory frequency at 800 MHz (sometimes the memory frequency is stuck at 167 MHz instead). With other frequency and voltage combinations the card does reach P7 and M3, but it completely ignores the voltage setting.
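The symptom can be checked automatically by comparing the VDDGFX reading against the highest voltage configured in OD_SCLK. The sketch below assumes the amdgpu_pm_info line format shown above; both function names are hypothetical.

```python
import re


def parse_pm_info(text: str) -> dict:
    """Map labels such as "VDDGFX" or "SCLK" to their numeric readings."""
    values = {}
    pattern = r"([\d.]+)\s*(?:MHz|mV|W)\s*\(([^)]+)\)"
    for number, label in re.findall(pattern, text):
        values[label] = float(number)
    return values


def voltage_exceeds_table(pm_text: str, max_table_mv: int) -> bool:
    """True when the reported VDDGFX is above the highest voltage set in OD_SCLK."""
    return parse_pm_info(pm_text)["VDDGFX"] > max_table_mv


sample = """800 MHz (MCLK)
1447 MHz (SCLK)
1200 mV (VDDGFX)
246.0 W (average GPU)"""

# 975 mV is the highest OD_SCLK voltage configured earlier in this report.
print(voltage_exceeds_table(sample, max_table_mv=975))
```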
Example of values where the voltage setting is completely ignored:
OD_SCLK:
0: 852Mhz 800mV
1: 991Mhz 900mV
2: 1084Mhz 910mV
3: 1138Mhz 930mV
4: 1250Mhz 945mV
5: 1375Mhz 960mV
6: 1475Mhz 995mV
7: 1550Mhz 1050mV
OD_MCLK:
0: 167Mhz 800mV
1: 500Mhz 850mV
2: 800Mhz 910mV
3: 1000Mhz 1050mV
OD_RANGE:
SCLK: 852MHz 2400MHz
MCLK: 167MHz 1500MHz
VDDC: 800mV 1200mV
Thanks,
Can you try the 2.5 release? The upstream team made a couple fixes for voltage on Vega10 (which is the base for your Vega64). Thanks!
Same here...i'm getting mad...
...causing mining instability and low hashrates, but high power draw if --setpoweroverdrive is not set... Other suspected culprits: the kernel and/or the OpenCL drivers.
@arS-en Did you try the 2.5 build? The upstream kernel made some voltage fixes for Vega10 that may help
Thanks for the hard work ROC team! Compute with Radeon is a huge improvement from a year ago.
Had similar issues on a water-cooled Vega 64 reference card on Ubuntu 18.04.2 Server with ROCm 2.5-27.
After trial and error I was able to beat the performance and efficiency of Windows 10 x64 on Ubuntu 18.04.2 Server.
Key takeaway/what worked for me:
rocm-smi -a may indicate it has
mclk perf level can modify the sclk perf level and vice versa, even on manual perf mode
rocm-smi --resetclocks and then apply any modifications
Hope this helps.
I'd like to see rocm-smi become more consistent and accurate in reporting modifications, and report when no action was taken instead of failing silently. Maybe functionality like this is best done via sysfs directly.
final script:
rocm-smi --autorespond y --setpoweroverdrive 130
rocm-smi --autorespond y --setmlevel 2 800 800
rocm-smi --autorespond y --setslevel 0 930 805
rocm-smi --autorespond y --setmlevel 3 1100 805
rocm-smi --autorespond y --setsclk 0 --setmclk 3
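For the sysfs route suggested above, a rough equivalent of the level edits in this script could look like the sketch below. It rests on assumptions not confirmed in this thread: that the card is card0, that the Vega10 pp_od_clk_voltage file accepts "s LEVEL MHZ MV" and "m LEVEL MHZ MV" lines followed by "c" to commit, and that the performance level has already been set to manual. The function names are hypothetical.

```python
# ASSUMPTIONS: card0 is the Vega 64; pp_od_clk_voltage accepts "s"/"m" rows and
# "c" to commit (Vega10-style interface); perf level is already "manual".
OD_PATH = "/sys/class/drm/card0/device/pp_od_clk_voltage"


def od_commands(sclk: dict, mclk: dict) -> list:
    """Build the strings to write, one per write, ending with a commit."""
    cmds = [f"s {lvl} {mhz} {mv}" for lvl, (mhz, mv) in sorted(sclk.items())]
    cmds += [f"m {lvl} {mhz} {mv}" for lvl, (mhz, mv) in sorted(mclk.items())]
    cmds.append("c")
    return cmds


def apply_commands(cmds: list, path: str = OD_PATH) -> None:
    """Write each command separately; requires root and is not called here."""
    for cmd in cmds:
        with open(path, "w") as f:
            f.write(cmd + "\n")


# Same level values as the rocm-smi script above.
cmds = od_commands({0: (930, 805)}, {2: (800, 800), 3: (1100, 805)})
print(cmds)
```

Reading the file back after the commit is the simplest way to see whether the driver accepted the values.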
Couldn't reproduce it on VG10 on 2.7. Closing for now unless the issue persists on your system
Hi @kentrussell
I am still facing this issue on 2.7. I previously described two problems:
voltage getting stuck @ 1200mV
and memory frequency at 800 MHz (sometimes the memory frequency will be stuck at 167MHz)
The first issue is still there. Whenever the card reaches P7, the voltage cap goes out of the window and shoots up to 1200mV. The memory frequency bug seems to have been fixed in the sense that it no longer gets stuck @ 167 or 800 MHz but at the M3 value, in my case 1050MHz.
========================ROCm System Management Interface========================
================================================================================
GPU[0] : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0] : Temperature (Sensor edge) (c): 54.0
GPU[0] : Temperature (Sensor junction) (c): 56.0
GPU[0] : Temperature (Sensor mem) (c): 52.0
================================================================================
================================================================================
GPU[0] : dcefclk clock level: 0 (600Mhz)
GPU[0] : mclk clock level: 3 (1050Mhz)
GPU[0] : pcie clock level: 1 (8.0GT/s, x16)
GPU[0] : sclk clock level: 0 (852Mhz)
GPU[0] : socclk clock level: 4 (900Mhz)
================================================================================
================================================================================
GPU[0] : Fan Level: 61 (23%)
================================================================================
================================================================================
GPU[0] : Performance Level: manual
================================================================================
================================================================================
================================================================================
================================================================================
GPU[0] : GPU Memory OverDrive value (%): 12
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0] : Max Graphics Package Power (W): 320.0
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power (W): 27.0
================================================================================
================================================================================
GPU[0] : Supported dcefclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 167Mhz
GPU[0] : 1: 500Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 1050Mhz *
GPU[0] :
GPU[0] : Supported pcie frequencies on GPU0
GPU[0] : 0: 8.0GT/s, x16
GPU[0] : 1: 8.0GT/s, x16 *
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 852Mhz *
GPU[0] : 1: 991Mhz
GPU[0] : 2: 1050Mhz
GPU[0] : 3: 1125Mhz
GPU[0] : 4: 1200Mhz
GPU[0] : 5: 1275Mhz
GPU[0] : 6: 1375Mhz
GPU[0] : 7: 1450Mhz
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 600Mhz
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz *
GPU[0] : 5: 960Mhz
GPU[0] : 6: 1028Mhz
GPU[0] : 7: 1107Mhz
GPU[0] :
================================================================================
================================================================================
GPU[0] : GPU use (%): 1
================================================================================
================================================================================
================================================================================
================================================================================
GPU[0] : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0] : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0] : Serial Number: N/A
================================================================================
PIDs for KFD processes:
================================================================================
GPU[0] : OD_SCLK:
GPU[0] : 0: 852Mhz 800mV
GPU[0] : 1: 991Mhz 900mV
GPU[0] : 2: 1050Mhz 910mV
GPU[0] : 3: 1125Mhz 920mV
GPU[0] : 4: 1200Mhz 930mV
GPU[0] : 5: 1275Mhz 940mV
GPU[0] : 6: 1375Mhz 945mV
GPU[0] : 7: 1450Mhz 950mV
GPU[0] : OD_MCLK:
GPU[0] : 0: 167Mhz 800mV
GPU[0] : 1: 500Mhz 850mV
GPU[0] : 2: 800Mhz 910mV
GPU[0] : 3: 1050Mhz 950mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz
GPU[0] : MCLK: 167MHz 1500MHz
GPU[0] : VDDC: 800mV 1200mV
================================================================================
================================================================================
GPU[0] : Voltage (mV): 1200
================================================================================
==============================End of ROCm SMI Log ==============================
For mclk getting stuck at level 3, does it happen if you do "--setmclk 0 1 2 3"? --setmclk 3 just tells it to stick at level 3, while --setmclk 0 1 2 3 will let it alternate between levels based on workload. As for the voltage sticking at 1200mV, that's definitely more concerning and I'll see what I can reproduce here.
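The difference between the two invocations can be pictured as a set of permitted levels. The sketch below assumes that set is represented as a bitmask with one bit per DPM level. It illustrates the idea only and is not rocm-smi's or the kernel's actual code; both helpers are hypothetical.

```python
def levels_to_mask(levels) -> int:
    """Turn a list of permitted DPM levels into a bitmask, one bit per level."""
    mask = 0
    for level in levels:
        mask |= 1 << level
    return mask


def permitted(mask: int, level: int) -> bool:
    """True when the given level may be selected under this mask."""
    return bool(mask & (1 << level))


pinned = levels_to_mask([3])              # like --setmclk 3: only level 3 allowed
flexible = levels_to_mask([0, 1, 2, 3])   # like --setmclk 0 1 2 3: all levels allowed
print(bin(pinned), bin(flexible))
```

With only level 3 permitted there is nothing lower to drop to when idle, which matches mclk staying at the M3 value.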
Hi Kent,
Thank you for your continued assistance here. To be sure, I rebooted the machine and decided to start step by step from scratch, documenting every move I make (so to speak).
Output of /opt/rocm/bin/rocm-smi -a before I do anything (GPU idle):
========================ROCm System Management Interface========================
================================================================================
GPU[0] : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0] : Temperature (Sensor edge) (c): 50.0
GPU[0] : Temperature (Sensor junction) (c): 52.0
GPU[0] : Temperature (Sensor mem) (c): 50.0
================================================================================
================================================================================
GPU[0] : dcefclk clock level: 0 (600Mhz)
GPU[0] : mclk clock level: 0 (167Mhz)
GPU[0] : pcie clock level: 0 (8.0GT/s, x16)
GPU[0] : sclk clock level: 0 (852Mhz)
GPU[0] : socclk clock level: 0 (600Mhz)
================================================================================
================================================================================
GPU[0] : Fan Level: 79 (30%)
================================================================================
================================================================================
GPU[0] : Performance Level: auto
================================================================================
================================================================================
GPU[0] : GPU OverDrive value (%): 0
================================================================================
================================================================================
GPU[0] : GPU Memory OverDrive value (%): 0
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0] : Max Graphics Package Power (W): 247.0
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power (W): 6.0
================================================================================
================================================================================
GPU[0] : Supported dcefclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 167Mhz *
GPU[0] : 1: 500Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 945Mhz
GPU[0] :
GPU[0] : Supported pcie frequencies on GPU0
GPU[0] : 0: 8.0GT/s, x16 *
GPU[0] : 1: 8.0GT/s, x16
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 852Mhz *
GPU[0] : 1: 991Mhz
GPU[0] : 2: 1084Mhz
GPU[0] : 3: 1138Mhz
GPU[0] : 4: 1200Mhz
GPU[0] : 5: 1401Mhz
GPU[0] : 6: 1536Mhz
GPU[0] : 7: 1630Mhz
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] : 5: 960Mhz
GPU[0] : 6: 1028Mhz
GPU[0] : 7: 1107Mhz
GPU[0] :
================================================================================
================================================================================
GPU[0] : GPU use (%): 3
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0] : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0] : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0] : Serial Number: N/A
================================================================================
PIDs for KFD processes:
================================================================================
GPU[0] : OD_SCLK:
GPU[0] : 0: 852Mhz 800mV
GPU[0] : 1: 991Mhz 900mV
GPU[0] : 2: 1084Mhz 950mV
GPU[0] : 3: 1138Mhz 1000mV
GPU[0] : 4: 1200Mhz 1050mV
GPU[0] : 5: 1401Mhz 1100mV
GPU[0] : 6: 1536Mhz 1150mV
GPU[0] : 7: 1630Mhz 1200mV
GPU[0] : OD_MCLK:
GPU[0] : 0: 167Mhz 800mV
GPU[0] : 1: 500Mhz 800mV
GPU[0] : 2: 800Mhz 950mV
GPU[0] : 3: 945Mhz 1100mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz
GPU[0] : MCLK: 167MHz 1500MHz
GPU[0] : VDDC: 800mV 1200mV
================================================================================
================================================================================
GPU[0] : Voltage (mV): 787
================================================================================
==============================End of ROCm SMI Log ==============================
Then I set performance level to manual as follows:
/opt/rocm/bin/rocm-smi --setperflevel manual
Output:
========================ROCm System Management Interface========================
================================================================================
GPU[0] : Successfully set current Performance Level to manual
================================================================================
==============================End of ROCm SMI Log ==============================
Then I execute the following commands:
/opt/rocm/bin/rocm-smi --setslevel 0 852 800
/opt/rocm/bin/rocm-smi --setslevel 1 991 900
/opt/rocm/bin/rocm-smi --setslevel 2 1050 910
/opt/rocm/bin/rocm-smi --setslevel 3 1125 920
/opt/rocm/bin/rocm-smi --setslevel 4 1200 930
/opt/rocm/bin/rocm-smi --setslevel 5 1275 940
/opt/rocm/bin/rocm-smi --setslevel 6 1375 955
/opt/rocm/bin/rocm-smi --setslevel 7 1450 975
and:
/opt/rocm/bin/rocm-smi --setmlevel 0 167 800
/opt/rocm/bin/rocm-smi --setmlevel 1 500 850
/opt/rocm/bin/rocm-smi --setmlevel 2 800 910
/opt/rocm/bin/rocm-smi --setmlevel 3 1025 975
Then:
/opt/rocm/bin/rocm-smi --setmclk 0 1 2 3
They all execute successfully.
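To make a sequence like this repeatable, the commands can be generated from a table instead of typed by hand. The sketch below uses only the rocm-smi flags that appear in this thread; the helper names and the dry-run switch are assumptions added for illustration. Depending on the rocm-smi version, the tool may ask for confirmation, which the --autorespond y flag used earlier in the thread is meant to answer.

```python
import subprocess

ROCM_SMI = "/opt/rocm/bin/rocm-smi"

# level: (MHz, mV), the values used in the step-by-step test above
SCLK_LEVELS = {0: (852, 800), 1: (991, 900), 2: (1050, 910), 3: (1125, 920),
               4: (1200, 930), 5: (1275, 940), 6: (1375, 955), 7: (1450, 975)}
MCLK_LEVELS = {0: (167, 800), 1: (500, 850), 2: (800, 910), 3: (1025, 975)}


def build_commands(sclk: dict, mclk: dict) -> list:
    """Return one argv list per rocm-smi call, ending with the --setmclk mask."""
    cmds = []
    for flag, table in (("--setslevel", sclk), ("--setmlevel", mclk)):
        for level, (mhz, mv) in sorted(table.items()):
            cmds.append([ROCM_SMI, flag, str(level), str(mhz), str(mv)])
    cmds.append([ROCM_SMI, "--setmclk", *map(str, sorted(mclk))])
    return cmds


def run_all(cmds: list, dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


commands = build_commands(SCLK_LEVELS, MCLK_LEVELS)
run_all(commands)  # dry run: only prints the command lines
```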
Output of /opt/rocm/bin/rocm-smi -a:
========================ROCm System Management Interface========================
================================================================================
GPU[0] : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0] : Temperature (Sensor edge) (c): 47.0
GPU[0] : Temperature (Sensor junction) (c): 48.0
GPU[0] : Temperature (Sensor mem) (c): 47.0
================================================================================
================================================================================
GPU[0] : dcefclk clock level: 0 (600Mhz)
GPU[0] : mclk clock level: 0 (167Mhz)
GPU[0] : pcie clock level: 1 (8.0GT/s, x16)
GPU[0] : sclk clock level: 0 (852Mhz)
GPU[0] : socclk clock level: 0 (600Mhz)
================================================================================
================================================================================
GPU[0] : Fan Level: 79 (30%)
================================================================================
================================================================================
GPU[0] : Performance Level: manual
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU OverDrive value
================================================================================
================================================================================
GPU[0] : GPU Memory OverDrive value (%): 9
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0] : Max Graphics Package Power (W): 247.0
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power (W): 6.0
================================================================================
================================================================================
GPU[0] : Supported dcefclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 167Mhz *
GPU[0] : 1: 500Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 1025Mhz
GPU[0] :
GPU[0] : Supported pcie frequencies on GPU0
GPU[0] : 0: 8.0GT/s, x16 *
GPU[0] : 1: 8.0GT/s, x16
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 852Mhz *
GPU[0] : 1: 991Mhz
GPU[0] : 2: 1050Mhz
GPU[0] : 3: 1125Mhz
GPU[0] : 4: 1200Mhz
GPU[0] : 5: 1275Mhz
GPU[0] : 6: 1375Mhz
GPU[0] : 7: 1450Mhz
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] : 5: 960Mhz
GPU[0] : 6: 1028Mhz
GPU[0] : 7: 1107Mhz
GPU[0] :
================================================================================
================================================================================
GPU[0] : GPU use (%): 14
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0] : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0] : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0] : Serial Number: N/A
================================================================================
PIDs for KFD processes:
================================================================================
GPU[0] : OD_SCLK:
GPU[0] : 0: 852Mhz 800mV
GPU[0] : 1: 991Mhz 900mV
GPU[0] : 2: 1050Mhz 910mV
GPU[0] : 3: 1125Mhz 920mV
GPU[0] : 4: 1200Mhz 930mV
GPU[0] : 5: 1275Mhz 940mV
GPU[0] : 6: 1375Mhz 955mV
GPU[0] : 7: 1450Mhz 975mV
GPU[0] : OD_MCLK:
GPU[0] : 0: 167Mhz 800mV
GPU[0] : 1: 500Mhz 850mV
GPU[0] : 2: 800Mhz 910mV
GPU[0] : 3: 1025Mhz 975mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz
GPU[0] : MCLK: 167MHz 1500MHz
GPU[0] : VDDC: 800mV 1200mV
================================================================================
================================================================================
GPU[0] : Voltage (mV): 850
================================================================================
==============================End of ROCm SMI Log ==============================
Then I run sudo watch -n 0.5 cat /sys/kernel/debug/dri/0/amdgpu_pm_info in one terminal and the Heaven benchmark in another.
Load:
Idle:
Again, the output of /opt/rocm/bin/rocm-smi -a:
========================ROCm System Management Interface========================
================================================================================
GPU[0] : GPU ID: 0x687f
================================================================================
================================================================================
GPU[0] : Temperature (Sensor edge) (c): 54.0
GPU[0] : Temperature (Sensor junction) (c): 56.0
GPU[0] : Temperature (Sensor mem) (c): 53.0
================================================================================
================================================================================
GPU[0] : dcefclk clock level: 0 (600Mhz)
GPU[0] : mclk clock level: 3 (1025Mhz)
GPU[0] : pcie clock level: 1 (8.0GT/s, x16)
GPU[0] : sclk clock level: 0 (852Mhz)
GPU[0] : socclk clock level: 6 (1028Mhz)
================================================================================
================================================================================
GPU[0] : Fan Level: 61 (23%)
================================================================================
================================================================================
GPU[0] : Performance Level: manual
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU OverDrive value
================================================================================
================================================================================
GPU[0] : GPU Memory OverDrive value (%): 9
================================================================================
Driver version: 5.0.76
================================================================================
GPU[0] : Max Graphics Package Power (W): 247.0
================================================================================
================================================================================
GPU[0] :
GPU[0] : NUM MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[0] : 0 BOOTUP_DEFAULT*: 70 60 0 0
GPU[0] : 1 3D_FULL_SCREEN : 70 60 1 3
GPU[0] : 2 POWER_SAVING : 90 60 0 0
GPU[0] : 3 VIDEO : 70 60 0 0
GPU[0] : 4 VR : 70 90 0 0
GPU[0] : 5 COMPUTE : 30 60 0 6
GPU[0] : 6 CUSTOM : 0 0 0 0
================================================================================
================================================================================
GPU[0] : Average Graphics Package Power (W): 15.0
================================================================================
================================================================================
GPU[0] : Supported dcefclk frequencies on GPU0
GPU[0] : 0: 600Mhz *
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 167Mhz
GPU[0] : 1: 500Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 1025Mhz *
GPU[0] :
GPU[0] : Supported pcie frequencies on GPU0
GPU[0] : 0: 8.0GT/s, x16
GPU[0] : 1: 8.0GT/s, x16 *
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 852Mhz *
GPU[0] : 1: 991Mhz
GPU[0] : 2: 1050Mhz
GPU[0] : 3: 1125Mhz
GPU[0] : 4: 1200Mhz
GPU[0] : 5: 1275Mhz
GPU[0] : 6: 1375Mhz
GPU[0] : 7: 1450Mhz
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 600Mhz
GPU[0] : 1: 720Mhz
GPU[0] : 2: 800Mhz
GPU[0] : 3: 847Mhz
GPU[0] : 4: 900Mhz
GPU[0] : 5: 960Mhz
GPU[0] : 6: 1028Mhz *
GPU[0] : 7: 1107Mhz
GPU[0] :
================================================================================
================================================================================
GPU[0] : GPU use (%): 7
================================================================================
================================================================================
ERROR: GPU[0] : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[0] : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[0] : Unique ID: 0213fa86776a08a4
================================================================================
================================================================================
GPU[0] : Serial Number: N/A
================================================================================
PIDs for KFD processes:
================================================================================
GPU[0] : OD_SCLK:
GPU[0] : 0: 852Mhz 800mV
GPU[0] : 1: 991Mhz 900mV
GPU[0] : 2: 1050Mhz 910mV
GPU[0] : 3: 1125Mhz 920mV
GPU[0] : 4: 1200Mhz 930mV
GPU[0] : 5: 1275Mhz 940mV
GPU[0] : 6: 1375Mhz 955mV
GPU[0] : 7: 1450Mhz 975mV
GPU[0] : OD_MCLK:
GPU[0] : 0: 167Mhz 800mV
GPU[0] : 1: 500Mhz 850mV
GPU[0] : 2: 800Mhz 910mV
GPU[0] : 3: 1025Mhz 975mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz
GPU[0] : MCLK: 167MHz 1500MHz
GPU[0] : VDDC: 800mV 1200mV
================================================================================
================================================================================
GPU[0] : Voltage (mV): 1000
================================================================================
==============================End of ROCm SMI Log ==============================
In this case the voltage is stuck at 1000mV, most likely because I didn't raise the power limit to 330 watts.
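For anyone comparing these dumps by hand, here is a minimal Python sketch that pulls the OD_SCLK/OD_MCLK tables and the reported voltage out of a rocm-smi -a log and checks the live voltage against the table. The regular expressions are assumptions based only on the line layout shown in this thread, and `parse_smi_log` and `SAMPLE_LOG` are names made up for illustration; other rocm-smi versions may print things differently.

```python
import re

# Abbreviated excerpt of the log above; the values are copied from it.
SAMPLE_LOG = """\
GPU[0] : OD_SCLK:
GPU[0] : 0: 852Mhz 800mV
GPU[0] : 7: 1450Mhz 975mV
GPU[0] : OD_MCLK:
GPU[0] : 0: 167Mhz 800mV
GPU[0] : 3: 1025Mhz 975mV
GPU[0] : OD_RANGE:
GPU[0] : SCLK: 852MHz 2400MHz
GPU[0] : Voltage (mV): 1000
"""

# Assumed line shapes, taken from the log layout in this thread.
TABLE_RE = re.compile(r"^GPU\[\d+\] : (OD_\w+):$")
STATE_RE = re.compile(r"^GPU\[\d+\] : (\d+): (\d+)Mhz (\d+)mV$")
VOLTAGE_RE = re.compile(r"^GPU\[\d+\] : Voltage \(mV\): (\d+)$")


def parse_smi_log(text):
    """Return (tables, voltage); tables maps a table name to {level: (MHz, mV)}."""
    tables, current, voltage = {}, None, None
    for raw in text.splitlines():
        line = raw.strip()
        table = TABLE_RE.match(line)
        state = STATE_RE.match(line)
        volt = VOLTAGE_RE.match(line)
        if table:
            current = table.group(1)
            tables[current] = {}
        elif state and current:
            level, mhz, mv = (int(g) for g in state.groups())
            tables[current][level] = (mhz, mv)
        elif volt:
            voltage = int(volt.group(1))
    return tables, voltage


tables, voltage = parse_smi_log(SAMPLE_LOG)
max_table_mv = max(mv for _, mv in tables["OD_SCLK"].values())
print(f"reported {voltage} mV, highest OD_SCLK entry {max_table_mv} mV")
```

On the excerpt above, the reported 1000mV is higher than the highest OD_SCLK entry (975mV), which is the mismatch being discussed here.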
Out of curiosity, does dmesg end up throwing any information at all? I am hesitant to enable dynamic debugging because there will be an absolute boatload of extra information in there, but the plain dmesg might show something from powerplay that could help a bit too.
The only thing I see related to powerplay is this:
[ 3.019956] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega10_smu
Full dmesg attached.
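If the full dmesg is long, a small filter can pick out the powerplay lines. This is only a sketch: `powerplay_lines` is a hypothetical helper, the timestamp pattern is an assumption based on the single line quoted above, and the second sample line is a made-up placeholder rather than real kernel output.

```python
import re

# First line is quoted from this thread; the second is a made-up placeholder.
SAMPLE_DMESG = """\
[ 3.019956] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega10_smu
[ 3.100000] placeholder line unrelated to powerplay
"""

# Assumed layout: "[<seconds>] <message>".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def powerplay_lines(dmesg_text):
    """Return (timestamp, message) pairs for lines tagged [powerplay]."""
    hits = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if m and "[powerplay]" in m.group(2):
            hits.append((float(m.group(1)), m.group(2)))
    return hits


for stamp, message in powerplay_lines(SAMPLE_DMESG):
    print(stamp, message)
```

To run it on a real log, read the saved dmesg output into a string and pass that in instead of `SAMPLE_DMESG`.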
Darn, it was worth a shot. I'll get on this once I finish my current task, and see what I can find out. Thanks for your continued help and information!
Hi mate,
This patch fixes the issue for me. It would be great to see it applied upstream.
The patch has been applied to staging.
Great stuff, guys.
I consider this resolved.
I have a Gigabyte Radeon RX Vega 64 GAMING OC 8GB and I have been playing with undervolting/overclocking the card for a better performance/power draw ratio. I have set the kernel boot parameter correctly and modified the voltages/frequencies as follows:
I have also set the power limit to +50%.
However, the card completely ignores those values and the voltage stays at 1200mV with a power draw of 330 watts.
Any idea what could be the problem?
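For reference, on Vega 10 the overdrive table is normally edited by writing lines of the form "s <level> <MHz> <mV>" (sclk) or "m <level> <MHz> <mV>" (mclk) to the amdgpu pp_od_clk_voltage sysfs file, followed by "c" to commit. The sketch below only builds and range-checks those lines. The limits come from the OD_RANGE section of the rocm-smi log in this thread, the state values are the stock ones from the same log (not the modified values mentioned above), and the card0 path plus the `od_command` and `apply_commands` helpers are assumptions for illustration.

```python
# Limits copied from the OD_RANGE section of the rocm-smi log in this thread.
CLOCK_RANGE_MHZ = {"s": (852, 2400), "m": (167, 1500)}  # s = sclk, m = mclk
VDDC_RANGE_MV = (800, 1200)

# Assumed path; the card index can differ between systems.
OD_PATH = "/sys/class/drm/card0/device/pp_od_clk_voltage"


def od_command(kind, level, mhz, mv):
    """Build one pp_od_clk_voltage line such as 's 7 1450 975'."""
    if kind not in CLOCK_RANGE_MHZ:
        raise ValueError(f"unknown table {kind!r}, expected 's' or 'm'")
    lo, hi = CLOCK_RANGE_MHZ[kind]
    if not lo <= mhz <= hi:
        raise ValueError(f"{mhz} MHz is outside {lo}-{hi} MHz")
    if not VDDC_RANGE_MV[0] <= mv <= VDDC_RANGE_MV[1]:
        raise ValueError(f"{mv} mV is outside the VDDC range")
    return f"{kind} {level} {mhz} {mv}"


def apply_commands(commands, path=OD_PATH, dry_run=True):
    """Append the commit command; write to sysfs only when dry_run is False."""
    lines = list(commands) + ["c"]
    if not dry_run:
        # Needs root; each command goes to the file in its own write.
        for line in lines:
            with open(path, "w") as handle:
                handle.write(line + "\n")
    return lines


# Stock top states from the log above, used here purely as sample input.
plan = apply_commands([
    od_command("s", 7, 1450, 975),
    od_command("m", 3, 1025, 975),
])
print("\n".join(plan))
```

With dry_run left at its default nothing is written, so the output can be compared against what was actually echoed into the file before assuming the driver is ignoring it.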