ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md

mclk stuck at level 3 after setting `--setperflevel high` #87

Closed brian-maher closed 3 years ago

brian-maher commented 4 years ago

Hi,

Bit of a weird one, but fully reproducible on my system. Upon first boot, my mclk sits at level 0 (167 MHz).

If I set --setperflevel high and do anything with the card, the mclk jumps to level 3. When the card is no longer in use and I set --setperflevel low, the sclk goes back to level 0, but the mclk stays at level 3.

Although rocm-smi reports no extra power usage (it stays at 3 W), my UPS shows an additional load of 2% (it is rated for 700 W), so the card appears to be drawing roughly 14 W more that goes unreported.
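As a rough check of that figure, the arithmetic can be written out as below. This is only a sketch using the numbers quoted in this post. The assumption that the UPS display reports whole percentages is mine, and the variable names are made up for illustration.

```python
# Back-of-the-envelope estimate of the extra draw seen at the UPS.
# All figures come from the post; nothing here is measured by the script.
UPS_RATED_W = 700.0      # UPS rating quoted in the post
LOAD_DELTA_PCT = 2.0     # additional load shown by the UPS
SMI_REPORTED_W = 3.0     # "Average Graphics Package Power" from rocm-smi


def load_to_watts(rated_w: float, load_pct: float) -> float:
    """Convert a UPS load percentage into watts."""
    return rated_w * load_pct / 100.0


extra_w = load_to_watts(UPS_RATED_W, LOAD_DELTA_PCT)

# Assumption: the UPS shows whole percentages, so one display step
# corresponds to this many watts and the estimate is coarse.
step_w = load_to_watts(UPS_RATED_W, 1.0)

print(f"Extra draw at the UPS: ~{extra_w:.0f} W (one display step = {step_w:.0f} W)")
print(f"rocm-smi reports: {SMI_REPORTED_W:.0f} W")
```

Because one display step is several watts wide, the 14 W figure should be read as an approximation and not as a precise measurement.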

Manually setting the mclk to 0 doesn't work. Manually changing the sclk, however, does.

If I repeat the process using auto instead (i.e. reboot, --setperflevel auto, then --setperflevel low), this doesn't seem to occur and the card settles back down to level 0 nicely.

The card is a Vega Frontier using ROCm Version: 3.3, rocm-smi installed package is: 1.0.0-199-rocm-rel-3.3-19-ga9d6426

Some further info from the card (you'll notice no load, and low perf mode but high mem clocks):

========================ROCm System Management Interface========================
Driver version: 5.4.8
================================================================================
GPU[1]      : GPU ID: 0x6863
================================================================================
================================================================================
GPU[1]      : VBIOS version: 113-D0501100-109
================================================================================
================================================================================
GPU[1]      : Temperature (Sensor edge) (C): 43.0
GPU[1]      : Temperature (Sensor junction) (C): 43.0
GPU[1]      : Temperature (Sensor mem) (C): 43.0
================================================================================
================================================================================
GPU[1]      : dcefclk clock level: 0 (600Mhz)
GPU[1]      : mclk clock level: 3 (945Mhz)
GPU[1]      : pcie clock level: 1 (8.0GT/s, x16)
GPU[1]      : sclk clock level: 0 (852Mhz)
GPU[1]      : socclk clock level: 4 (1028Mhz)
================================================================================
================================================================================
GPU[1]      : Fan Level: 51 (20%)
================================================================================
================================================================================
GPU[1]      : Performance Level: low
================================================================================
================================================================================
GPU[1]      : GPU OverDrive value (%): 0
================================================================================
================================================================================
GPU[1]      : GPU Memory OverDrive value (%): 0
================================================================================
================================================================================
GPU[1]      : Max Graphics Package Power (W): 220.0
================================================================================
================================================================================
GPU[1]      : 
GPU[1]      : NUM        MODE_NAME BUSY_SET_POINT FPS USE_RLC_BUSY MIN_ACTIVE_LEVEL
GPU[1]      :   0 BOOTUP_DEFAULT*:             70  60          0              0
GPU[1]      :   1 3D_FULL_SCREEN :             70  60          1              3
GPU[1]      :   2   POWER_SAVING :             90  60          0              0
GPU[1]      :   3          VIDEO :             70  60          0              0
GPU[1]      :   4             VR :             70  90          0              0
GPU[1]      :   5        COMPUTE :             30  60          0              6
GPU[1]      :   6         CUSTOM :              0   0          0              0
================================================================================
================================================================================
GPU[1]      : Average Graphics Package Power (W): 3.0
================================================================================
================================================================================
GPU[1]      : Supported dcefclk frequencies on GPU1
GPU[1]      : 0: 600Mhz *
GPU[1]      : 1: 720Mhz 
GPU[1]      : 2: 800Mhz 
GPU[1]      : 3: 900Mhz 
GPU[1]      : 
GPU[1]      : Supported mclk frequencies on GPU1
GPU[1]      : 0: 167Mhz 
GPU[1]      : 1: 500Mhz 
GPU[1]      : 2: 800Mhz 
GPU[1]      : 3: 945Mhz *
GPU[1]      : 
GPU[1]      : Supported pcie frequencies on GPU1
GPU[1]      : 0: 8.0GT/s, x16 
GPU[1]      : 1: 8.0GT/s, x16 *
GPU[1]      : 
GPU[1]      : Supported sclk frequencies on GPU1
GPU[1]      : 0: 852Mhz *
GPU[1]      : 1: 991Mhz 
GPU[1]      : 2: 1138Mhz 
GPU[1]      : 3: 1269Mhz 
GPU[1]      : 4: 1348Mhz 
GPU[1]      : 5: 1440Mhz 
GPU[1]      : 6: 1528Mhz 
GPU[1]      : 7: 1600Mhz 
GPU[1]      : 
GPU[1]      : Supported socclk frequencies on GPU1
GPU[1]      : 0: 600Mhz 
GPU[1]      : 1: 720Mhz 
GPU[1]      : 2: 847Mhz 
GPU[1]      : 3: 960Mhz 
GPU[1]      : 4: 1028Mhz *
GPU[1]      : 5: 1107Mhz 
GPU[1]      : 
================================================================================
================================================================================
GPU[1]      : GPU use (%): 0
================================================================================
================================================================================
ERROR: GPU[1]       : Unable to get GPU memory use.
================================================================================
================================================================================
GPU[1]      : GPU memory vendor: samsung
================================================================================
================================================================================
GPU[1]      : PCIe Replay Count: 0
================================================================================
================================================================================
GPU[1]      : Unique ID: 0213f2b91de40904
================================================================================
================================================================================
GPU[1]      : Serial Number: N/A
================================================================================
PIDs for KFD processes:

================================================================================
ERROR: GPU[1]       : Unable to display PowerPlay table
================================================================================
================================================================================
GPU[1]      : Voltage (mV): 956
================================================================================
================================================================================
GPU[1]      : PCI Bus: 0000:1b:00.0
================================================================================
================================================================================
GPU[1]      : ASD firmware version:     553648167
GPU[1]      : CE firmware version:      79
GPU[1]      : DMCU firmware version:    0
GPU[1]      : MC firmware version:      0
GPU[1]      : ME firmware version:      163
GPU[1]      : MEC firmware version:     33203
GPU[1]      : MEC2 firmware version:    33203
GPU[1]      : PFP firmware version:     187
GPU[1]      : RLC firmware version:     96
GPU[1]      : RLC SRLC firmware version:    0
GPU[1]      : RLC SRLG firmware version:    0
GPU[1]      : RLC SRLS firmware version:    0
GPU[1]      : SDMA firmware version:    432
GPU[1]      : SDMA2 firmware version:   432
GPU[1]      : SMC firmware version:     00.28.57.00
GPU[1]      : SOS firmware version:     0x0008025d
GPU[1]      : TA RAS firmware version:      0x00000000
GPU[1]      : TA XGMI firmware version:     0x00000000
GPU[1]      : UVD firmware version:     0x411d1100
GPU[1]      : VCE firmware version:     0x39040400
GPU[1]      : VCN firmware version:     0x00000000
================================================================================
================================================================================
GPU[1]      : Card series:      Vega 10 XTX [Radeon Vega Frontier Edition]
GPU[1]      : Card vendor:      Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]      : Card SKU:     D05011
================================================================================
================================================================================
================================================================================
GPU[1]      : Unable to display sclk range
GPU[1]      : Unable to display mclk range
GPU[1]      : Unable to display voltage range
GPU[1]      : Unable to get voltage curve
==============================End of ROCm SMI Log ==============================

kentrussell commented 4 years ago

This seems like a kernel bug, as the SMI is just a fancy interface for amdgpu's sysfs. What kind of monitor config are you using? I know that there are some weird bugs with multi-monitor setups and mclk, depending on the monitor/connector/refresh rate.
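To take the SMI out of the picture, the same information can be read from sysfs directly. The following is a minimal sketch, assuming the card is exposed as `card0` (the index varies between systems) and that `pp_dpm_mclk` uses the same `N: <freq>Mhz *` layout shown in the log above. The function name and the sample text are illustrative and not part of rocm-smi.

```python
import re
from pathlib import Path

# Assumed path: the card index (card0) depends on the system.
DPM_MCLK = Path("/sys/class/drm/card0/device/pp_dpm_mclk")

# Matches lines such as "3: 945Mhz *"; the trailing "*" marks the active level.
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*Mhz\s*(\*?)\s*$", re.IGNORECASE)

# Sample text copied from the mclk table in the log above.
SAMPLE = "0: 167Mhz \n1: 500Mhz \n2: 800Mhz \n3: 945Mhz *\n"


def parse_dpm_levels(text):
    """Return ({level: MHz}, active_level), with active_level None if unmarked."""
    levels = {}
    active = None
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if not match:
            continue
        level, mhz = int(match.group(1)), int(match.group(2))
        levels[level] = mhz
        if match.group(3):
            active = level
    return levels, active


if __name__ == "__main__":
    # Fall back to the sample when the sysfs file is not present.
    text = DPM_MCLK.read_text() if DPM_MCLK.exists() else SAMPLE
    levels, active = parse_dpm_levels(text)
    print(f"mclk levels: {levels}, active level: {active}")
```

If the active level read this way is still 3 after setting the perf level to low, the state is coming from the driver and not from anything the SMI does on top of it.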

brian-maher commented 4 years ago

There isn't any monitor connected to the card.

It's passed through to a VMware VM, which itself uses the default VMware display adapter for console output.

kentrussell commented 4 years ago

Thanks, that helps. Can you check dmesg after you set the clocks to low? I am hoping to see whether there is a "Failed to upload..." error from the PP table access. If not, is it possible to use the "auto" setting instead of "low"? If you need "low" to work, I'd suggest raising a ticket with the kernel guys, since this looks like a kernel bug. But first, let's check dmesg and see whether there is an error when actually setting the perf level to low.
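One way to do that check is to filter saved dmesg output for the message prefix. This is a small sketch only: the helper name is made up, and the sample lines are invented purely to exercise the filter. They are not real amdgpu output, and the full wording of the actual error is not reproduced here.

```python
def find_upload_failures(dmesg_text, needle="Failed to upload"):
    """Return the dmesg lines that contain the needle, ignoring case."""
    wanted = needle.lower()
    return [line for line in dmesg_text.splitlines() if wanted in line.lower()]


# Invented sample lines for illustration only (not real kernel output).
SAMPLE_DMESG = (
    "[  100.000000] example: device initialised\n"
    "[  101.000000] example: Failed to upload (placeholder text)\n"
    "[  102.000000] example: nothing of interest\n"
)

if __name__ == "__main__":
    for hit in find_upload_failures(SAMPLE_DMESG):
        print(hit)
```

On a real system the text would come from the kernel log, for example by saving the output of `dmesg` to a file right after setting the perf level to low and passing the file contents to the function.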

kentrussell commented 4 years ago

Sorry for the delay. This appears to be related to a known kernel bug. The 3.7 release should contain the fix, so if you can give the 3.7 kernel a shot, that should cover it. If it's still occurring, I think it would likely get fixed when https://gitlab.freedesktop.org/drm/amd/-/issues/801 gets fixed, but I am hoping that the other DPM cleanup has addressed it (since that bug report is about a laptop).

kentrussell commented 3 years ago

Closing this as 3.7 resolved this issue. If you have any issues, please open a new issue at https://github.com/RadeonOpenCompute/rocm_smi_lib, as this repo will be deprecated and all SMI CLI functionality has moved over there. Thank you!