Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

Radeon VII and possible glibc issue #8

Closed jf3player closed 5 years ago

jf3player commented 5 years ago

Hi Rick. This is Sean from the s@h boards. Card 1 is a Vega 56. Card 0 is a Radeon VII. System is Ubuntu 18.04 LTS, Ubuntu GLIBC 2.27-3ubuntu1. I'll include the output from amdgpu-ls (2.1.0) here as well:

./amdgpu-ls
AMD Wattman features enabled: 0xffffffff
amdgpu version: 18.50-725072
2 AMD GPUs detected
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/power1_cap_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/power1_cap
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/power1_average
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/temp1_input
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/temp1_crit
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_input
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_enable
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/pwm1
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/pwm1_max
Error: HW file doesn't exist: /sys/class/drm/card1/device/hwmon/hwmon1/in0_label
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap_max
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/power1_average
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/temp1_input
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/temp1_crit
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_enable
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_target
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_input
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/fan1_max
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/pwm1_enable
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/pwm1
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/pwm1_max
Error: HW file doesn't exist: /sys/class/drm/card0/device/hwmon/hwmon0/in0_label
2 are Compatible

Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
UUID: 5d20111fb1d24b97a38ea653c57c55af
Card Model: Vega 10 XT [Radeon RX Vega 64]
Short Card Model: RX Vega 64
Card Number: 1
Card Path: /sys/class/drm/card1/device/
PCIe ID: 06:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card1/device/hwmon/hwmon1/
Current Power (W): -1
Power Cap (W): -1
Power Cap Range (W): [-1, -1]
Fan Enable: -1
Fan PWM Mode: [-1, 'UNK']
Current Fan PWM (%): -1
Current Fan Speed (rpm): -1
Fan Target Speed (rpm): -1
Fan Speed Range (rpm): [-1, -1]
Fan PWM Range (%): [-1, -1]
Current Temp (C): -1
Critical Temp (C): -1
Current VddGFX (mV): -1
Vddc Range: ['800mV', '1200mV']
Current Loading (%): 83
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D0500300-101
Current SCLK P-State: 7
Current SCLK: 1590Mhz
SCLK Range: ['852MHz', '2400MHz']
Current MCLK P-State: 3
Current MCLK: 800Mhz
MCLK Range: ['167MHz', '1500MHz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto

UUID: 2ffbc1178e06458783b121e71dc487bd
Card Model: Device 081e
Short Card Model: Device 081e
Card Number: 0
Card Path: /sys/class/drm/card0/device/
PCIe ID: 03:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card0/device/hwmon/hwmon0/
Current Power (W): -1
Power Cap (W): -1
Power Cap Range (W): [-1, -1]
Fan Enable: -1
Fan PWM Mode: [-1, 'UNK']
Current Fan PWM (%): -1
Current Fan Speed (rpm): -1
Fan Target Speed (rpm): -1
Fan Speed Range (rpm): [-1, -1]
Fan PWM Range (%): [-1, -1]
Current Temp (C): -1
Critical Temp (C): -1
Current VddGFX (mV): -1
Vddc Range: ['', '']
Current Loading (%): 97
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D3600200-105
Current SCLK P-State: -1
Current SCLK:
SCLK Range: ['808Mhz', '2200Mhz']
Current MCLK P-State: -1
Current MCLK:
MCLK Range: ['351Mhz', '1200Mhz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto

Ricks-Lab commented 5 years ago

Hi Sean, thanks for opening this issue! I probably don't track the SETI forums as much as I should, but I'm curious about the background for updating GLIBC. What is the purpose and process? I checked the version on my system and it looks to be the same as yours, so I'm wondering why your system lists it as if it were changed (in your SETI computer listing).

Can you check the contents of /sys/class/drm/card1/device/hwmon and /sys/class/drm/card1/device/hwmon/hwmon0/?

The cards are supposed to have dpm enabled by default, but specifying amdgpu.dpm=1 can be used to enable it. I doubt this is the problem, since I did not have to enable it on systems I have worked with.

Also, I just noticed that the revision of my amdgpu drivers is slightly different from yours: 18.50-708488. I will try updating mine to the latest to make sure that it is not the cause.

Ricks-Lab commented 5 years ago

I just upgraded to 18.50-725072 with no issues.

jf3player commented 5 years ago

Can't say that I (knowingly) updated GLIBC. The Ubuntu install was downloaded and installed clean early February. /sys/class/drm/card1/device/hwmon is empty except for subfolder hwmon2! hwmon0 does not exist. hwmon2 contains

subfolders: device, power, and subsystem
files: fan1_enable, fan1_input, fan1_max, fan1_min, fan1_target, in0_input, in0_label, name, power1_average, power1_cap, power1_cap_max, power1_cap_min, pwm1, pwm1_enable, pwm1_max, pwm1_min, temp1_crit, temp1_crit_hyst, temp1_input, uevent

Ricks-Lab commented 5 years ago

Your system seems to be behaving differently from everything I have read about what should be happening. The rule that I am using is that the hwmonX directory inside the hwmon directory should have X equal to the card number. So card1 should have a hwmon1 directory. Is it possible that something went wrong in the driver installation? Can you try uninstalling and re-installing to see if the directory structure changes? Are you installing with this command? sudo ./amdgpu-install -y --opencl=pal

Did you have a third card installed during driver installation and remove it afterward, or something like that?

I know there were some people installing a different glibc for some reason at SETI, so when I saw your OS description including a glibc indicator, I assumed you were running a custom one. My system doesn't show anything about glibc. But it looks like your glibc is the same as mine, so maybe that isn't an issue.

Ricks-Lab commented 5 years ago

I just checked rocm-smi and found it is not using the rule that I assumed. I will update my approach tomorrow.
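The failure mode discussed above (assuming hwmonX has X equal to the card number) can be avoided by listing the hwmon directory instead of constructing its name. The following is only a minimal sketch of that idea, not the actual gpu-utils or rocm-smi code; the function name `find_hwmon_dir` is hypothetical.

```python
import glob
import os


def find_hwmon_dir(card_path):
    """Return the first hwmon* subdirectory under <card_path>/hwmon, or None.

    Hypothetical helper: it makes no assumption about the number in the
    directory name, so card1 with hwmon2 is handled like any other case.
    """
    candidates = sorted(glob.glob(os.path.join(card_path, "hwmon", "hwmon*")))
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


if __name__ == "__main__":
    # Card path taken from the amdgpu-ls output in this thread.
    print(find_hwmon_dir("/sys/class/drm/card1/device"))
```

On the system in this thread, such a lookup would be expected to resolve card1 to its hwmon2 directory rather than the non-existent hwmon1.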

Ricks-Lab commented 5 years ago

I have made the change in master, so it should find hwmon files in your case. Let me know if this fixes the hwmon read errors.

But this will not fix the Radeon VII p-state reading issue. Did you install Ubuntu using a download from this site: https://www.ubuntu.com/download/desktop

jf3player commented 5 years ago

Yes, 18.04.2 LTS from that link. And yes, everything looks to be showing now for the Vega 56! You are correct the VII is missing p-states and clock speeds. The GPU driver was installed with "sudo ./amdgpu-pro-install -y --opencl=pal,legacy"

No problems with the install that I'm aware of and no cards have been added or removed since installing Ubuntu. I do have a PCI wireless NIC in the system. If I have time this weekend, I plan to pull the wireless card out and can try a fresh re-install of Ubuntu and AMD driver.

amdgpu-ls output is now:

./amdgpu-ls
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 18.50-725072
2 AMD GPUs detected
2 are Compatible

Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
UUID: ec148bdcdc5542ff9195b85bd6a6c62f
Device ID: {'vendor': '0x1002', 'device': '0x687f', 'subsystem_vendor': '0x1002', 'subsystem_device': '0x6b76'}
Decoded Device ID: Vega 10 XL/XT [Radeon RX Vega 56/64]
Card Model: Vega 10 XT [Radeon RX Vega 64] (rev c3)
Short Card Model: RX Vega 64
Display Card Model: RX Vega 64
Card Number: 1
Card Path: /sys/class/drm/card1/device/
PCIe ID: 06:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card1/device/hwmon/hwmon2/
Current Power (W): 166.0
Power Cap (W): 165.0
Power Cap Range (W): [0, 165]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Current Fan PWM (%): 47
Current Fan Speed (rpm): 2390
Fan Target Speed (rpm): 2390
Fan Speed Range (rpm): [400, 4900]
Fan PWM Range (%): [0, 100]
Current Temp (C): 82.0
Critical Temp (C): 89.0
Current VddGFX (mV): 987
Vddc Range: ['800mV', '1200mV']
Current Loading (%): 98
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D0500300-101
Current SCLK P-State: 4
Current SCLK: 1312Mhz
SCLK Range: ['852MHz', '2400MHz']
Current MCLK P-State: 3
Current MCLK: 800Mhz
MCLK Range: ['167MHz', '1500MHz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto

UUID: 63ab777f0500438580d783b9978c6b8a
Device ID: {'vendor': '0x1002', 'device': '0x66af', 'subsystem_vendor': '0x1002', 'subsystem_device': '0x081e'}
Decoded Device ID: Vega 20 [Radeon VII]
Card Model: Vega 20 (rev c1)
Short Card Model: Vega 20 (rev c1)
Display Card Model: Vega 20 (rev c1)
Card Number: 0
Card Path: /sys/class/drm/card0/device/
PCIe ID: 03:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card0/device/hwmon/hwmon1/
Current Power (W): 191.0
Power Cap (W): 250.0
Power Cap Range (W): [0, 250]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Current Fan PWM (%): 74
Current Fan Speed (rpm): 2921
Fan Target Speed (rpm): 2921
Fan Speed Range (rpm): [0, 3850]
Fan PWM Range (%): [0, 100]
Current Temp (C): 108.0
Critical Temp (C): 118.0
Current VddGFX (mV): 1093
Vddc Range: ['', '']
Current Loading (%): 100
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D3600200-105
Current SCLK P-State: -1
Current SCLK:
SCLK Range: ['808Mhz', '2200Mhz']
Current MCLK P-State: -1
Current MCLK:
MCLK Range: ['351Mhz', '1200Mhz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto

Ricks-Lab commented 5 years ago

I suspect that the NIC you have installed has more than one hwmon, resulting in the numbers being out of sync. This is perfectly legal, and it was my code that lacked the robustness to deal with it. Thanks for the help in getting it resolved!

Have you tried rocm-smi? It would be interesting to see if the Radeon VII card has the same problem with the AMD software.

Ricks-Lab commented 5 years ago

I was able to get access to a Radeon VII card and found that the output of some of the driver files is quite different from older GPUs. It looks like there are significant differences in the way P-states and PPM modes are handled. Here are examples of the 2 relevant files:

pp_od_clk_voltage

OD_SCLK:
0:        806Mhz
1:       1736Mhz
OD_MCLK:
1:       1000Mhz
OD_VDDC_CURVE:
0:        806Mhz        734mV
1:       1271Mhz        821mV
2:       1736Mhz       1066mV
OD_RANGE:
SCLK:     806Mhz       2000Mhz
MCLK:     168Mhz       1200Mhz
VDDC_CURVE_SCLK[0]:     806Mhz       2000Mhz
VDDC_CURVE_VOLT[0]:     738mV        1218mV
VDDC_CURVE_SCLK[1]:     806Mhz       2000Mhz
VDDC_CURVE_VOLT[1]:     738mV        1218mV
VDDC_CURVE_SCLK[2]:     806Mhz       2000Mhz
VDDC_CURVE_VOLT[2]:     738mV        1218mV
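The pp_od_clk_voltage layout shown above (section headers ending in a colon, followed by indexed or named entries) can be read into a nested dictionary. This is a minimal sketch under the assumption that the file always follows the layout in this listing; `parse_od_clk_voltage` and `SAMPLE` are hypothetical names, and the sample is an excerpt of the output above.

```python
# Excerpt of the Radeon VII pp_od_clk_voltage output quoted in this thread.
SAMPLE = """\
OD_SCLK:
0:        806Mhz
1:       1736Mhz
OD_MCLK:
1:       1000Mhz
OD_VDDC_CURVE:
0:        806Mhz        734mV
1:       1271Mhz        821mV
2:       1736Mhz       1066mV
OD_RANGE:
SCLK:     806Mhz       2000Mhz
MCLK:     168Mhz       1200Mhz
VDDC_CURVE_SCLK[0]:     806Mhz       2000Mhz
VDDC_CURVE_VOLT[0]:     738mV        1218mV
"""


def parse_od_clk_voltage(text):
    """Return {section: {entry_key: [values...]}} for the layout above."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, rest = line.partition(":")
        if not rest.strip():
            # A line such as "OD_SCLK:" has nothing after the colon: new section.
            current = key.strip()
            sections[current] = {}
        elif current is not None:
            # A line such as "0: 806Mhz 734mV": entry key plus its values.
            sections[current][key.strip()] = rest.split()
    return sections


if __name__ == "__main__":
    for section, entries in parse_od_clk_voltage(SAMPLE).items():
        print(section, entries)
```

Keeping the values as strings with their units leaves the unit handling (Mhz, mV) to the caller, which seems safer while the format is still being worked out.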

pp_power_profile_mode

PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS UseRlcBusy MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
 0 3D_FULL_SCREEN :
                    0(       GFXCLK)       0       1       2       0       4     800 4587520  -65536       0
                    1(       SOCCLK)       0       1       4     850       4     800  327680  -65536       0
                    2(         UCLK)       0       1       4     850       4     800  327680  -65536       0
                    3(         FCLK)       0       1       4     850       4     800  327680  -65536       0
 1   POWER_SAVING :
                    0(       GFXCLK)       0       0       1       0       3       0 5898240  -65536       0
                    1(       SOCCLK)       0       0       1       0       3       0 1310720   -6553       0
                    2(         UCLK)       0       0       1       0       3       0 1966080  -65536       0
                    3(         FCLK)       0       0       0       0       3     800 1966080   -6553       0
 2          VIDEO*:
                    0(       GFXCLK)       0       1       1       0       4     500 4587520   -6553       0
                    1(       SOCCLK)       0       0       1       0       4     500 1310720   -6553       0
                    2(         UCLK)       0       0       1       0       4     500 1966080  -65536       0
                    3(         FCLK)       0       0       3       0       4     500 1966080   -6553       0
 3             VR :
                    0(       GFXCLK)       0       1       0    1540       4     800 5898240   -6553   65536
                    1(       SOCCLK)       0       1       2       0       4     800  327680  -32768  -65536
                    2(         UCLK)       0       1       2       0       4     800  327680  -32768  -65536
                    3(         FCLK)       0       1       2       0       4     800  327680  -32768  -65536
 4        COMPUTE :
                    0(       GFXCLK)       0       1       0    1600       3       0 3932160  -65536  -65536
                    1(       SOCCLK)       0       0       4     850       3       0  327680  -65536  -32768
                    2(         UCLK)       0       0       4     850       3       0  327680  -65536  -32768
                    3(         FCLK)       0       0       4     850       3       0  327680  -65536  -32768
 5         CUSTOM :
                    0(       GFXCLK)       0       0       1       0       4     800 4587520  -65536       0
                    1(       SOCCLK)       0       0       1       0       4     800  327680   -6553       0
                    2(         UCLK)       0       0       1       0       4     800  327680  -65536       0
                    3(         FCLK)       0       0       0       0       4     800  327680   -6553       0
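In the pp_power_profile_mode table above, the active profile is the one flagged with an asterisk (VIDEO here, matching the "2-VIDEO" reported by amdgpu-ls). A minimal sketch for picking it out follows; the function name `active_profile` is hypothetical, and the regular expression assumes profile lines look like the ones in this listing.

```python
import re

# Excerpt of the profile header lines from the table quoted above.
SAMPLE = """\
 0 3D_FULL_SCREEN :
                    0(       GFXCLK)       0       1       2       0       4     800 4587520  -65536       0
 1   POWER_SAVING :
 2          VIDEO*:
                    0(       GFXCLK)       0       1       1       0       4     500 4587520   -6553       0
 3             VR :
"""

# Profile index, profile name, then the '*' that flags the active profile.
ACTIVE_RE = re.compile(r"\s*(\d+)\s+(\w+)\s*\*")


def active_profile(text):
    """Return (index, name) of the profile flagged with '*', or None."""
    for line in text.splitlines():
        match = ACTIVE_RE.match(line)
        if match:
            return int(match.group(1)), match.group(2)
    return None


if __name__ == "__main__":
    print(active_profile(SAMPLE))
```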

It will take some time for me to figure out how Freq vs. Voltage works, so in the meantime, I plan to classify the 2 different variations and limit the functionality of amdgpu-pac to basic parameters for the new type.

Ricks-Lab commented 5 years ago

@jf3player The latest on master has basic functionality for Radeon VII. You can control power cap, fan speed, and ppm mode. Let me know if you get a chance to try it out.

jf3player commented 5 years ago

Sorry I was away for a while, but I'm back up and running S@H again. I've removed the wireless card (didn't really need it) so the 56 and VII are the only two add-on cards in the system now. I also did a clean Ubuntu install after the hardware change. I have not tried rocm-smi. PAC does run now! However, core voltage changes don't seem to take for the Vega 56. I've tried changing state 7 to 1150mV (saving and applying under sudo appears to work), but the monitor utility still shows 1200mV. Upon relaunching PAC, it will still say 1150mV for state 7 though. I've also tried setting state 6 as the highest, but that didn't appear to work either. Thank you!

Ricks-Lab commented 5 years ago

Actually, it’s a feature related to powerplay that causes max voltage to be used in the highest pstate during high loading. The way I avoid this is to use the pstate masking feature so that the GPU doesn’t go into pstate 7. Then redefine pstate 6. Another issue is that the card being used for display will often start ignoring the pstate mask that was set. I don’t have a solution for this yet.

With the latest version, having the extra card in the system should not be a problem.
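The p-state masking workaround described above can be sketched as follows. This is an illustration, not the amdgpu-pac implementation: it assumes the amdgpu sysfs interface in which power_dpm_force_performance_level is set to manual and the allowed state indices are written to pp_dpm_sclk as a space-separated list. The function names are hypothetical, and writing these files requires root.

```python
import os


def sclk_mask(num_states, excluded):
    """Return the allowed p-state indices as a space-separated string."""
    return " ".join(str(i) for i in range(num_states) if i not in excluded)


def apply_sclk_mask(card_path, mask):
    """Write the mask to sysfs (assumed interface; needs root privileges)."""
    level_file = os.path.join(card_path, "power_dpm_force_performance_level")
    with open(level_file, "w") as f:
        f.write("manual")
    with open(os.path.join(card_path, "pp_dpm_sclk"), "w") as f:
        f.write(mask)


if __name__ == "__main__":
    # Vega 56 in this thread: 8 SCLK states (0-7); keep the GPU out of state 7.
    print(sclk_mask(8, {7}))
```

With state 7 masked out, state 6 becomes the highest reachable state and can then be redefined with the desired clock and voltage.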

Ricks-Lab commented 5 years ago

@jf3player I have completed the implementation of amdgpu-pac for Radeon VII. A release candidate is on the master branch now, and I will officially release it as soon as testing is complete. I hope you can give it a try and let me know what you think. I suggest not modifying the Vddc curve and instead modifying the Sclk and Mclk curve end points.

Ricks-Lab commented 5 years ago

Released working version compatible with Radeon VII. v2.4.0