Closed jf3player closed 5 years ago
Hi Sean, thanks for opening this issue! I probably don't track the SETI forums as much as I should, but I'm curious about the background for updating GLIBC. What is the purpose and process? I checked the version on my system and it looks to be the same as yours, so I'm wondering why your SETI computer listing shows it as if it had been changed.
Can you check the contents of /sys/class/drm/card1/device/hwmon and /sys/class/drm/card1/device/hwmon/hwmon0/?
The cards are supposed to have dpm enabled by default, but amdgpu.dpm=1 can be specified to enable it explicitly. I doubt this is the problem, since I did not have to enable it on the systems I have worked with.
Also, I just noticed that the revision of my amdgpu drivers is slightly different from yours: 18.50-708488. I will try updating mine to the latest to make sure that it is not the cause.
I just upgraded to 18.50-725072 with no issues.
Can't say that I (knowingly) updated GLIBC. The Ubuntu install was downloaded and installed clean in early February. /sys/class/drm/card1/device/hwmon is empty except for the subfolder hwmon2! hwmon0 does not exist. hwmon2 contains:
subfolders: device, power, and subsystem
files: fan1_enable, fan1_input, fan1_max, fan1_min, fan1_target, in0_input, in0_label, name, power1_average, power1_cap, power1_cap_max, power1_cap_min, pwm1, pwm1_enable, pwm1_max, pwm1_min, temp1_crit, temp1_crit_hyst, temp1_input, uevent
Your system seems to be behaving differently from everything I have read about what should be happening. The rule that I am using is that the hwmonX directory in the hwmon directory should have X equal to the card number, so card1 should have a hwmon1 directory. Is it possible that something went wrong in the driver installation? Can you try uninstalling and re-installing to see if the directory structure changes? Are you installing with the command: sudo ./amdgpu-install -y --opencl=pal
Did you have a third card installed during driver installation and remove it afterward, or something like that?
I know there were some people installing a different glibc for some reason at SETI, so when I saw your OS description including a glibc indicator, I assumed you were running with a custom one. My system doesn't show anything about glibc. But it looks like your glibc is the same as mine, so maybe that isn't an issue.
I just checked rocm-smi and found it is not using the rule that I assumed. I will update my approach tomorrow.
I have made the change in master, so it should find hwmon files in your case. Let me know if this fixes the hwmon read errors.
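For anyone following along, the idea of the fix can be sketched like this: instead of assuming hwmonX has X equal to the card number, look for whatever hwmon* directory actually exists under the card's hwmon folder. This is only a minimal sketch, not the code that went into master; the function name find_hwmon_dir is hypothetical.

```python
import glob
import os


def find_hwmon_dir(card_path):
    """Return the first hwmon* directory under <card_path>/hwmon, or None.

    Hypothetical helper: it makes no assumption about the number in the
    directory name, so card1 with hwmon2 is handled like any other case.
    """
    pattern = os.path.join(card_path, "hwmon", "hwmon*")
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path):
            return path + os.sep
    return None


if __name__ == "__main__":
    # Card path taken from the listing in this thread.
    print(find_hwmon_dir("/sys/class/drm/card1/device/"))
```

On a system without that card path the call simply returns None rather than raising.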
But this will not fix the Radeon VII p-state reading issue. Did you install Ubuntu using a download from this site: https://www.ubuntu.com/download/desktop
Yes, 18.04.2 LTS from that link. And yes, everything looks to be showing now for the Vega 56! You are correct that the VII is missing p-states and clock speeds. The GPU driver was installed with "sudo ./amdgpu-pro-install -y --opencl=pal,legacy"
No problems with the install that I'm aware of and no cards have been added or removed since installing Ubuntu. I do have a PCI wireless NIC in the system. If I have time this weekend, I plan to pull the wireless card out and can try a fresh re-install of Ubuntu and AMD driver.
amdgpu-ls output is now:
./amdgpu-ls
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 18.50-725072
2 AMD GPUs detected 2 are Compatible

Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card0/device/pp_od_clk_voltage
UUID: ec148bdcdc5542ff9195b85bd6a6c62f
Device ID: {'vendor': '0x1002', 'device': '0x687f', 'subsystem_vendor': '0x1002', 'subsystem_device': '0x6b76'}
Decoded Device ID: Vega 10 XL/XT [Radeon RX Vega 56/64]
Card Model: Vega 10 XT [Radeon RX Vega 64] (rev c3)
Short Card Model: RX Vega 64
Display Card Model: RX Vega 64
Card Number: 1
Card Path: /sys/class/drm/card1/device/
PCIe ID: 06:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card1/device/hwmon/hwmon2/
Current Power (W): 166.0
Power Cap (W): 165.0
Power Cap Range (W): [0, 165]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Current Fan PWM (%): 47
Current Fan Speed (rpm): 2390
Fan Target Speed (rpm): 2390
Fan Speed Range (rpm): [400, 4900]
Fan PWM Range (%): [0, 100]
Current Temp (C): 82.0
Critical Temp (C): 89.0
Current VddGFX (mV): 987
Vddc Range: ['800mV', '1200mV']
Current Loading (%): 98
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D0500300-101
Current SCLK P-State: 4
Current SCLK: 1312Mhz
SCLK Range: ['852MHz', '2400MHz']
Current MCLK P-State: 3
Current MCLK: 800Mhz
MCLK Range: ['167MHz', '1500MHz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto

UUID: 63ab777f0500438580d783b9978c6b8a
Device ID: {'vendor': '0x1002', 'device': '0x66af', 'subsystem_vendor': '0x1002', 'subsystem_device': '0x081e'}
Decoded Device ID: Vega 20 [Radeon VII]
Card Model: Vega 20 (rev c1)
Short Card Model: Vega 20 (rev c1)
Display Card Model: Vega 20 (rev c1)
Card Number: 0
Card Path: /sys/class/drm/card0/device/
PCIe ID: 03:00.0
Driver: amdgpu
HWmon: /sys/class/drm/card0/device/hwmon/hwmon1/
Current Power (W): 191.0
Power Cap (W): 250.0
Power Cap Range (W): [0, 250]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Current Fan PWM (%): 74
Current Fan Speed (rpm): 2921
Fan Target Speed (rpm): 2921
Fan Speed Range (rpm): [0, 3850]
Fan PWM Range (%): [0, 100]
Current Temp (C): 108.0
Critical Temp (C): 118.0
Current VddGFX (mV): 1093
Vddc Range: ['', '']
Current Loading (%): 100
Link Speed: 8 GT/s
Link Width: 16
vBIOS Version: 113-D3600200-105
Current SCLK P-State: -1
Current SCLK:
SCLK Range: ['808Mhz', '2200Mhz']
Current MCLK P-State: -1
Current MCLK:
MCLK Range: ['351Mhz', '1200Mhz']
Power Performance Mode: 2-VIDEO
Power Force Performance Level: auto
I suspect that the NIC you have installed has more than one hwmon entry, resulting in the numbers being out of sync. This is perfectly legal, and it was my code that lacked the robustness to deal with it. Thanks for the help in getting it resolved!
Have you tried rocm-smi? It would be interesting to see if the Radeon VII card has the same problem with the AMD software.
I was able to get access to a Radeon VII card and found that the output of some of the driver files is quite different from older GPUs. It looks like there are significant differences in the way P-states and PPM modes are handled. Here are examples of the 2 relevant files:
pp_od_clk_voltage
OD_SCLK:
0: 806Mhz
1: 1736Mhz
OD_MCLK:
1: 1000Mhz
OD_VDDC_CURVE:
0: 806Mhz 734mV
1: 1271Mhz 821mV
2: 1736Mhz 1066mV
OD_RANGE:
SCLK: 806Mhz 2000Mhz
MCLK: 168Mhz 1200Mhz
VDDC_CURVE_SCLK[0]: 806Mhz 2000Mhz
VDDC_CURVE_VOLT[0]: 738mV 1218mV
VDDC_CURVE_SCLK[1]: 806Mhz 2000Mhz
VDDC_CURVE_VOLT[1]: 738mV 1218mV
VDDC_CURVE_SCLK[2]: 806Mhz 2000Mhz
VDDC_CURVE_VOLT[2]: 738mV 1218mV
pp_power_profile_mode
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS UseRlcBusy MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
0 3D_FULL_SCREEN :
0( GFXCLK) 0 1 2 0 4 800 4587520 -65536 0
1( SOCCLK) 0 1 4 850 4 800 327680 -65536 0
2( UCLK) 0 1 4 850 4 800 327680 -65536 0
3( FCLK) 0 1 4 850 4 800 327680 -65536 0
1 POWER_SAVING :
0( GFXCLK) 0 0 1 0 3 0 5898240 -65536 0
1( SOCCLK) 0 0 1 0 3 0 1310720 -6553 0
2( UCLK) 0 0 1 0 3 0 1966080 -65536 0
3( FCLK) 0 0 0 0 3 800 1966080 -6553 0
2 VIDEO*:
0( GFXCLK) 0 1 1 0 4 500 4587520 -6553 0
1( SOCCLK) 0 0 1 0 4 500 1310720 -6553 0
2( UCLK) 0 0 1 0 4 500 1966080 -65536 0
3( FCLK) 0 0 3 0 4 500 1966080 -6553 0
3 VR :
0( GFXCLK) 0 1 0 1540 4 800 5898240 -6553 65536
1( SOCCLK) 0 1 2 0 4 800 327680 -32768 -65536
2( UCLK) 0 1 2 0 4 800 327680 -32768 -65536
3( FCLK) 0 1 2 0 4 800 327680 -32768 -65536
4 COMPUTE :
0( GFXCLK) 0 1 0 1600 3 0 3932160 -65536 -65536
1( SOCCLK) 0 0 4 850 3 0 327680 -65536 -32768
2( UCLK) 0 0 4 850 3 0 327680 -65536 -32768
3( FCLK) 0 0 4 850 3 0 327680 -65536 -32768
5 CUSTOM :
0( GFXCLK) 0 0 1 0 4 800 4587520 -65536 0
1( SOCCLK) 0 0 1 0 4 800 327680 -6553 0
2( UCLK) 0 0 1 0 4 800 327680 -65536 0
3( FCLK) 0 0 0 0 4 800 327680 -6553 0
It will take some time for me to figure out how Freq vs. Voltage works, so in the meantime, I plan to classify the 2 different variations and limit the functionality of amdgpu-pac to basic parameters for the new type.
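To illustrate the classification idea, here is a rough sketch of reading the section-based pp_od_clk_voltage layout shown above and telling the two variations apart. It is not the amdgpu-utils implementation: the function names are hypothetical, and treating the presence of an OD_VDDC_CURVE section as the marker for the new type is my assumption based on the Radeon VII output in this thread.

```python
import re

# Shortened copy of the Radeon VII pp_od_clk_voltage output from this thread.
SAMPLE = """\
OD_SCLK:
0: 806Mhz
1: 1736Mhz
OD_MCLK:
1: 1000Mhz
OD_VDDC_CURVE:
0: 806Mhz 734mV
1: 1271Mhz 821mV
2: 1736Mhz 1066mV
OD_RANGE:
SCLK: 806Mhz 2000Mhz
MCLK: 168Mhz 1200Mhz
VDDC_CURVE_SCLK[0]: 806Mhz 2000Mhz
VDDC_CURVE_VOLT[0]: 738mV 1218mV
"""


def parse_od_clk_voltage(text):
    """Split the file into {section: {key: [values]}} (hypothetical helper)."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if re.fullmatch(r"OD_\w+:", line):
            # A line such as "OD_SCLK:" starts a new section.
            current = line[:-1]
            sections[current] = {}
            continue
        if current is None:
            continue  # ignore anything before the first section header
        key, _, value = line.partition(":")
        sections[current][key.strip()] = value.split()
    return sections


def uses_vddc_curve(sections):
    """Assumed heuristic: the new type exposes an OD_VDDC_CURVE section."""
    return "OD_VDDC_CURVE" in sections


if __name__ == "__main__":
    parsed = parse_od_clk_voltage(SAMPLE)
    print(uses_vddc_curve(parsed), parsed["OD_VDDC_CURVE"])
```

Keeping the values as raw strings (for example "806Mhz") avoids guessing at units until the frequency and voltage handling is understood.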
@jf3player The latest on master has basic functionality for Radeon VII. You can control power cap, fan speed, and ppm mode. Let me know if you get a chance to try it out.
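For reference, the currently selected ppm mode can be picked out of the Radeon VII style pp_power_profile_mode output roughly as below. This is a sketch under assumptions, not the amdgpu-pac code: the function names are hypothetical, it only targets the layout posted earlier in this thread, and it assumes the asterisk after the name marks the active profile (which matches amdgpu-ls reporting 2-VIDEO).

```python
import re

# Shortened copy of the pp_power_profile_mode output posted in this thread.
SAMPLE = """\
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS UseRlcBusy MinActiveFreqType
0 3D_FULL_SCREEN :
0( GFXCLK) 0 1 2 0 4 800 4587520 -65536 0
1 POWER_SAVING :
0( GFXCLK) 0 0 1 0 3 0 5898240 -65536 0
2 VIDEO*:
0( GFXCLK) 0 1 1 0 4 500 4587520 -6553 0
4 COMPUTE :
0( GFXCLK) 0 1 0 1600 3 0 3932160 -65536 -65536
"""

# Profile header lines look like "2 VIDEO*:"; clock rows start with "0( ...".
PROFILE_HEADER = re.compile(r"^(\d+)\s+([A-Z0-9_]+)\s*(\*?)\s*:$")


def list_profiles(text):
    """Return (index, name, is_active) for each profile header line."""
    profiles = []
    for raw in text.splitlines():
        match = PROFILE_HEADER.match(raw.strip())
        if match:
            index, name, star = match.groups()
            profiles.append((int(index), name, star == "*"))
    return profiles


def active_profile(text):
    """Return (index, name) of the starred profile, or None if none is."""
    for index, name, is_active in list_profiles(text):
        if is_active:
            return index, name
    return None


if __name__ == "__main__":
    print(active_profile(SAMPLE))
```

The per-clock rows are skipped on purpose, since only the profile index is needed to select a mode.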
Sorry I was away for a while, but I'm back up and running S@H again. I've removed the wireless card (didn't really need it) so the 56 and VII are the only two add-on cards in the system now. I also did a clean Ubuntu install after the hardware change. I have not tried rocm-smi. PAC does run now! However, core voltage changes don't seem to take for the Vega 56. I've tried changing state 7 to 1150mv (saving and applying under sudo appears to work), but the monitor utility still shows 1200mv. Upon relaunching PAC, it will still say 1150mv for state 7 though. I've also tried setting state 6 as the highest, but that didn't appear to work either. Thank you!
Actually, it’s a feature related to powerplay that causes max voltage to be used in the highest pstate during high loading. The way I avoid this is to use the pstate masking feature so that the GPU doesn’t go into pstate 7. Then redefine pstate 6. Another issue is that the card being used for display will often start ignoring the pstate mask that was set. I don’t have a solution for this yet.
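The masking workaround can be sketched roughly as follows. This is not taken from amdgpu-pac and the helper names are hypothetical. It assumes the amdgpu sysfs files power_dpm_force_performance_level and pp_dpm_sclk in the card's device directory, that the performance level must be set to manual before a mask is accepted, that writing needs root, and that the card has 8 SCLK states (0 to 7).

```python
import os


def build_pstate_mask(num_states, excluded):
    """Return a space-separated list of allowed states, e.g. "0 1 2"."""
    return " ".join(str(s) for s in range(num_states) if s not in excluded)


def apply_sclk_mask(device_dir, mask):
    """Write the mask to the (assumed) amdgpu sysfs files in device_dir."""
    level_file = os.path.join(device_dir, "power_dpm_force_performance_level")
    mask_file = os.path.join(device_dir, "pp_dpm_sclk")
    # Assumption: manual mode must be selected before masking states.
    with open(level_file, "w") as f:
        f.write("manual\n")
    with open(mask_file, "w") as f:
        f.write(mask + "\n")


if __name__ == "__main__":
    # Keep the GPU out of p-state 7 so the redefined state 6 is the highest.
    mask = build_pstate_mask(8, {7})
    print(mask)
    # apply_sclk_mask("/sys/class/drm/card1/device/", mask)  # needs root
```

The write is left commented out so the sketch can be run without touching the card; as noted above, the display card may still start ignoring the mask.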
With the latest version, having the extra card in the system should not be a problem.
@jf3player I have completed the implementation of amdgpu-pac for Radeon VII. A release candidate is on the master branch now, and I will officially release it as soon as testing is complete. I hope you can give it a try and let me know what you think. I suggest not modifying the Vddc curve and instead modifying the Sclk and Mclk curve end points.
Hi Rick. This is Sean from the s@h boards. Card 1 is a Vega 56. Card 0 is a Radeon VII. System is Ubuntu 18.04 LTS, Ubuntu GLIBC 2.27-3ubuntu1. I'll include the output from amdgpu-ls (2.1.0) here as well: