csecht closed this issue 4 years ago
That's really strange. It looks like os.path.join is not working properly in this case:
file_path = os.path.join(self.prm.card_path, 'pp_od_clk_voltage')
Can you check the contents of the /sys/class/drm/card1/device
directory to see if there is anything strange?
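For reference, a minimal sketch (not the project's actual code) of why a missing path separator points at string concatenation rather than os.path.join — the symptom matches the "devicepp_od_clk_voltage" error string reported in this thread:

```python
import os.path

card_path = "/sys/class/drm/card1/device"

# Naive concatenation drops the separator...
bad = card_path + "pp_od_clk_voltage"
# ...while os.path.join inserts it.
good = os.path.join(card_path, "pp_od_clk_voltage")

print(bad)    # /sys/class/drm/card1/devicepp_od_clk_voltage
print(good)   # /sys/class/drm/card1/device/pp_od_clk_voltage
```

One os.path.join gotcha worth knowing: if the second argument starts with a "/", everything before it is discarded.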
I found the cause of the perceived path.join issue. I fixed it, but it's not the root cause of your problem. I added a debug statement which should give more insight into the p-state read error. Please try the latest. Are this error and the fan speed read issue new to 20.10? Your output doesn't show any fan read error, so I'm not sure what the issue is. Maybe the next step is to run with --debug.
Yes, the latest Master doesn't show the error msg. I had been running 20.10 with RX 570s without issue. The fan speed readout problem started when I replaced them with my new RX 5600 XT (which amdgpu-utils identifies as a RX 5700/RX 5700 XT). These are the contents of the device directory:
$ ls /sys/class/drm/card1/device/
aer_dev_correctable i2c-3 pcie_bw rescan
aer_dev_fatal i2c-5 pcie_replay_count reset
aer_dev_nonfatal i2c-7 power resource
ari_enabled i2c-9 power_dpm_force_performance_level resource0
boot_vga irq power_dpm_state resource0_wc
broken_parity_status local_cpulist pp_cur_state resource2
class local_cpus pp_dpm_dcefclk resource2_wc
config max_link_speed pp_dpm_fclk resource4
consistent_dma_mask_bits max_link_width pp_dpm_mclk resource5
current_link_speed mem_busy_percent pp_dpm_pcie revision
current_link_width mem_info_gtt_total pp_dpm_sclk rom
d3cold_allowed mem_info_gtt_used pp_dpm_socclk subsystem
device mem_info_vis_vram_total pp_features subsystem_device
dma_mask_bits mem_info_vis_vram_used pp_force_state subsystem_vendor
driver mem_info_vram_total pp_mclk_od uevent
driver_override mem_info_vram_used pp_num_states usbc_pd_fw
drm mem_info_vram_vendor pp_od_clk_voltage vbios_version
enable modalias pp_power_profile_mode vendor
fw_version msi_bus pp_sclk_od
gpu_busy_percent msi_irqs pp_table
hwmon numa_node remove
and of hwmon3:
$ ls /sys/class/drm/card1/device/hwmon/hwmon3/
device freq1_input name pwm1 temp1_crit_hyst temp2_emergency temp3_input
fan1_enable freq1_label power pwm1_enable temp1_emergency temp2_input temp3_label
fan1_input freq2_input power1_average pwm1_max temp1_input temp2_label uevent
fan1_max freq2_label power1_cap pwm1_min temp1_label temp3_crit
fan1_min in0_input power1_cap_max subsystem temp2_crit temp3_crit_hyst
fan1_target in0_label power1_cap_min temp1_crit temp2_crit_hyst temp3_emergency
File contents: pwm1_enable = 1; fan1_enable = 1; fan1_input = 0; fan1_target = 0. Possibly related: the card's fans were not running initially; they only started running when I manually set them with amdgpu-pac. With the RX 570, the fans would automatically run to maintain a constant 74 C GPU temp without any amdgpu-pac configuration.
Probably unrelated, but I see from amdgpu-ls that RX 5600 in PCIe slot 1 on the motherboard is listed as PCIe ID: 03:00.0. The card is the only PCIe device on the mobo. When a RX 570 was in that first slot, it was listed as PCIe ID: 01:00.0, as expected.
I'll try running --debug.
I have solved the problem, maybe. While entering "reset" for fan speed in the PAC window yesterday did not change anything, today it did and current fan speed is now read by monitor and pac. Prior to this fix, I ran --debug for monitor and pac and everything related to fan parameters was valid.
@csecht Looks like there may not be a bug to fix here, but I made some major changes to replace the debug statements with a debug logger. Now a log file is produced when the --debug option is used, which can be used for future debugging. Let me know if you see any other issues with the latest on master.
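The debug-statements-to-logger change described above can be sketched roughly like this, assuming argparse-style option handling (a hedged illustration, not the project's actual code):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true")
# An empty argv list here so the sketch runs standalone; a real tool
# would call parse_args() with no arguments.
args = parser.parse_args([])

logger = logging.getLogger("gpu-utils")
if args.debug:
    # --debug given: write timestamped records to a log file.
    handler = logging.FileHandler("debug.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
else:
    # No --debug: discard debug records silently.
    logger.addHandler(logging.NullHandler())

logger.debug("p-state read from: %s", "/sys/class/drm/card1/device/pp_od_clk_voltage")
```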
The debug logger works well. (I tested -ls, -pac, -monitor) A separate issue I found for this Navi card is that I'm not getting mV readings for the VDDC_CURVE in amdgpu-ls --pstates or amdgpu-pac (below).
$ ./amdgpu-ls --pstates
Ubuntu: Validated
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only
Card Number: 1
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5700 / 5700 XT] (rev ca)
Card Path: /sys/class/drm/card1/device
GPU Frequency/Voltage Control Type: 2
SCLK: MCLK:
0: 300Mhz 0: 100Mhz
1: 1040Mhz 1: 500Mhz
2: 1780Mhz 2: 625Mhz
SCLK: MCLK:
0: 800Mhz -
1: 1780Mhz - 1: 875MHz -
VDDC_CURVE:
0: ['800MHz', '@']
1: ['1290MHz', '@']
2: ['1780MHz', '@']
Oh, and the other issue is the card's display name: in -ls it is "Navi 10 [Radeon RX 5700 / 5700 XT]", and in -monitor only "Navi 10 [Radeon" fits in the terminal window, whereas my E@H account page correctly lists it as "Radeon RX 5600 XT".
Can you post the results of cat /sys/class/drm/card1/device/pp_od_clk_voltage?
For the card name issue, have you executed sudo update-pciids?
@csecht
I have updated the latest on master to include the contents of the pp_od_clk_voltage file in the debug logger. My concern is that AMD may have changed how what I am calling type 2 controlled cards work since the Radeon VII.
Ah, yes, I forgot about update-pciids. I downloaded the Master from yesterday. The card model now lists as Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]. The narrow display window of -monitor, however, still shows only "Navi 10 [Radeon", which is fine for me b/c it's the only card in there at the moment. The full name shows in the PAC window.
Here is pp_od_clk_voltage; these values are for the 'Quiet' BIOS on my dual-BIOS Sapphire Pulse card, which are a bit different from the values with the default BIOS:
$ cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_SCLK:
0: 800Mhz
1: 1780Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 800MHz @ 707mV
1: 1290MHz @ 750mV
2: 1780MHz @ 959mV
OD_RANGE:
SCLK: 800Mhz 1820Mhz
MCLK: 625Mhz 930Mhz
VDDC_CURVE_SCLK[0]: 800Mhz 1820Mhz
VDDC_CURVE_VOLT[0]: 800mV 1050mV
VDDC_CURVE_SCLK[1]: 800Mhz 1820Mhz
VDDC_CURVE_VOLT[1]: 800mV 1050mV
VDDC_CURVE_SCLK[2]: 800Mhz 1820Mhz
VDDC_CURVE_VOLT[2]: 800mV 1050mV
But -pac and --pstates, as I posted above, display @ instead of mV values.
The @ sign in the VDDC_CURVE values is new; the output format differs from the Radeon VII's. I have made a minor change to deal with it and pushed it to master. Let me know if it resolves the problem.
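One plausible way to parse the Navi-style "freq @ voltage" lines quoted above is a regex over each OD_VDDC_CURVE entry; this is a hedged sketch under that assumption, and the project's actual fix may differ (a naive whitespace split of "0: 800MHz @ 707mV" would pick up "@" where the voltage was expected, matching the symptom):

```python
import re

def parse_vddc_curve(line):
    """Parse a line like '0: 800MHz @ 707mV' into (index, freq_MHz, volt_mV)."""
    m = re.match(r"\s*(\d+):\s*(\d+)MHz\s*@\s*(\d+)mV", line)
    if m is None:
        return None
    idx, freq, volt = m.groups()
    return int(idx), int(freq), int(volt)

print(parse_vddc_curve("0: 800MHz @ 707mV"))   # (0, 800, 707)
```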
Yes! Problem resolved. Thanks.
I just installed an RX 5600 XT, along with amdgpu-pro 20.1 OpenCL components, to run Einstein@Home. Initially the card's fans stayed off even as the GPU reached 80 C while crunching an E@H task. Running amdgpu-ls gave "Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage" (below). I turned on the fans by running amdgpu-pac to set fan speed to 30%. That worked and brought the temp down to 46 C, but the same error msg remained (below). The practical problem is that Fan Spd in -monitor and in -pac shows as 0%, so I can't tell what the current fan speed is.