Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

No fan speed reading with RX 5600xt #71

Closed csecht closed 4 years ago

csecht commented 4 years ago

I just installed an RX 5600 XT, along with the amdgpu-pro 20.10 OpenCL components, to run Einstein@Home. Initially the card's fans stayed off up to a GPU temp of 80 C while crunching an E@H task. Running amdgpu-ls gave "Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage" (below). I turned on the fans by running amdgpu-pac to set the fan speed to 30%. That worked and brought the temp down to 46 C, but with the same error msg (below). The practical problem is that Fan Spd in -monitor and in -pac shows as 0%, so I can't tell what the current fan speed is.

$ ./amdgpu-ls
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Compute: False
   Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
   PCIe ID: 00:02.0
   Driver: i915
   Card Path: /sys/class/drm/card0/device

Card Number: 1
   Vendor: AMD
   Readable: True
   Writable: True
   Compute: True
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x731f', 'subsystem_vendor': '0x1da2', 'subsystem_device': '0xe411'}
   Decoded Device ID: Navi 10 [Radeon RX 5700 / 5700 XT]
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5700 / 5700 XT] (rev ca)
   Display Card Model: Navi 10 [Radeon RX 5700 / 5700 XT]
   PCIe ID: 03:00.0
      Link Speed: 16 GT/s
      Link Width: 16
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-5E4111U-X4G
   Compute Platform: OpenCL 2.0 AMD-APP (3075.10)
   GPU Frequency/Voltage Control Type: 2
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
   Card Path: /sys/class/drm/card1/device
   ##################################################
   Current Power (W): 79.0
   Power Cap (W): 160.0
      Power Cap Range (W): [0, 192]
   Fan Enable: 1
   Fan PWM Mode: [1, 'Manual']
   Fan Target Speed (rpm): 0
   Current Fan Speed (rpm): 0
   Current Fan PWM (%): 0
      Fan Speed Range (rpm): [0, 3200]
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 62
   Current Memory Loading (%): 30
   Current Temps (C): {'mem': 60.0, 'edge': 46.0, 'junction': 47.0}
      Critical Temp (C): 118.0
   Current Voltages (V): {'vddgfx': 950}
   Current Clk Frequencies (MHz): {'sclk': 1780.0, 'mclk': 875.0}
   Current SCLK P-State: [2, '1780Mhz']
      SCLK Range: ['800Mhz', '1820Mhz']
   Current MCLK P-State: [3, '875Mhz']
      MCLK Range: ['625Mhz', '930Mhz']
   Power Profile Mode: 5-COMPUTE
   Power DPM Force Performance Level: manual
$ ./amdgpu-pac --execute
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-master/pac_writer_99c0cfbf059042e68c31a899838cced5.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '1' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
[sudo] password for craig: 
+ sudo sh -c echo '76' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
PAC execution complete.
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Error: Invalid pstate entry: /sys/class/drm/card1/devicepp_od_clk_voltage
Ricks-Lab commented 4 years ago

That's really strange. It looks like os.path.join is not working properly in this case: file_path = os.path.join(self.prm.card_path, 'pp_od_clk_voltage')
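As a quick sanity check (a minimal sketch, not the gpu-utils code), os.path.join does insert the separator here, while plain string concatenation of the same two pieces produces exactly the path shown in the error message. Whether concatenation is what actually built that message is an assumption.

```python
import os

# Card path as reported by amdgpu-ls in this thread.
card_path = "/sys/class/drm/card1/device"

# os.path.join inserts the separator between the two components.
joined = os.path.join(card_path, "pp_od_clk_voltage")

# Plain concatenation does not, which matches the path in the error message.
concatenated = card_path + "pp_od_clk_voltage"

print(joined)
print(concatenated)
```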

Can you check the contents of the /sys/class/drm/card1/device directory to see if there is anything strange?

Ricks-Lab commented 4 years ago

I found the cause of the perceived path.join issue. I fixed it, but it's not the root cause of your problem. I added a debug statement, which should give more insight into the p-state read error. Please try the latest. Are this error and the fan speed read issue new to 20.10? Your output doesn't show any fan read error, so I'm not sure what the issue is. Maybe the next step is to run with --debug.

csecht commented 4 years ago

Yes, the latest Master doesn't show the error msg. I had been running 20.10 with RX 570s without issue. The fan speed readout problem started when I replaced them with my new RX 5600 XT (which amdgpu-utils is calling an RX 5700/RX 5700 XT). These are the contents of the device directory:

$ ls /sys/class/drm/card1/device/
aer_dev_correctable       i2c-3                    pcie_bw                            rescan
aer_dev_fatal             i2c-5                    pcie_replay_count                  reset
aer_dev_nonfatal          i2c-7                    power                              resource
ari_enabled               i2c-9                    power_dpm_force_performance_level  resource0
boot_vga                  irq                      power_dpm_state                    resource0_wc
broken_parity_status      local_cpulist            pp_cur_state                       resource2
class                     local_cpus               pp_dpm_dcefclk                     resource2_wc
config                    max_link_speed           pp_dpm_fclk                        resource4
consistent_dma_mask_bits  max_link_width           pp_dpm_mclk                        resource5
current_link_speed        mem_busy_percent         pp_dpm_pcie                        revision
current_link_width        mem_info_gtt_total       pp_dpm_sclk                        rom
d3cold_allowed            mem_info_gtt_used        pp_dpm_socclk                      subsystem
device                    mem_info_vis_vram_total  pp_features                        subsystem_device
dma_mask_bits             mem_info_vis_vram_used   pp_force_state                     subsystem_vendor
driver                    mem_info_vram_total      pp_mclk_od                         uevent
driver_override           mem_info_vram_used       pp_num_states                      usbc_pd_fw
drm                       mem_info_vram_vendor     pp_od_clk_voltage                  vbios_version
enable                    modalias                 pp_power_profile_mode              vendor
fw_version                msi_bus                  pp_sclk_od
gpu_busy_percent          msi_irqs                 pp_table
hwmon                     numa_node                remove

and of hwmon3:

$ ls /sys/class/drm/card1/device/hwmon/hwmon3/
device       freq1_input  name            pwm1         temp1_crit_hyst  temp2_emergency  temp3_input
fan1_enable  freq1_label  power           pwm1_enable  temp1_emergency  temp2_input      temp3_label
fan1_input   freq2_input  power1_average  pwm1_max     temp1_input      temp2_label      uevent
fan1_max     freq2_label  power1_cap      pwm1_min     temp1_label      temp3_crit
fan1_min     in0_input    power1_cap_max  subsystem    temp2_crit       temp3_crit_hyst
fan1_target  in0_label    power1_cap_min  temp1_crit   temp2_crit_hyst  temp3_emergency

File contents: pwm1_enable = 1; fan1_enable = 1; fan1_input = 0; fan1_target = 0. It may be related that the card's fans were not running initially; they started running when I manually set them with amdgpu-pac. With the RX 570, the fans would run automatically to maintain a constant 74 C GPU temp, without any amdgpu-pac configuration.
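For anyone who wants to read these values outside the utilities, here is a minimal sketch that collects the fan-related hwmon files listed above. The helper names read_int and fan_status are hypothetical and are not part of gpu-utils; the hwmon3 path is the one from this system and will differ on others.

```python
from pathlib import Path
from typing import Optional


def read_int(path: Path) -> Optional[int]:
    """Return the integer stored in a sysfs file, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def fan_status(hwmon_dir: str) -> dict:
    """Collect the fan-related hwmon values (hypothetical helper, not gpu-utils code)."""
    base = Path(hwmon_dir)
    names = ("pwm1_enable", "pwm1", "pwm1_max",
             "fan1_enable", "fan1_input", "fan1_target")
    status = {name: read_int(base / name) for name in names}
    pwm, pwm_max = status["pwm1"], status["pwm1_max"]
    # Only compute a percentage when both values were readable and pwm1_max is nonzero.
    status["pwm_percent"] = (
        round(100 * pwm / pwm_max) if pwm is not None and pwm_max else None
    )
    return status


if __name__ == "__main__":
    # Path taken from the listing above; adjust for your own card.
    print(fan_status("/sys/class/drm/card1/device/hwmon/hwmon3"))
```

If pwm1_max is 255 (an assumption; its value is not shown in this thread), the 76 that amdgpu-pac wrote to pwm1 works out to about 30%, which matches the fan speed I requested.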

Probably unrelated, but I see from amdgpu-ls that the RX 5600 in PCIe slot 1 on the motherboard is listed as PCIe ID: 03:00.0. The card is the only PCIe device on the mobo. When an RX 570 was in that first slot, it was listed as PCIe ID: 01:00.0, as expected.

I'll try running --debug.

csecht commented 4 years ago

I have solved the problem, maybe. While entering "reset" for fan speed in the PAC window yesterday did not change anything, today it did and current fan speed is now read by monitor and pac. Prior to this fix, I ran --debug for monitor and pac and everything related to fan parameters was valid.

Ricks-Lab commented 4 years ago

@csecht Looks like there may not be a bug to fix here, but I made some major changes to replace the debug statements with a debug logger. A log file is now produced when the --debug option is used, which can help with debugging future problems. Let me know if you see any other issues with the latest on master.
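For readers unfamiliar with the pattern, the general idea can be sketched with Python's standard logging and argparse modules. This is only an illustration: the logger name, log file name, and setup_logger function are hypothetical and do not reflect the actual gpu-utils implementation.

```python
import argparse
import logging


def setup_logger(debug: bool, log_file: str = "debug_example.log") -> logging.Logger:
    """Return a logger that writes DEBUG records to a file only when debug is set."""
    logger = logging.getLogger("gpu-utils-example")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    if debug:
        logger.setLevel(logging.DEBUG)
        handler = logging.FileHandler(log_file, mode="w")
        handler.setFormatter(
            logging.Formatter("%(asctime)s:%(levelname)s:%(message)s"))
        logger.addHandler(handler)
    else:
        # Without --debug, debug records are dropped and no file is created.
        logger.setLevel(logging.WARNING)
        logger.addHandler(logging.NullHandler())
    return logger


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true")
    args, _unknown = parser.parse_known_args()
    log = setup_logger(args.debug)
    log.debug("example debug message")
```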

csecht commented 4 years ago

The debug logger works well. (I tested -ls, -pac, -monitor) A separate issue I found for this Navi card is that I'm not getting mV readings for the VDDC_CURVE in amdgpu-ls --pstates or amdgpu-pac (below).

$ ./amdgpu-ls --pstates
Ubuntu: Validated
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Card Number: 1
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5700 / 5700 XT] (rev ca)
   Card Path: /sys/class/drm/card1/device
   GPU Frequency/Voltage Control Type: 2
   SCLK:                   MCLK:
    0:  300Mhz              0:  100Mhz  
    1:  1040Mhz             1:  500Mhz  
    2:  1780Mhz             2:  625Mhz  
   SCLK:                   MCLK:
    0:  800Mhz    -         
    1:  1780Mhz   -         1:  875MHz    -       
   VDDC_CURVE:
    0: ['800MHz', '@']
    1: ['1290MHz', '@']
    2: ['1780MHz', '@']

[Screenshot attachment: rx5600xt_PAC_25May]

csecht commented 4 years ago

Oh, and the other issue is that the card's display name in -ls is "Navi 10 [Radeon RX 5700 / 5700 XT]" and in -monitor only "Navi 10 [Radeon" fits in the terminal window, whereas in my E@H account page it correctly lists as "Radeon RX 5600 XT".

Ricks-Lab commented 4 years ago

Can you post the results of cat /sys/class/drm/card1/device/pp_od_clk_voltage?

For the card name issue, have you executed sudo update-pciids?

Ricks-Lab commented 4 years ago

@csecht I have updated the latest on master to include the contents of the pp_od_clk_voltage file in the debug logger. My concern is that AMD may have changed how what I am calling type 2 controlled cards work since the Radeon VII.

csecht commented 4 years ago

Ah, yes, I forgot about update-pciids. I downloaded the Master from yesterday. The display card model now lists as Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]. The narrow display window of -monitor, however, still shows only Navi 10 [Radeon, which is fine for me b/c it's the only card in there at the moment. The full name shows in the PAC window.

Here is pp_od_clk_voltage; these values are for the 'Quiet' BIOS on my dual-BIOS Sapphire Pulse card, which are a bit different from values with the default BIOS:

$ cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_SCLK:
0: 800Mhz
1: 1780Mhz
OD_MCLK:
1: 875MHz
OD_VDDC_CURVE:
0: 800MHz @ 707mV
1: 1290MHz @ 750mV
2: 1780MHz @ 959mV
OD_RANGE:
SCLK:     800Mhz       1820Mhz
MCLK:     625Mhz        930Mhz
VDDC_CURVE_SCLK[0]:     800Mhz       1820Mhz
VDDC_CURVE_VOLT[0]:     800mV        1050mV
VDDC_CURVE_SCLK[1]:     800Mhz       1820Mhz
VDDC_CURVE_VOLT[1]:     800mV        1050mV
VDDC_CURVE_SCLK[2]:     800Mhz       1820Mhz
VDDC_CURVE_VOLT[2]:     800mV        1050mV

But -pac and --pstates, as I posted above, display @ instead of mV values.
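One plausible explanation, offered as a sketch and not taken from the gpu-utils source: if a parser splits each curve line on whitespace and expects the voltage to be the field right after the clock, the "@" lands where the voltage should be.

```python
# A VDDC curve line exactly as it appears in the file above.
line = "0: 800MHz @ 707mV"

fields = line.split()
# fields is ['0:', '800MHz', '@', '707mV']

# Hypothetical parser that assumes the layout "<index>: <clock> <voltage>".
naive = [fields[1], fields[2]]
print(naive)  # ['800MHz', '@']
```

That result has the same shape as the VDDC_CURVE entries in the --pstates output above.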

Ricks-Lab commented 4 years ago

The @ sign in the VDDC_CURVE values is new. This is different from the way the output is for Radeon VII. I have made a minor change to deal with it and pushed to master. Let me know if it resolves the problem.
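To illustrate the kind of change involved, here is a minimal sketch of a parser that accepts a curve point with or without the "@". It is not the code pushed to master; the function name parse_curve_point is hypothetical, and the "@"-less layout is an assumption, since the Radeon VII output is not shown in this thread.

```python
import re
from typing import Optional, Tuple

# Matches "0: 800MHz @ 707mV" and the same layout without the "@" (assumed).
CURVE_RE = re.compile(
    r"^\s*(\d+):\s*(\d+)\s*MHz\s*@?\s*(\d+)\s*mV\s*$", re.IGNORECASE
)


def parse_curve_point(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (index, MHz, mV) for a VDDC curve line, or None if it does not match."""
    match = CURVE_RE.match(line)
    if match is None:
        return None
    index, mhz, millivolts = (int(group) for group in match.groups())
    return index, mhz, millivolts


if __name__ == "__main__":
    for text in ("0: 800MHz @ 707mV", "1: 1290MHz @ 750mV", "OD_RANGE:"):
        print(text, "->", parse_curve_point(text))
```

Matching on the units rather than on field position keeps the parser indifferent to the separator, at the cost of rejecting lines that omit units.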

csecht commented 4 years ago

Yes! Problem resolved. Thanks.