azeam / powerupp

Simple GUI for UPP
GNU General Public License v3.0

5700XT Fan RPM stays at 2000 upon applying undervolt? #20

Open sazaland opened 3 years ago

sazaland commented 3 years ago

I've been using this because, from what I saw, it has better voltage and clock controls than CoreCtl (now uninstalled). However, I noticed that once I apply my settings (clock 1905 MHz, max voltage -100 mV, static offset -20 mV) from PowerUPP, the fan speed slowly rises to 2000 RPM and stays there while idling on the desktop. If I load and apply the default settings, the fan speed returns to the expected idle values (around 800 RPM, varying with actual temps). Aside: I have not checked how the fans behave under load with this.

I find this baffling, since to my knowledge PowerUPP isn't supposed to touch the fan curve. I've fully powered off the system multiple times since removing CoreCtl, so it shouldn't be in play anymore. What could be going on here?

azeam commented 3 years ago

Sounds odd, and you are correct: PowerUPP does not alter the fan curve in any way. Are the temperatures normal, or does the card heat up after setting those values? What distro and kernel version are you using?

sazaland commented 3 years ago

No, the temperatures go down, as you'd expect from underclocking (mine is an ASRock Challenger, factory OC'd to 2100 by default), undervolting, and increasing the fan RPM. My idle temps are more like 44-45C instead of 60C.

Distro is Slackware-current, kernel is 5.4.81

sibradzic commented 3 years ago

This looks like a bug in the card's SMU, or in the way it interprets the new PowerPlay table's clocks together with the existing fan settings. Can you share the output of `upp dump | grep -i fan` (assuming you have the upp package installed)? You could try adjusting `smc_pptable/FanStartTemp`, `smc_pptable/FanTargetTemperature` or `smc_pptable/FanTargetGfxclk`...

azeam commented 3 years ago

I think I've seen something similar reported elsewhere a long time ago, but I never looked into it (either the problem "disappeared" or the user didn't respond, I can't remember). Check `upp dump | grep -i fan`, as sibradzic wrote, before and after setting the values and see if there are any differences in the output. Is it possible to narrow down which of the settings (clock, max voltage, static offset) causes this, or does any change to the PowerPlay table trigger it?
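The before/after comparison suggested here can be scripted. What follows is a minimal sketch, not part of PowerUPP or upp: the function names `fan_fields` and `diff_fan_fields` and the file names in the usage note are made up for illustration. It assumes each `upp dump` has been saved to a text file and only compares the lines that mention "fan".

```shell
#!/bin/sh
# Hypothetical helper: keep only the fan-related lines of a saved dump,
# strip leading/trailing whitespace, and sort them for a stable comparison.
fan_fields() {
    grep -i 'fan' "$1" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | sort
}

# Hypothetical helper: print the fan fields that differ between two saved
# dumps. Prints nothing and returns 0 when the fan fields are identical.
diff_fan_fields() {
    before=$(mktemp) || return 2
    after=$(mktemp) || { rm -f "$before"; return 2; }
    fan_fields "$1" > "$before"
    fan_fields "$2" > "$after"
    status=0
    diff "$before" "$after" || status=$?
    rm -f "$before" "$after"
    return "$status"
}
```

Possible usage: run `upp dump > default.txt` with the default table loaded, apply the custom settings, run `upp dump > tuned.txt`, then call `diff_fan_fields default.txt tuned.txt`. Empty output means the fan section of the table did not change.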

sazaland commented 3 years ago

OK, I pulled them. The values captured by that command are identical with the default settings and with my personal PowerUPP settings applied. This is what the output looks like in both cases.

root@slackbox:~# upp dump | grep -i fan
  FanStopTemp: 50
  FanStartTemp: 60
  FanGainEdge: 400
  FanGainHotspot: 400
  FanGainLiquid0: 400
  FanGainLiquid1: 400
  FanGainVrGfx: 400
  FanGainVrSoc: 400
  FanGainVrMem0: 400
  FanGainVrMem1: 400
  FanGainPlx: 400
  FanGainMem: 400
  FanPwmMin: 35
  FanAcousticLimitRpm: 2000
  FanThrottlingRpm: 2200
  FanMaximumRpm: 3200
  FanTargetTemperature: 85
  FanTargetGfxclk: 800
  FanTempInputSelect: 1
  FanPadding: 0
  FanZeroRpmEnable: 1
  FanTachEdgePerRev: 2
  FuzzyFan_ErrorSetDelta: 0
  FuzzyFan_ErrorRateSetDelta: 0
  FuzzyFan_PwmSetDelta: 0
  FuzzyFan_Reserved: 0
  MGpuFanBoostLimitRpm: 0

Since they're identical I'm not sure what to say.

azeam commented 3 years ago

I found the other report mentioned above. (S)he was using a reference 5700 XT, Ubuntu 20.04, kernel 5.4.0-29-generic and firmware from March 19 (17:37), and also experienced fan ramp-up when lowering the Gfx clock. I fail to see the common denominator. I'm not really sure where to start looking, but you could try a more recent kernel/firmware. As a workaround, you could try to limit the card only by power/voltage/offset instead of adjusting the Gfx clock (if the issues are identical), but it would be interesting to learn what causes this.

My card (MSI Gaming X 5700XT) has slightly different fan limits (they vary by card model, see below), but I tried changing all the 1000s to 400, as you have, just to see what would happen, and I was not able to reproduce the issue.


FanStartTemp: 80
FanGainEdge: 400
FanGainHotspot: 1000
FanGainLiquid0: 400
FanGainLiquid1: 400
FanGainVrGfx: 1000
FanGainVrSoc: 1000
FanGainVrMem0: 1000
FanGainVrMem1: 1000
FanGainPlx: 1000
FanGainMem: 1000
FanPwmMin: 20
FanAcousticLimitRpm: 1200
FanThrottlingRpm: 2000
FanMaximumRpm: 2970
FanTargetTemperature: 87
FanTargetGfxclk: 800
FanTempInputSelect: 1
FanPadding: 0
FanZeroRpmEnable: 1
FanTachEdgePerRev: 2
FuzzyFan_ErrorSetDelta: 0
FuzzyFan_ErrorRateSetDelta: 0
FuzzyFan_PwmSetDelta: 0
FuzzyFan_Reserved: 0
MGpuFanBoostLimitRpm: 0

sazaland commented 3 years ago

OK, it gets weirder. I have been on a single boot for a while without powering off or rebooting, suspending the machine at night and waking it in the morning. What I found was that the fan curve now reflected the default settings, while my underclock and undervolt were still in use after waking (I confirmed this by using the Load Active function in PowerUPP to ensure it hadn't returned to default values after sleep/wake).

To investigate this a bit, I tried loading and applying the default settings, and the fan speed steadily rose to 2k RPM and stayed there, now with the default clocks and voltages. It seems the 2k RPM fan speed has something to do with a change having been made since the system awoke, not with the specific settings in use, and suspending or otherwise disengaging and re-engaging the GPU restores the original fan curve regardless of the other settings active in the PowerPlay tables.

I don't know if or how this would be relevant, but on Slackware sudo is installed but not configured by default, and I'm not using it. I'm applying PowerUPP's changes using the root password via PolicyKit's "su -c" style functionality, I get prompted graphically just like you would with sudo but provide the root password to PolicyKit instead of my own.
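To put numbers on the behaviour described in this comment, the fan state can be read from the amdgpu hwmon files before and after applying a table, and again after a suspend/resume cycle. The sketch below is only an illustration: `fan_state` is a made-up name, and the hwmon path in the usage note is an assumption, since the card index and hwmon number differ between systems. It reads `fan1_input` (RPM) and `pwm1_enable` (per the amdgpu hwmon documentation, 0 = no fan speed control, 1 = manual, 2 = automatic).

```shell
#!/bin/sh
# Hypothetical helper: print "rpm=<fan1_input> mode=<pwm1_enable>" for the
# hwmon directory given as the first argument.
fan_state() {
    dir=$1
    [ -r "$dir/fan1_input" ] || { echo "no fan1_input in $dir" >&2; return 1; }
    rpm=$(cat "$dir/fan1_input")
    mode=unknown
    # pwm1_enable may be missing or unreadable; report "unknown" in that case.
    [ -r "$dir/pwm1_enable" ] && mode=$(cat "$dir/pwm1_enable")
    echo "rpm=$rpm mode=$mode"
}
```

Possible usage, assuming the GPU is card0: `fan_state /sys/class/drm/card0/device/hwmon/hwmon*`, run once before applying settings, once after, and once after resume. A change in `mode` between those runs would hint that the fan control method was switched rather than the curve itself being edited, though the thread does not establish that this is what happens.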

azeam commented 3 years ago

Can you see any errors or other strange things in dmesg (`dmesg | grep amdgpu`)? As long as the settings are actually getting applied (which they seem to be, if the changes are reflected when you re-load the active settings after applying them), the sudo/polkit setup should not matter.

sazaland commented 3 years ago

Not sure if this is relevant; 11298.xxxxx is where I first apply the settings from PowerUPP:

root@slackbox:~# dmesg | grep amdgpu

[ 7.666267] [drm] amdgpu kernel modesetting enabled.

[ 7.673116] amdgpu 0000:09:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff

[ 7.673667] amdgpu 0000:09:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff

[ 7.674208] amdgpu 0000:09:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfc900000 -> 0xfc97ffff

[ 7.674742] fb0: switching to amdgpudrmfb from EFI VGA

[ 7.675332] amdgpu 0000:09:00.0: vgaarb: deactivate vga console

[ 7.702608] amdgpu 0000:09:00.0: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)

[ 7.702610] amdgpu 0000:09:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF

[ 7.702682] [drm] amdgpu: 8176M of VRAM memory ready

[ 7.702684] [drm] amdgpu: 8176M of GTT memory ready.

[ 8.379142] amdgpu: [powerplay] smu driver if version = 0x00000033, smu fw if version = 0x00000037, smu fw version = 0x002a3d00 (42.61.0)

[ 8.379147] amdgpu: [powerplay] SMU driver if version not matched

[ 8.382927] amdgpu: [powerplay] SMU is initialized successfully!

[ 8.396634] snd_hda_intel 0000:09:00.1: bound 0000:09:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])

[ 8.481411] fbcon: amdgpudrmfb (fb0) is primary device

[ 8.584900] amdgpu 0000:09:00.0: fb0: amdgpudrmfb frame buffer device

[ 8.591137] amdgpu 0000:09:00.0: ring 0(gfx_0.0.0) uses VM inv eng 4 on hub 0

[ 8.591200] amdgpu 0000:09:00.0: ring 1(gfx_0.1.0) uses VM inv eng 5 on hub 0

[ 8.591223] amdgpu 0000:09:00.0: ring 2(comp_1.0.0) uses VM inv eng 6 on hub 0

[ 8.591246] amdgpu 0000:09:00.0: ring 3(comp_1.1.0) uses VM inv eng 7 on hub 0

[ 8.591271] amdgpu 0000:09:00.0: ring 4(comp_1.2.0) uses VM inv eng 8 on hub 0

[ 8.591294] amdgpu 0000:09:00.0: ring 5(comp_1.3.0) uses VM inv eng 9 on hub 0

[ 8.591317] amdgpu 0000:09:00.0: ring 6(comp_1.0.1) uses VM inv eng 10 on hub 0

[ 8.591341] amdgpu 0000:09:00.0: ring 7(comp_1.1.1) uses VM inv eng 11 on hub 0

[ 8.591364] amdgpu 0000:09:00.0: ring 8(comp_1.2.1) uses VM inv eng 12 on hub 0

[ 8.591387] amdgpu 0000:09:00.0: ring 9(comp_1.3.1) uses VM inv eng 13 on hub 0

[ 8.591411] amdgpu 0000:09:00.0: ring 10(kiq_2.1.0) uses VM inv eng 14 on hub 0

[ 8.591436] amdgpu 0000:09:00.0: ring 11(sdma0) uses VM inv eng 15 on hub 0

[ 8.591469] amdgpu 0000:09:00.0: ring 12(sdma1) uses VM inv eng 16 on hub 0

[ 8.591502] amdgpu 0000:09:00.0: ring 13(vcn_dec) uses VM inv eng 4 on hub 1

[ 8.591536] amdgpu 0000:09:00.0: ring 14(vcn_enc0) uses VM inv eng 5 on hub 1

[ 8.591570] amdgpu 0000:09:00.0: ring 15(vcn_enc1) uses VM inv eng 6 on hub 1

[ 8.591604] amdgpu 0000:09:00.0: ring 16(vcn_jpeg) uses VM inv eng 7 on hub 1

[ 8.591778] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:09:00.0 on minor 0

[ 1067.130096] WARNING: CPU: 0 PID: 1385 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2926 dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 1067.130097] Modules linked in: fuse cfg80211 8021q garp mrp stp llc efivarfs ipv6 nls_iso8859_1 nls_cp437 vfat fat hid_apple snd_usb_audio snd_usbmidi_lib snd_rawmidi xpad snd_seq_device evdev joydev mc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio amdgpu snd_hda_codec_hdmi amd_iommu_v2 gpu_sched ttm snd_hda_intel snd_intel_nhlt drm_kms_helper snd_hda_codec snd_hda_core drm snd_hwdep kvm r8169 snd_pcm agpgart realtek irqbypass i2c_algo_bit snd_timer fb_sys_fops crct10dif_pclmul crc32_pclmul snd syscopyarea sysfillrect ghash_clmulni_intel wmi_bmof sysimgblt soundcore k10temp i2c_piix4 libphy ccp thermal gpio_amdpt gpio_generic button acpi_cpufreq loop jfs hid_multitouch hid_microsoft hid_lenovo hid_logitech_hidpp hid_logitech_dj hid_logitech hid_cherry hid_asus asus_wmi battery sparse_keymap rfkill wmi video hwmon hid_generic i2c_hid i2c_core usbhid hid uhci_hcd ohci_pci ehci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd

[ 1067.130210] RIP: 0010:dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 1067.130296] dc_validate_global_state+0x29a/0x320 [amdgpu]

[ 1067.130379] amdgpu_dm_atomic_check+0x602/0x890 [amdgpu]

[ 1067.130562] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]

[ 5225.481621] WARNING: CPU: 11 PID: 1385 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2926 dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 5225.481621] Modules linked in: fuse cfg80211 8021q garp mrp stp llc efivarfs ipv6 nls_iso8859_1 nls_cp437 vfat fat hid_apple snd_usb_audio snd_usbmidi_lib snd_rawmidi xpad snd_seq_device evdev joydev mc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio amdgpu snd_hda_codec_hdmi amd_iommu_v2 gpu_sched ttm snd_hda_intel snd_intel_nhlt drm_kms_helper snd_hda_codec snd_hda_core drm snd_hwdep kvm r8169 snd_pcm agpgart realtek irqbypass i2c_algo_bit snd_timer fb_sys_fops crct10dif_pclmul crc32_pclmul snd syscopyarea sysfillrect ghash_clmulni_intel wmi_bmof sysimgblt soundcore k10temp i2c_piix4 libphy ccp thermal gpio_amdpt gpio_generic button acpi_cpufreq loop jfs hid_multitouch hid_microsoft hid_lenovo hid_logitech_hidpp hid_logitech_dj hid_logitech hid_cherry hid_asus asus_wmi battery sparse_keymap rfkill wmi video hwmon hid_generic i2c_hid i2c_core usbhid hid uhci_hcd ohci_pci ehci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd

[ 5225.481683] RIP: 0010:dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 5225.481728] dc_validate_global_state+0x29a/0x320 [amdgpu]

[ 5225.481768] amdgpu_dm_atomic_check+0x602/0x890 [amdgpu]

[ 5225.481863] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]

[ 8592.552453] WARNING: CPU: 10 PID: 1385 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2926 dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 8592.552454] Modules linked in: fuse cfg80211 8021q garp mrp stp llc efivarfs ipv6 nls_iso8859_1 nls_cp437 vfat fat hid_apple snd_usb_audio snd_usbmidi_lib snd_rawmidi xpad snd_seq_device evdev joydev mc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio amdgpu snd_hda_codec_hdmi amd_iommu_v2 gpu_sched ttm snd_hda_intel snd_intel_nhlt drm_kms_helper snd_hda_codec snd_hda_core drm snd_hwdep kvm r8169 snd_pcm agpgart realtek irqbypass i2c_algo_bit snd_timer fb_sys_fops crct10dif_pclmul crc32_pclmul snd syscopyarea sysfillrect ghash_clmulni_intel wmi_bmof sysimgblt soundcore k10temp i2c_piix4 libphy ccp thermal gpio_amdpt gpio_generic button acpi_cpufreq loop jfs hid_multitouch hid_microsoft hid_lenovo hid_logitech_hidpp hid_logitech_dj hid_logitech hid_cherry hid_asus asus_wmi battery sparse_keymap rfkill wmi video hwmon hid_generic i2c_hid i2c_core usbhid hid uhci_hcd ohci_pci ehci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd

[ 8592.552566] RIP: 0010:dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 8592.552652] dc_validate_global_state+0x29a/0x320 [amdgpu]

[ 8592.552735] amdgpu_dm_atomic_check+0x602/0x890 [amdgpu]

[ 8592.552919] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]

[ 9722.665382] WARNING: CPU: 8 PID: 1385 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2926 dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 9722.665383] Modules linked in: fuse cfg80211 8021q garp mrp stp llc efivarfs ipv6 nls_iso8859_1 nls_cp437 vfat fat hid_apple snd_usb_audio snd_usbmidi_lib snd_rawmidi xpad snd_seq_device evdev joydev mc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio amdgpu snd_hda_codec_hdmi amd_iommu_v2 gpu_sched ttm snd_hda_intel snd_intel_nhlt drm_kms_helper snd_hda_codec snd_hda_core drm snd_hwdep kvm r8169 snd_pcm agpgart realtek irqbypass i2c_algo_bit snd_timer fb_sys_fops crct10dif_pclmul crc32_pclmul snd syscopyarea sysfillrect ghash_clmulni_intel wmi_bmof sysimgblt soundcore k10temp i2c_piix4 libphy ccp thermal gpio_amdpt gpio_generic button acpi_cpufreq loop jfs hid_multitouch hid_microsoft hid_lenovo hid_logitech_hidpp hid_logitech_dj hid_logitech hid_cherry hid_asus asus_wmi battery sparse_keymap rfkill wmi video hwmon hid_generic i2c_hid i2c_core usbhid hid uhci_hcd ohci_pci ehci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd

[ 9722.665511] RIP: 0010:dcn20_validate_bandwidth+0xb1/0xd0 [amdgpu]

[ 9722.665611] dc_validate_global_state+0x29a/0x320 [amdgpu]

[ 9722.665708] amdgpu_dm_atomic_check+0x602/0x890 [amdgpu]

[ 9722.665922] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]

[11298.840961] amdgpu: [powerplay] smu driver if version = 0x00000033, smu fw if version = 0x00000037, smu fw version = 0x002a3d00 (42.61.0)

[11298.840961] amdgpu: [powerplay] SMU driver if version not matched

[11298.841969] amdgpu: [powerplay] SMU is initialized successfully!
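The log above reports different SMU interface versions for the driver (0x00000033) and the firmware (0x00000037), both at boot and again when the table is applied. Whether that mismatch is related to the fan behaviour is not established in this thread. The following sketch, with the made-up function name `smu_if_check`, only pulls those two values out of dmesg text on stdin so they can be compared again after a kernel or firmware update.

```shell
#!/bin/sh
# Hypothetical helper: read dmesg text on stdin and report the driver and
# firmware SMU interface versions from the last matching log line.
smu_if_check() {
    line=$(awk '/smu driver if version/ { last = $0 } END { if (last != "") print last }')
    [ -n "$line" ] || { echo "no SMU version line found"; return 1; }
    drv=$(printf '%s\n' "$line" | sed -n 's/.*smu driver if version = \(0x[0-9a-fA-F]*\).*/\1/p')
    fw=$(printf '%s\n' "$line" | sed -n 's/.*smu fw if version = \(0x[0-9a-fA-F]*\).*/\1/p')
    if [ "$drv" = "$fw" ]; then
        echo "match driver=$drv fw=$fw"
    else
        echo "mismatch driver=$drv fw=$fw"
    fi
}
```

Possible usage: `dmesg | smu_if_check`. The pattern is taken from the log lines posted here; other kernel versions may word the message differently, in which case the function reports that no line was found.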

sibradzic commented 3 years ago

[11298.840961] amdgpu: [powerplay] smu driver if version = 0x00000033, smu fw if version = 0x00000037, smu fw version = 0x002a3d00 (42.61.0)

I suggest you try and see if the issue happens against the latest kernel and the latest firmware:

git clone --depth=1 git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
sudo cp -ar /usr/lib/firmware/amdgpu /usr/lib/firmware/_amdgpu
sudo cp linux-firmware/amdgpu/* /usr/lib/firmware/amdgpu/
sudo update-initramfs -k all -u
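The last command, `update-initramfs`, is a Debian/Ubuntu tool, and the reporter is on Slackware, where it is probably not available. As a rough sketch, the made-up function `initramfs_tool` below checks which initramfs generator is on the PATH. The candidate list is an assumption covering common distributions (`dracut` on Fedora-style systems, `mkinitcpio` on Arch, `mkinitrd` on Slackware), and the exact rebuild invocation for each tool is left to that distribution's documentation.

```shell
#!/bin/sh
# Hypothetical helper: print the first initramfs tool found on PATH,
# or "none" (with a non-zero return) if no candidate is installed.
initramfs_tool() {
    for tool in update-initramfs dracut mkinitcpio mkinitrd; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "none"
    return 1
}
```

Possible usage: call `initramfs_tool` after copying the firmware files. If it prints `none`, or if the system boots without an initramfs and loads amdgpu from the root filesystem, a reboot alone may be enough for the new firmware to be picked up.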